Web scraping, also known as data extraction, is the process of retrieving, or "scraping", data from a website. Developers usually do this with HTTP requests such as GET, often through an API.
Below we’re going to go over how to get data from my social tracker blog, mb4.in.
Our first example can be done from the command line. Fire up your terminal, and try a cURL:
curl -X GET \
'https://mb4.in/?rest_route=/wp/v2/posts&per_page=1'
Your results should look something like this:
[
{
"id": 14085,
"date": "2020-04-08T23:38:11",
"date_gmt": "2020-04-09T03:38:11",
"guid": {
"rendered": "https://mb4.in/?p=14085"
},
"modified": "2020-04-08T23:38:11",
"modified_gmt": "2020-04-09T03:38:11",
"slug": "daewon-songs-greatest-manuals-at-the-berrics",
"status": "publish",
"type": "post",
"link": "https://mb4.in/daewon-songs-greatest-manuals-at-the-berrics/",
"title": {
"rendered": "Daewon Song’s Greatest Manuals At The Berrics"
},
"content": {
"rendered": "<div class=\"jetpack-video-wrapper\"><span class=\"embed-youtube\" style=\"text-align:center; display: block;\"><iframe class='youtube-player' type='text/html' width='640' height='360' src='https://www.youtube.com/embed/dzfdKQO8_Vc?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent' allowfullscreen='true' style='border:0;'></iframe></span></div>\n",
"protected": false
},
"excerpt": {
"rendered": "",
"protected": false
},
"author": 1,
"featured_media": 14087,
"comment_status": "closed",
"ping_status": "closed",
"sticky": false,
"template": "",
"format": "standard",
"meta": {
"spay_email": ""
},
"categories": [
1
],
"tags": [
46
],
"jetpack_featured_media_url": "https://mb4.in/wp-content/uploads/2020/04/dzfdKQO8_Vc.jpg",
"_links": {
"self": [
{
"href": "https://mb4.in/wp-json/wp/v2/posts/14085"
}
],
"collection": [
{
"href": "https://mb4.in/wp-json/wp/v2/posts"
}
],
"about": [
{
"href": "https://mb4.in/wp-json/wp/v2/types/post"
}
],
"author": [
{
"embeddable": true,
"href": "https://mb4.in/wp-json/wp/v2/users/1"
}
],
"replies": [
{
"embeddable": true,
"href": "https://mb4.in/wp-json/wp/v2/comments?post=14085"
}
],
"version-history": [
{
"count": 1,
"href": "https://mb4.in/wp-json/wp/v2/posts/14085/revisions"
}
],
"predecessor-version": [
{
"id": 14086,
"href": "https://mb4.in/wp-json/wp/v2/posts/14085/revisions/14086"
}
],
"wp:featuredmedia": [
{
"embeddable": true,
"href": "https://mb4.in/wp-json/wp/v2/media/14087"
}
],
"wp:attachment": [
{
"href": "https://mb4.in/wp-json/wp/v2/media?parent=14085"
}
],
"wp:term": [
{
"taxonomy": "category",
"embeddable": true,
"href": "https://mb4.in/wp-json/wp/v2/categories?post=14085"
},
{
"taxonomy": "post_tag",
"embeddable": true,
"href": "https://mb4.in/wp-json/wp/v2/tags?post=14085"
}
],
"curies": [
{
"name": "wp",
"href": "https://api.w.org/{rel}",
"templated": true
}
]
}
}
]
This is the standard response from a WordPress website. It contains the entire post: the title, content, URL, taxonomies and references to other data.
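To make that structure a bit more concrete, here is a minimal Python sketch that pulls a few fields out of a response like the one above. It works on a trimmed-down copy of the sample JSON rather than a live request, and `summarize_posts` is a hypothetical helper name of my own, not something WordPress provides.

```python
import json

# Trimmed-down copy of the response shown above; only a few fields are kept.
SAMPLE_RESPONSE = """
[
  {
    "id": 14085,
    "date": "2020-04-08T23:38:11",
    "link": "https://mb4.in/daewon-songs-greatest-manuals-at-the-berrics/",
    "title": {"rendered": "Daewon Song’s Greatest Manuals At The Berrics"},
    "categories": [1],
    "tags": [46]
  }
]
"""


def summarize_posts(raw_json):
    """Return a small dict per post (hypothetical helper, for illustration only)."""
    posts = json.loads(raw_json)  # the endpoint returns a JSON array of posts
    return [
        {
            "id": post["id"],
            "title": post["title"]["rendered"],  # the title is nested under "rendered"
            "link": post["link"],
            "tags": post["tags"],  # tag IDs, not tag names
        }
        for post in posts
    ]


for summary in summarize_posts(SAMPLE_RESPONSE):
    print(summary["id"], summary["title"], summary["link"])
```

Note that `categories` and `tags` are only numeric IDs. The names live behind the `wp:term` links in `_links`.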
I personally like to use wget when making command-line requests. Here’s how that’d look (more human readable):
wget --quiet \
--method GET \
--header 'Cache-Control: no-cache' \
--output-document - \
'https://mb4.in/?rest_route=/wp/v2/posts&per_page=1'
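Both commands hit the same URL, which is just the site root plus two query parameters: `rest_route` picks the endpoint and `per_page` limits how many posts come back. As a rough sketch, this is how that URL could be assembled in Python with the standard library. `build_posts_url` is a hypothetical helper name used only for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://mb4.in/"


def build_posts_url(per_page=1):
    """Build the posts URL used throughout this post (hypothetical helper)."""
    query = {"rest_route": "/wp/v2/posts", "per_page": per_page}
    # safe="/" keeps the slashes in rest_route readable instead of encoding them as %2F
    return BASE_URL + "?" + urlencode(query, safe="/")


url = build_posts_url()
print(url)

# Round-trip: split the query string back into its parameters
print(parse_qs(urlsplit(url).query))
```

Changing `per_page` is then just `build_posts_url(5)`.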
We’ll start with PHP. There are a few different ways to GET data from a remote URL, in this instance https://mb4.in/?rest_route=/wp/v2/posts&per_page=1. This will return the latest post from my blog.
First, let’s see how it’s done with PHP cURL:
<?php
$curl = curl_init();
curl_setopt_array($curl, array(
CURLOPT_URL => "https://mb4.in/?rest_route=/wp/v2/posts&per_page=1",
CURLOPT_RETURNTRANSFER => true,
CURLOPT_ENCODING => "",
CURLOPT_MAXREDIRS => 10,
CURLOPT_TIMEOUT => 30,
CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
CURLOPT_CUSTOMREQUEST => "GET",
CURLOPT_HTTPHEADER => array(
"Cache-Control: no-cache"
),
));
$response = curl_exec($curl);
$err = curl_error($curl);
curl_close($curl);
if ($err) {
echo "cURL Error #:" . $err;
} else {
echo $response;
}
If you place this into a PHP file and run a local server, say with php -S localhost:9000, then hit that URL in your browser, you should see the response.
The next example uses the pecl_http library:
<?php
$client = new http\Client;
$request = new http\Client\Request;
$request->setRequestUrl('https://mb4.in/');
$request->setRequestMethod('GET');
$request->setQuery(new http\QueryString(array(
'rest_route' => '/wp/v2/posts',
'per_page' => '1'
)));
$request->setHeaders(array(
'Cache-Control' => 'no-cache'
));
$client->enqueue($request)->send();
$response = $client->getResponse();
echo $response->getBody();
And lastly, HttpRequest (the older pecl_http 1.x API):
<?php
$request = new HttpRequest();
$request->setUrl('https://mb4.in/');
$request->setMethod(HTTP_METH_GET);
$request->setQueryData(array(
'rest_route' => '/wp/v2/posts',
'per_page' => '1'
));
$request->setHeaders(array(
'Cache-Control' => 'no-cache'
));
try {
$response = $request->send();
echo $response->getBody();
} catch (HttpException $ex) {
echo $ex;
}
I love how simple and clean code is in Python. Here’s how to make a request using the default http.client library in Python 3:
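import http.client
# The site is served over HTTPS, so use HTTPSConnection with the bare hostname
conn = http.client.HTTPSConnection("mb4.in")
headers = {
    'Cache-Control': "no-cache"
}
# The path carries the same query string as the earlier examples
conn.request("GET", "/?rest_route=/wp/v2/posts&per_page=1", headers=headers)
res = conn.getresponse()
data = res.read()
print(data.decode("utf-8"))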
And if you want things even simpler, give Requests a spin:
import requests
url = "https://mb4.in/"
querystring = {"rest_route":"/wp/v2/posts","per_page":"1"}
headers = {
'Cache-Control': "no-cache"
}
response = requests.request("GET", url, headers=headers, params=querystring)
print(response.text)
For JavaScript in the browser (not including Node), there are two major methods. These examples will output their results to the console. First is AJAX using jQuery:
var settings = {
"async": true,
"crossDomain": true,
"url": "https://mb4.in/?rest_route=/wp/v2/posts&per_page=1",
"method": "GET",
"headers": {
"Cache-Control": "no-cache"
}
}
$.ajax(settings).done(function (response) {
console.log(response);
});
This second example works independently of jQuery:
var data = null;
var xhr = new XMLHttpRequest();
xhr.withCredentials = true;
xhr.addEventListener("readystatechange", function () {
if (this.readyState === 4) {
console.log(this.responseText);
}
});
xhr.open("GET", "https://mb4.in/?rest_route=/wp/v2/posts&per_page=1");
xhr.setRequestHeader("Cache-Control", "no-cache");
xhr.send(data);
Node isn’t too far off from how it’s done in browser JavaScript, for obvious reasons. Let’s start with the native Node method first:
var http = require("https");
var options = {
"method": "GET",
"hostname": "mb4.in",
"path": "/?rest_route=/wp/v2/posts&per_page=1",
"headers": {
"Cache-Control": "no-cache"
}
};
var req = http.request(options, function (res) {
var chunks = [];
res.on("data", function (chunk) {
chunks.push(chunk);
});
res.on("end", function () {
var body = Buffer.concat(chunks);
console.log(body.toString());
});
});
req.end();
It’s far cleaner and simpler to just use the request library:
var request = require("request");
var options = { method: 'GET',
url: 'https://mb4.in/',
qs: { rest_route: '/wp/v2/posts', per_page: '1' },
headers:
{ 'Cache-Control': 'no-cache' } };
request(options, function (error, response, body) {
if (error) throw new Error(error);
console.log(body);
});
Next is Ruby:
require 'uri'
require 'net/http'
url = URI("https://mb4.in/?rest_route=/wp/v2/posts&per_page=1")
http = Net::HTTP.new(url.host, url.port)
http.use_ssl = true # required for https:// URLs
request = Net::HTTP::Get.new(url)
request["Cache-Control"] = 'no-cache'
response = http.request(request)
puts response.read_body
I’m not an iOS developer, but I thought I’d also include an Objective-C version, mostly to show how wild it looks compared to the other languages.
#import <Foundation/Foundation.h>
NSDictionary *headers = @{ @"Cache-Control": @"no-cache" };
NSMutableURLRequest *request = [NSMutableURLRequest requestWithURL:[NSURL URLWithString:@"https://mb4.in/?rest_route=/wp/v2/posts&per_page=1"]
cachePolicy:NSURLRequestUseProtocolCachePolicy
timeoutInterval:10.0];
[request setHTTPMethod:@"GET"];
[request setAllHTTPHeaderFields:headers];
NSURLSession *session = [NSURLSession sharedSession];
NSURLSessionDataTask *dataTask = [session dataTaskWithRequest:request
completionHandler:^(NSData *data, NSURLResponse *response, NSError *error) {
if (error) {
NSLog(@"%@", error);
} else {
NSHTTPURLResponse *httpResponse = (NSHTTPURLResponse *) response;
NSLog(@"%@", httpResponse);
}
}];
[dataTask resume];