Scraping Nodejs

Question

I want to scrape the page "https://www.ukr.net/ua/news/sport.html" with Node.js. I'm trying to make a basic GET request with the 'request' npm module. Here is an example:

const inspect = require('eyespect').inspector();
const request = require('request');
const url = 'https://www.ukr.net/news/dat/sport/2/';
const options = {
    method: 'get',
    json: true,
    url: url
};

request(options,  (err, res, body) => {
    if (err) {
        inspect(err, 'error posting json');
        return
    }
    const headers = res.headers;
    const statusCode = res.statusCode;
    inspect(headers, 'headers');
    inspect(statusCode, 'statusCode');
    inspect(body, 'body');
});

But in the response body I only get:

body: '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 
Transitional//EN">\n<html>\n<head>\n<META HTTP-EQUIV="expires" 
CONTENT="Wed, 26 Feb 1997 08:21:57 GMT">\n<META HTTP-EQUIV=Refresh
CONTENT="10">\n<meta HTTP-EQUIV="Content-type" CONTENT="text/html; 
charset=utf-8">\n<title>www.ukr.net</title>\n</head>\n<body>\n
Идет загрузка, подождите .....\n</body>\n</html>'

If I make the same GET request from Postman, I get exactly what I need.

Please help me guys.

`Идет загрузка, подождите .....` means `loading, please wait....`. The page you are trying to scrape has elements that are loaded dynamically, so your initial request comes back with the "loading" message instead. Maybe you could use something like PhantomJS to render the page for you? See http://stackoverflow.com/a/31059035/459517. Postman is probably doing something like this automatically. - Robbie, Feb 04 '17 at 19:20

score 1 · Accepted Answer · answered Feb 05 '17 at 09:01

You might have been blocked by bot protection. This can be checked with curl:

curl -vL https://www.ukr.net/news/dat/sport/2/

curl seems to get the result, and if curl works, then something is probably missing from the request sent by Node. A solution could be to mimic a browser of your choice.

For example, a Chrome-like request taken from the browser's developer tools gives the following options for the request:

const options = {
    method: 'get',
    json: true,
    url: url,
    gzip: true,
    headers: {
        "Host": "www.ukr.net",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch, br",
        "Accept-Language": "en-US,en;q=0.8"
    }
};
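To reuse these headers for other URLs, and to check whether a response is still the "loading" placeholder from the question, a minimal sketch could look like the one below. The names browserLikeOptions and looksLikePlaceholder are made up for illustration and are not part of the request module. The meta-refresh check is only a heuristic based on the response body shown in the question. The sketch leaves out json: true because the page is HTML, and it sets no Accept-Encoding header, on the assumption that request adds a suitable one itself when gzip is true.

```javascript
// Sketch only: both helpers are hypothetical, not part of the `request` API.

// Build `request` options with Chrome-like headers for any URL.
function browserLikeOptions(url) {
  return {
    method: 'get',
    url: url,
    gzip: true, // let `request` handle compressed responses (assumed behaviour)
    headers: {
      // Host is derived from the URL instead of being hard-coded
      'Host': new URL(url).host,
      'Pragma': 'no-cache',
      'Cache-Control': 'no-cache',
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
      'Accept-Language': 'en-US,en;q=0.8'
    }
  };
}

// Heuristic: the placeholder page in the question contains a meta refresh tag.
// A real page could also contain one, so treat a match as a hint, not proof.
function looksLikePlaceholder(body) {
  return typeof body === 'string' &&
    /<meta\s+http-equiv=["']?refresh/i.test(body);
}
```

Pass browserLikeOptions(url) to request(...) and, inside the callback, treat looksLikePlaceholder(body) returning true as a sign that the headers still need adjusting.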

score 1 · Answer 2 · answered Feb 20 '17 at 23:14

If you have experience with jQuery, there is a library called Cheerio that lets you query the HTML in a similar way. For example:

Markup example we'll be using:

<ul id="fruits">
  <li class="apple">Apple</li>
  <li class="orange">Orange</li>
  <li class="pear">Pear</li>
</ul>

First you need to load in the HTML. This step in jQuery is implicit, since jQuery operates on the one, baked-in DOM. With Cheerio, we need to pass in the HTML document.

var cheerio = require('cheerio');

var $ = cheerio.load('<ul id="fruits">...</ul>');

Selectors

$('ul .pear').attr('class')

You can probably do something like this:

request(options, (err, res, body) => {
  // body is the HTML string returned by the server
  var $ = cheerio.load(body);
});

https://github.com/cheeriojs/cheerio
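If it helps, here is a rough sketch of how the request callback and Cheerio could be wired together with basic error and status checks. scrapeWith and its parameter names are hypothetical. The request function and the HTML loader are passed in as arguments, so the sketch does not assume which packages are installed, and the selector in the commented usage is only a placeholder because the real markup of the target page is not shown here.

```javascript
// Sketch only: `scrapeWith` is a hypothetical helper.
// `requestFn` is expected to behave like the `request` module,
// `load` like `cheerio.load`, and `extract` receives the loaded `$`.
function scrapeWith(requestFn, load, options, extract, done) {
  requestFn(options, (err, res, body) => {
    if (err) {
      return done(err); // network-level failure
    }
    if (res.statusCode !== 200) {
      return done(new Error('Unexpected status code: ' + res.statusCode));
    }
    const $ = load(body); // parse the HTML string from the response
    done(null, extract($));
  });
}

// Possible usage, assuming `request`, `cheerio` and `options` exist:
//
// scrapeWith(request, (html) => cheerio.load(html), options,
//   ($) => $('a').map((i, el) => $(el).attr('href')).get(),
//   (err, links) => {
//     if (err) return console.error(err);
//     console.log(links);
//   });
```

Keeping the extraction step as a separate function means the selectors can change without touching the request and error handling.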
