
I want to get the body of a web page from a list of more than 1000 URLs (my goal is to then scrape the pages with cheerio). The problem is that I get a weird GUNZIP result and I can't get the content of the body tag. This is the code I'm using (I can't use a simple "request" because it misses some requests):

var async = require('async');
var fetch = require('isomorphic-unfetch');
const cheerio = require('cheerio');

let urls = []; // populated with ~1000 URLs read from a JSON file

async.mapLimit(urls, 1, async function (url) {
  const response = await fetch(url);
  return response.body;
}, (err, results) => {
  if (err) throw err;
  console.log(results);
});
Davide Buoso
  • I think more info is required. What is a weird gunzip result for example? Is this for all, or just one URL? Could it be related to this: [link](https://stackoverflow.com/questions/12148948/how-do-i-ungzip-decompress-a-nodejs-requests-module-gzip-response-body) – P Burke Nov 21 '17 at 15:42

1 Answer


The problem is that I get a weird GUNZIP result

Use zlib to decompress the response:

var async = require('async');
var fetch = require('isomorphic-unfetch');
var zlib = require('zlib');
const { promisify } = require('util');

// promisify zlib.gunzip so it can be awaited inside the async iteratee
const gunzip = promisify(zlib.gunzip);

async.mapLimit(urls, 1, async function (url) {
  const response = await fetch(url);
  // read the raw (gzipped) body as a Buffer first;
  // node-fetch (what isomorphic-unfetch uses in Node) exposes .buffer()
  const buffer = await response.buffer();
  const dezipped = await gunzip(buffer);
  return dezipped.toString();
}, (err, results) => {
  if (err) throw err;
  console.log(results);
});
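Note that the decompression has to be awaited: a value returned from inside a node-style callback never becomes the async iteratee's return value, so `async.mapLimit` would collect `undefined` for every URL. Promisifying `zlib.gunzip` keeps the whole iteratee in async/await style.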

Then proceed with your parsing with cheerio :)
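For example, a minimal sketch of that step (assuming `html` is one of the decompressed strings collected in `results`):

const cheerio = require('cheerio');

const $ = cheerio.load(html);       // parse the decompressed HTML
const bodyText = $('body').text();  // text content of the <body> tag
console.log(bodyText);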
I hope this helps.

Taki