1

There are several tutorials that describe how to scrape websites with request and cheerio. These tutorials either send the output to the console or stream the response into a file with fs, as seen in the example below.

request(link, function (err, resp, html) {
  if (err) return console.error(err)
  var $ = cheerio.load(html),
      img = $('#img_wrapper').data('src');
  console.log(img);
}).pipe(fs.createWriteStream('img_link.txt'));

But what if I would like to process the output during script execution? How can I access the output or send it back to the calling function? I could, of course, load img_link.txt and get the information from there, but this would be too costly and doesn't make sense.

Alex Maiburg

2 Answers

1

Remove the pipe altogether.

request(link, function (err, resp, html) {
  if (err) return console.error(err);

  var $ = cheerio.load(html);
  var img = $('#img_wrapper').data('src'); // img now holds the data-src value of #img_wrapper

  return img; // returns from this callback only; the value does not reach the code that called request()
});

Update

Judging by your comments, your request function is working as expected, and the problem is rather how to access the data from another module.

I suggest you read "Purpose of Node.js module.exports and how you use it".

This article is also a good resource describing how require and exports work.

  • Put the code above in a module
  • Use the module.exports
  • Require the module in another file
aludvigsen
  • If I do this, I get an `undefined`. – Alex Maiburg Apr 24 '14 at 08:19
  • If you `console.log(img)`, do you get the src attr? – aludvigsen Apr 24 '14 at 08:43
  • Yes, I do. But as I am fairly new to node.js, I am starting to understand what my real problem is: when I build a module with a callback, require it in another module and use it, how would I access the data from the calling module? If I just return the data as in your example, I always get an `undefined`. – Alex Maiburg Apr 24 '14 at 09:15
  • Yes, it looks like your problem is the way `module.exports` works in node.js. I updated my answer with a few resources. – aludvigsen Apr 24 '14 at 09:28
  • Thanks for the information and your patience with a node.js noob! I will check it out. Regards – Alex Maiburg Apr 24 '14 at 09:39
1

You can wrap request in a function that calls back with the HTML:

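var request = require('request');

// fetchHtml is an illustrative name; it hands the page's HTML to `callback`.
function fetchHtml(link, callback) {
  request(link, function (err, resp, body) {
    callback(err, body);
  });
}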

Then assign it to module.exports and use it in any other module.

Eugene Kostrikov
  • I created the module, required it in the calling module and logged the output to the console. But now I have the same problem. I am starting to understand that I have a general lack of understanding of how to access the data within a module call. Returning the data or assigning it to a global variable didn't do the trick! So what is the best way to get the information back into the global scope? – Alex Maiburg Apr 24 '14 at 09:23
  • In general, you should wait for all async tasks to end. With request, for example, it first has to do a round-trip to the URL you are requesting, and only after that trip will you have `body`. If you try to `console.log(body)` before the request has finished (read this as 'outside of the request callback function'), you get `undefined`. Read a tutorial on async programming in Node. It's a very basic concept that is hard to grasp at first but very simple once you understand it. – Eugene Kostrikov Apr 24 '14 at 09:37
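
The timing described in the comment above can be seen in a small sketch. `fetchBody` and its canned HTML are hypothetical, and `setTimeout` stands in for the network round-trip that request would make; the point is only the order in which the two log statements run.

```javascript
// Stand-in for a network call: delivers its result later through a callback.
function fetchBody(link, callback) {
  setTimeout(function () {
    callback(null, '<html>hypothetical body for ' + link + '</html>');
  }, 10);
}

var body; // stays undefined until the callback has fired

fetchBody('http://example.com', function (err, result) {
  if (err) return console.error(err);
  body = result;
  console.log('inside callback:', typeof body); // logs second: string
});

// This line runs before the callback, so the result has not arrived yet.
console.log('outside callback:', typeof body); // logs first: undefined
```

Any code that needs `body` therefore has to run inside the callback, or be called from it.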