From node, I'd like to get all image urls (src attribute from img tags) from an external web page.

I started by considering PhantomJS, but I didn't like that it isn't really integrated into Node (i.e., it runs in an external process).

Next, I tried the request module with cheerio. This works great, except that I have to deal with relative image URLs, e.g.:

<img src='http://example.com/i.jpg'>
<img src='/i.jpg'>
<img src='i.jpg'>
<img src='../images/i.jpg'>

I can deal with that, but I'm wondering if there's an easier way?

three-cups
  • I imagine request + cheerio is probably the easiest way. You could also use jQuery + jsdom instead. – Andbdrew Jun 06 '13 at 21:19
  • Could these relative-to-absolute methods help you out? http://stackoverflow.com/questions/7544550/javascript-regex-to-change-all-relative-urls-to-absolute – Vegar Jun 06 '13 at 21:38
  • Looks like Node's [url](http://nodejs.org/docs/latest/api/url.html) module may do the trick here. – three-cups Jun 06 '13 at 22:08
  • A simple option is a headless browser; with Node.js, try Puppeteer. – Truong Nguyen Sep 10 '19 at 13:27

1 Answer

I ended up using the request module along with cheerio and Node's built-in url module. Here's what I did (please note, this is MVP code, not production quality):

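var request = require('request');
var cheerio = require('cheerio');
var url = require('url');

// app is assumed to be an existing Express application
app.get('/scrape-images', function(req, res) {
  request(req.query.url, function (error, response, body) {
    if (error || response.statusCode !== 200) {
      // Report the failure instead of leaving the client waiting
      return res.status(502).send('Could not fetch ' + req.query.url);
    }
    var $ = cheerio.load(body);

    // .get() converts the cheerio collection into a plain array
    res.send($('img').map(function(i, e) {
      var src = $(e).attr('src');
      // Skip img tags without a src; url.resolve leaves absolute URLs unchanged
      return src ? url.resolve(req.query.url, src) : null;
    }).get());
  });
});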
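
For reference, the relative-to-absolute step can also be tried on its own, without request or cheerio. The following is only a minimal sketch, assuming a Node version where the WHATWG URL class is available as a global (Node 10 or later). The function name resolveImageUrls and the page URL http://example.com/gallery/page.html are made up for illustration and are not part of the code above.

```javascript
// Sketch only: resolveImageUrls is a hypothetical helper, not part of the
// route above. It resolves each src value against the URL of the page it
// was found on, using the WHATWG URL class built into Node.
function resolveImageUrls(pageUrl, srcs) {
  return srcs
    // Drop missing or empty src attributes
    .filter(function (src) { return typeof src === 'string' && src.length > 0; })
    .map(function (src) {
      try {
        // Absolute URLs pass through; relative ones are resolved against pageUrl
        return new URL(src, pageUrl).href;
      } catch (e) {
        return null; // src could not be parsed as a URL
      }
    })
    .filter(function (href) { return href !== null; });
}

// The four src forms from the question, resolved against an assumed page URL
console.log(resolveImageUrls('http://example.com/gallery/page.html', [
  'http://example.com/i.jpg',
  '/i.jpg',
  'i.jpg',
  '../images/i.jpg'
]));
```

The array passed to resolveImageUrls could be the result of the cheerio map(...).get() call, with the raw src attribute returned from the callback instead of a resolved URL.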
three-cups