I am having trouble getting an image from a URL using axios. The problem is that axios returns the content before all of the data has loaded on the page. I am already using async/await syntax, but I can't figure out how to make the request wait until all of the data is loaded. For example, when I use cheerio to try to get an img, the result is undefined because of this.

Here is my code for fetching the URL:

async function getUrl(url) {
  const response = await axios.get(url);
  const html = response.data;
  const $ = cheerio.load(html);
  return $;
}

Is there a way to check if all data is loaded?

The page that I am scraping to get the main img is the following: https://www.saatchiart.com/art/Painting-Goat/313699/2073158/view

Thanks

Jotan
  • Looks like the page sets the image with JavaScript. The JavaScript from the page you load does not get executed by axios or cheerio. – Roland Starke Sep 24 '20 at 08:09
  • The image src appears multiple times in the page source you are trying to crawl. You could load it from `$('meta[property="og:image"]').attr('content')`, for example. – Roland Starke Sep 24 '20 at 08:10
  • Wow, it's working perfectly. Thanks! Could you explain a bit how it works this way? So interesting. – Jotan Sep 24 '20 at 08:19
  • My process was to open the link you provided and note that the image URL was "https://images.saatchiart.com/saatchi/313699/art/2263009/1338201-PVYPVGMK-7.jpg". Then I viewed the source of the link with "view-source:https://www.saatchiart.com/art/Painting-Goat/313699/2073158/view" and searched for the image URL. (I found 7 matches, and the match in the meta tag seemed to be the simplest to extract the image URL from.) – Roland Starke Sep 24 '20 at 08:26

1 Answer

When you visit a web page, you're served a plain text HTML document. That HTML document often has links to resources like scripts, images, CSS and so forth, as well as possible inline JavaScript. The browser parses this page, executes the JS and requests resources. This process takes a bit of time.

Axios doesn't have a concept of waiting for any of this stuff. It's just an HTTP request library, so it asks the server for the static HTML text (or some other resource), but unlike a browser, it doesn't parse and render the HTML or execute the JS. It just hands you the same plain text HTML you'd see if you opened view-source: in your browser.

Cheerio also doesn't do anything with JS. It accepts a string of HTML and lets you traverse and manipulate it. That's it!

All that said, it turns out that the image you want is in the static HTML:

const axios = require("axios"); // ^1.2.2
const cheerio = require("cheerio"); // 1.0.0-rc.12

const url = "<your URL>";

axios.get(url).then(({data: html}) => {
  const $ = cheerio.load(html);
  console.log($('meta[property="og:image"]').attr("content"));
});

...but sometimes it's not, and so you may wish to visit "How can I scrape pages with dynamic content using node.js?" for general strategies, such as using Puppeteer to automate the browser.
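
As a rough idea of what the Puppeteer route might look like, here's a sketch under a few assumptions: Puppeteer is installed (npm install puppeteer), and getRenderedImageSrc and its selector parameter are hypothetical names of mine. You'd need to inspect the target page to find a selector that matches the image you want.

```javascript
// Sketch only: `selector` is a hypothetical CSS selector for the target image.
async function getRenderedImageSrc(url, selector) {
  // Loaded lazily, so Puppeteer is only needed when the function is called.
  const puppeteer = require("puppeteer");
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "domcontentloaded" });
    // Wait until an element matching the selector exists in the DOM.
    await page.waitForSelector(selector);
    // Read the src attribute from inside the browser context.
    return await page.$eval(selector, (el) => el.getAttribute("src"));
  } finally {
    // Always close the browser, even if a step above throws.
    await browser.close();
  }
}
```

Passing a selector that includes [src] (for example, something like "img[src]" narrowed to the right element) makes the wait cover the attribute being set, not just the element appearing. This is much slower and heavier than axios plus Cheerio, so it's worth checking the static HTML first.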

ggorlen