1

I'm new to JavaScript and I'm having trouble with promises. I'm using cloudscraper to retrieve a webpage's html to scrape data from. I have a simple function - getData() - which calls cloudscraper.get() and passes the html to the extract() function, which is responsible for scraping data. This is the working code:

const getData = function(pageUrl) {
  var data;
  return cloudscraper.get(pageUrl)
    .then(function(html) {
      data = extract(html);
      return data;  
    })
    .catch(function(err) {
      // handle error
    })
}

The "data" object returned contains an array of URLs I want to connect to, in order to retrieve other information. That information has to be stored in the same data object. So I want to call cloudscraper.get() method again for each URL contained in the array. I've tried the code below:

const getData = function(pageUrl) {
  var data;
  // first cloudscraper call:
  // retrieve main html
  return cloudscraper.get(pageUrl)
    .then(function(html) {
      // scrape data from it
      data = extract(html);
      for (let i = 0; i < data.array.length; ++i) {
        // for each URL scraped, call cloudscraper
        // to retrieve other data
        return cloudscraper.get(data.array[i])
          .then(function(newHtml) {
            // get other data with cheerio
            // and stores it in the same array
            data.array[i] = getNewData(newHtml);
          })
          .catch(function(err) {
            // handle error
          }) 
        }
        return data;  
      })
    .catch(function(err) {
      // handle error
    })
}

but it doesn't work, because the data object is returned before the promises in the loop are resolved. I know that probably there is a simple solution, but I couldn't figure it out, so could you please help me? Thanks in advance.

Fabio Nardelli
  • 135
  • 2
  • 11
  • 3
    You are `return`ing from the middle of the loop, which doesn't work. Collect multiple promises into an array and use `Promise.all`. – Bergi Jul 29 '19 at 19:20
  • Check my answer at https://stackoverflow.com/questions/57209251/database-query-in-loop-returns-only-an-empty-array/57209669#57209669 – Bill Cheng Jul 29 '19 at 19:23
  • Possible duplicate of [Correct way to write loops for promise.](https://stackoverflow.com/questions/24660096/correct-way-to-write-loops-for-promise) – jmmygoggle Jul 29 '19 at 19:25
  • 1
    If you want to use a for loop, it will be better to use **async/await**, this way you can avoid the **.then** promise chain. – Hosar Jul 29 '19 at 19:31
  • Thank you all for the quick response. Could you please give me an example? I read the answers but I don't know how to adapt them to my case. – Fabio Nardelli Jul 29 '19 at 19:54
  • `Promise.all` would be much more efficient and would take a fraction of the time. – mwilson Jul 29 '19 at 20:29

2 Answers2

1

The best way to avoid these kinds of problems is to use async/await, as suggested in the comments. Here's an example based on your code:

const getData = async function(pageUrl) {
  var data;
  // first cloudscraper call:
  // retrieve main html
  try {
    const html = await cloudscraper.get(pageUrl);
    // scrape data from it
    data = extract(html);
    for (let i = 0; i < data.array.length; ++i) {
      // for each URL scraped, call cloudscraper
      // to retrieve other data
      const newHtml = await cloudscraper.get(data.array[i]);
      // get other data with cheerio
      // and stores it in the same array
      data.array[i] = getNewData(newHtml); // if getNewData is also async, you need to add await
    }
  } catch (error) {
    // handle error
  }
  return data;
}
// You can call getData with .then().catch() outside of async functions 
// and with await inside async functions
Tomasz Kasperczyk
  • 1,991
  • 3
  • 22
  • 43
1

This can be significantly simplified by using Promise.all, and await/async

If my understanding is correct, you are trying to execute the below steps:

  1. Get original HTML
  2. Extract some HTML (looks like you're after some more urls)
  3. For each url extracted, you want to re-call cloudscraper
  4. Put the results of each call back into the original data object.

const getData = async (pageUrl) => {
    const html = await cloudscraper.get(pageUrl);
    const data = extractHtml(html);
    const promises = data.array.map( d => cloudscraper.get(d));
    const results = await Promise.all(promises);
    // If you wanted to map the results back into the originaly data object
    data.array.forEach( (a, idx) => a = results[idx] );
    return data;
};
mwilson
  • 12,295
  • 7
  • 55
  • 95
  • Thanks. The code doesn't work out of the box because I didn't mention (for the sake of simplicity) that `array` is actually an array of objects, each with a `url` attribute, not an array of URLs directly. So I suppose I have to change the inner cloudscraper call to `cloudscraper.get(d.url)` – Fabio Nardelli Jul 29 '19 at 21:03
  • Yea, you have to specify that in the question in order for me to know that. But yes, as you've learned, you can just do `d.url` (or whatever property you're after). Please mark as answer or upvote if this helped. – mwilson Jul 29 '19 at 21:05
  • It doesn't work, maybe I have to change something also in the forEach statement, haven't I? – Fabio Nardelli Jul 29 '19 at 21:14
  • Yes, the `url` fields don't get updated. Technically it's not an error, tough. – Fabio Nardelli Jul 29 '19 at 21:17
  • Ah. I see a typo. Fixed it. `promises[idx]` should have been `results[idx]`. Please remember to upvote – mwilson Jul 29 '19 at 21:20
  • Thank you again for your help. Would a regular for loop be faster than the forEach? – Fabio Nardelli Jul 29 '19 at 22:04
  • You can use whatever iterator you need to. I just did forEach because I didn't need anything special. It really doesn't matter. The point is to just iterate over the results and set them in the other object. Indexes are guaranteed to match up, so you can use whichever method you are most familiar with. – mwilson Jul 29 '19 at 22:10