1

I'm trying to parse a specification website from saved HTML on my computer. I can post the file upon request.

I'm burnt out trying to figure out why it won't run synchronously. The comments should log the CCCC's first, then BBBB's, then finally one AAAA.

The code I'm running will not wait at the first hurdle (it prints AAAA... first). Am I using request-promise incorrectly? What is going on?

Is this due to the .each() method of cheerio (I'm assuming it's synchronous)?

const rp = require('request-promise');
const fs = require('fs');
const cheerio = require('cheerio');

async function parseAutodeskSpec(contentsHtmlFile) {
  const topics = [];
  const contentsPage = cheerio.load(fs.readFileSync(contentsHtmlFile).toString());
  const contentsSelector = '.content_htmlbody table td div div#divtreed0e338374 nobr .toc_entry a.treeitem';

  contentsPage(contentsSelector).each(async (idx, topicsAnchor) => {
    const topicsHtml = await rp(topicsAnchor.attribs['href']);
    console.log("topicsHtml.length: ", topicsHtml.length);
  });

  console.log("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA");

  return topics;
}
Nick Bull
  • 9,518
  • 6
  • 36
  • 58
  • 1
    I would guess it is because of `.each`, but I'm not completely sure tbh. Have you tried converting the result of `contentsPage(contentsSelector)` into an Array (if it isn't already) and then use the `for .. of ..` syntax? – lumio Nov 06 '18 at 16:24
  • The problem could be this `const topicsPage = cheerio.load(topicsHtml);`. I think it should be something like `const topicsPage = await cheerio.load(topicsHtml);` – Kubwimana Adrien Nov 06 '18 at 16:26
  • @KubwimanaAdrien Makes no difference if there's an `await` keyword or not – Nick Bull Nov 06 '18 at 16:42
  • Looking at the source code for [`.each`](https://github.com/cheeriojs/cheerio/blob/master/lib/api/traversing.js#L298) it looks like it's a synchronous function. There shouldn't be a need for the `await` keyword before calling `.each` – TheF1rstPancake Nov 06 '18 at 16:56
  • @TheF1rstPancake Without `await` it outputs `AAAAA ... \n topicsHtml.length: 10128`. So I have no idea – Nick Bull Nov 06 '18 at 17:00
  • 1
    I think it has to do with the async call occurring in `.each`... see https://stackoverflow.com/questions/37576685/using-async-await-with-a-foreach-loop – brietsparks Nov 06 '18 at 17:07
  • @lumio, you seem to have the right idea! – Nick Bull Nov 07 '18 at 09:20
  • Same with you @bsapaka, thank you both! – Nick Bull Nov 07 '18 at 09:21

3 Answers3

1

As @lumio stated in his comment, I also think that this is because of the each function being synchrone.

You should rather use the map method, and use the Promise.all() on the result to wait enough time:

const obj = contentsPage(contentsSelector).map(async (idx, topicsAnchor) => {
  const topicsHtml = await rp(topicsAnchor.attribs['href']);
  console.log("topicsHtml.length: ", topicsHtml.length);

  const topicsFromPage = await parseAutodeskTopics(topicsHtml)
  console.log("topicsFromPage.length: ", topicsFromPage.length);

  topics.concat(topicsFromPage);
})

const filtered = Object.keys(obj).filter(key => !isNaN(key)).map(key => obj[key])

await Promise.all(filtered)

console.log("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA");
Hammerbot
  • 15,696
  • 9
  • 61
  • 103
  • Nope, throws an error. `Promise.all` takes an array of functions! `map` is synchronous too so I'm not sure I follow why that's important here – Nick Bull Nov 06 '18 at 16:43
  • @NickBull [`Promise.all()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise/all) takes an iterable of _promises_, not an iterable of functions. Since an `async` function always returns a `Promise`, this should in theory work. Maybe you could help by actually elaborating what the error is and which line it occurs on in the snippet you've provided. – Patrick Roberts Nov 06 '18 at 17:13
  • @PatrickRoberts Sorry, you are right! I should have been more specific. I meant that this returns an `Object`, not a `Promise` – Nick Bull Nov 06 '18 at 17:40
  • Hi @NickBull I edited my answer. Note the `const filtered = ...` that transforms the object generated by the map function into an array of promises. – Hammerbot Nov 07 '18 at 09:02
  • @Hammerbot Sorry I just remembered about this question. Unfortunately I still believe that this has issues, as `.map()` does not handle asynchronicity well in the callback. See my answer for how I've gone about avoiding that – Nick Bull Nov 07 '18 at 09:14
  • @Hammerbot Thank you for the `Promise.all()` example though, that really helped to push the answer in the correct direction! – Nick Bull Nov 07 '18 at 09:17
1

Try it this way:

let hrefs = contentsPage(contentsSelector).map((idx, topicsAnchor) => {
  return topicsAnchor.attribs['href']
}).get()


let topicsHtml
for(href of hrefs){
  topicsHtml = await rp(href);
  console.log("topicsHtml.length: ", topicsHtml.length);
}

Now the await is outside of map or each which doesn't quite work the way you think.

pguardiario
  • 53,827
  • 19
  • 119
  • 159
  • This is a great answer! I think `.toArray` is slightly more appropriate here? But Reading upon it, I'm unsure if it really makes a difference though, past readability. You were definitely on the right path about not using a `Promise` of any sort within the `map()` - it doesn't seem to honour the synchronicity that way. Is that a general callback thing, or just a quirk of `cheerio`? – Nick Bull Nov 07 '18 at 09:19
  • 1
    It's not just cheerio, js map works the same way. The way I look at is the await will prevent the code below it from running before it comes back but not the iterator (higher order function?) itself from continuing. – pguardiario Nov 07 '18 at 23:44
  • Thanks, accepted as the answer as this achieved what I needed. I've "improved" it a little for my needs in my answer :) – Nick Bull Nov 08 '18 at 11:09
1

Based on the other answers here I came to a rather elegant conclusion. Note the avoidance of async/await in the .map() callback, as cheerio's callbacks (and from what I've learned about async/await, generally all callbacks) seem not to honour the synchronous nature of await well:

async function parseAutodeskSpec(contentsHtmlFile) {
  const contentsPage = cheerio.load(fs.readFileSync(contentsHtmlFile).toString());
  const contentsSelector = '.content_htmlbody table td div div#divtreed0e338374 nobr .toc_entry a.treeitem';

  const contentsReqs = contentsPage(contentsSelector)
    .map((idx, elem) => rp(contentsPage(elem).attr('href')))
    .toArray();

  const topicsReqs = await Promise.all(contentsReqs)
    .map(req => parseAutodeskTopics(req));

  return await Promise.all(topicsReqs);
}
Nick Bull
  • 9,518
  • 6
  • 36
  • 58