I am using jsdom to parse results from Google Shopping. The following code takes a Google Shopping link, loads the page, and extracts the table that contains all of the results:

const jsdom = require('jsdom').JSDOM;

function parseSite() {
    const url = "https://www.google.com/shopping/product/8352592323560827089/online";
    let trimmedTable = "";

    jsdom.fromURL(url).then(function (dom) {
        let innerHtml = dom.window.document.querySelector('html').innerHTML;
        let tableStartIndex = innerHtml.search("<tbody><tr ");
        let nonTrimmedTable = innerHtml.substr(tableStartIndex + 7, innerHtml.length);
        let tableEndIndex = nonTrimmedTable.search("</tbody></table>");
        trimmedTable = nonTrimmedTable.substr(0, tableEndIndex);
    });
}

parseSite();

I realize that promises are asynchronous and that I seem to be trying to use one synchronously, but jsdom was the only thing I could find that loads the entire web page as if it were a web browser. I do not want to use Selenium because performance would take a hit. The code itself works exactly how I want it to; I just need to get the value of trimmedTable outside of the promise.

My question: Is there something better out there than jsdom for loading and extracting data from web pages as if they were being loaded in the browser? (something that can accomplish what I am trying to do in the provided code) If not, how can I write my code so that I can get the result of trimmedTable assigned to a variable outside of the promise?

  • Does this answer your question? [How do I return the response from an asynchronous call?](https://stackoverflow.com/questions/14220321/how-do-i-return-the-response-from-an-asynchronous-call) – Roamer-1888 Jul 06 '20 at 00:50
  • Please tell me JSDOM doesn't require string handling of HTML in order to scrape a web page? – Roamer-1888 Jul 06 '20 at 00:55
  • @Roamer-1888 I saw that, but I could not figure out how to apply it to my scenario since I am new to web development. It doesn't require string handling; I am choosing to get the HTML as a string by using .innerHTML. – Moist Carrot Jul 06 '20 at 01:10
  • You shouldn't need string handling to find a DOM node. Try finding the table using the CSS selector 'table' (as you are currently doing with `.querySelector('html')`) *then*, if you need the table's HTML, get its `.innerHTML` property. But it seems unlikely that you really need the table's innerHTML. The caller of `parseSite()` would be better off with the table node, from which it can extract the innerHTML if necessary (which it probably isn't). – Roamer-1888 Jul 06 '20 at 01:17
  • You might want `dom.window.document.querySelectorAll('table')[0]` or `dom.window.document.querySelectorAll('table')[1]` – Roamer-1888 Jul 06 '20 at 01:20
  • The more you can work with the DOM node (and its contents) the better. HTML is just an awkward string. – Roamer-1888 Jul 06 '20 at 01:22
  • @Roamer-1888 I just tried using querySelectorAll like you said, but for some reason it doesn't grab all the table rows the way it does when I grab the entire HTML of the page. When I grab the HTML of the entire page and look for my specific table using search/substr, I am able to get all the data. I'm still not sure how I can put that data into a string outside of the promise so I can parse it, though. – Moist Carrot Jul 06 '20 at 01:29
  • You may have to grab `let html = dom.window.document.querySelector('html')` then grab the table from `html`. – Roamer-1888 Jul 06 '20 at 01:34
  • [How do I return the response from an asynchronous call?](https://stackoverflow.com/questions/14220321/how-do-i-return-the-response-from-an-asynchronous-call) will answer your question on using asynchronously derived data. – Roamer-1888 Jul 06 '20 at 01:36

1 Answer

I was able to solve my problem by using an asynchronous function combined with Promise.all().

The great thing about Promise.all() is that I can create multiple async functions and then collect the values their promises resolve to in a single callback. From there, I can assign those values to a variable and process them however I want. Here is my code now:

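const jsdom = require('jsdom').JSDOM;

// Resolves to the markup between "<tbody>" and "</tbody></table>" of the results table.
let retailers = async function parseRetailers() {
    const webPage = await jsdom.fromURL("https://www.google.com/shopping/product/8352592323560827089");

    let innerHtml = webPage.window.document.querySelector('html').innerHTML;
    let tableStartIndex = innerHtml.search("<tbody><tr ");
    // Skip the 7 characters of "<tbody>" so the result starts at the first "<tr ".
    let nonTrimmedTable = innerHtml.substr(tableStartIndex + 7, innerHtml.length);
    let tableEndIndex = nonTrimmedTable.search("</tbody></table>");

    return nonTrimmedTable.substr(0, tableEndIndex);
}

let allRetailers = [];

// values is an array with one entry per promise passed to Promise.all().
// allRetailers is populated only after those promises have resolved.
Promise.all([retailers()])
    .then(values => {
        allRetailers = values;
    });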
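
For anyone who wants to try the pattern without hitting the network, here is a minimal sketch of the same idea. It is not the code I actually run: `extractTableBody`, `fakePages`, and `fakeLoadHtml` are hypothetical names made up for illustration, the URLs are placeholders, and the loader is passed in as a parameter so a stand-in can replace jsdom. It shows that the extracted values are usable once the promises have been awaited.

```javascript
// Pure helper: returns the markup between the first "<tbody>" that is
// immediately followed by "<tr " and the next "</tbody></table>".
// Returns null when either marker is missing.
function extractTableBody(html) {
    const start = html.indexOf("<tbody><tr ");
    if (start === -1) return null;
    const bodyStart = start + "<tbody>".length;
    const end = html.indexOf("</tbody></table>", bodyStart);
    if (end === -1) return null;
    return html.slice(bodyStart, end);
}

// loadHtml is an injected (hypothetical) function: url -> Promise<string of HTML>.
async function parseRetailers(loadHtml, url) {
    const html = await loadHtml(url);
    return extractTableBody(html);
}

// Stand-in pages and loader so the sketch runs offline (assumption: real
// pages would come from jsdom instead).
const fakePages = {
    "https://example.invalid/a": '<table><tbody><tr class="row"><td>Shop A</td></tr></tbody></table>',
    "https://example.invalid/b": '<table><tbody><tr class="row"><td>Shop B</td></tr></tbody></table>',
};

async function fakeLoadHtml(url) {
    return fakePages[url];
}

async function main() {
    // Promise.all resolves to an array in the same order as its input.
    const [first, second] = await Promise.all([
        parseRetailers(fakeLoadHtml, "https://example.invalid/a"),
        parseRetailers(fakeLoadHtml, "https://example.invalid/b"),
    ]);
    // Both values are available here, after the await.
    console.log(first);
    console.log(second);
    return [first, second];
}

main().catch(console.error);
```

With jsdom, the loader could presumably be something like `async url => (await jsdom.fromURL(url)).window.document.querySelector('html').innerHTML`, which is the same expression used in my code above. Any code that needs the results has to run after the `await` (or inside `.then()`), not before it.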