
I'm trying to index a food recipes page, and the actual recipe is stored as an object in a JavaScript script embedded in the page.

One example URL: http://www.dagbladet.no/mat/oppskrift/bakt-potet-med-romme-og-blamuggostdressing

If I open the developer tool in the browser and type:

console.dir(food.recipeItem.title)

I get the title back:

"Bakt potet med rømme- og blåmuggostdressing"

All nice and dandy, and just what I need. But how can I get hold of that script and parse it within a Node.js application? Cheerio might help me find the script, but would it do much more than that? I'm not sure how to do this, or which approach is the most computationally efficient or the most robust.

Espen Klem

1 Answer


It's pretty easy: all you have to do is parse the returned HTML. If you inspect it (view-source:http://www.dagbladet.no/mat/oppskrift/bakt-potet-med-romme-og-blamuggostdressing), you will find a script tag that contains all the information you need in several JavaScript variables. These variables hold JSON data. Since the script is hardcoded directly into the HTML document, and not obtained by XHR or similar, parsing the HTML is the only way to do this.

So basically you have these 3 steps:

1. Send an HTTP GET request to the link above.

2. Parse the HTML string to extract the script tag by using some library (check this link to decide which library to use).

3. Parse the JavaScript string (the script extracted in step 2) to extract the JSON data. The UglifyJS library for Node.js can parse JavaScript source.
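The three steps could be sketched roughly as follows. This is a minimal sketch under several assumptions, not a tested scraper for the real page: `sampleHtml` is a made-up stand-in for the returned HTML, the helper names (`fetchHtml`, `extractInlineScripts`, `readGlobal`, `findRecipe`) are hypothetical, and it assumes the page declares `food` with `var` in an inline script. To stay dependency-free it swaps in a regular expression for the HTML library in step 2 and Node's built-in `vm` module for UglifyJS in step 3, so the script is evaluated rather than parsed.

```javascript
const vm = require('node:vm');

// Hypothetical stand-in for the HTML returned by step 1.
// The real page's markup is an assumption and will differ.
const sampleHtml = `<!doctype html>
<html><head>
<script src="/static/app.js"></script>
<script>
  var food = { recipeItem: { title: "Bakt potet med rømme- og blåmuggostdressing" } };
</script>
</head><body></body></html>`;

// Step 1 (not called here): download the page. Assumes Node 18+,
// where fetch is available as a global.
async function fetchHtml(url) {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Request failed with status ${res.status}`);
  return res.text();
}

// Step 2: collect the bodies of inline <script> tags.
// A regex is used to avoid dependencies; an HTML parser such as
// Cheerio would be more robust against unusual markup.
function extractInlineScripts(html) {
  const scripts = [];
  const re = /<script\b[^>]*>([\s\S]*?)<\/script>/gi;
  let match;
  while ((match = re.exec(html)) !== null) {
    if (match[1].trim() !== '') scripts.push(match[1]);
  }
  return scripts;
}

// Step 3: run the script in a fresh context and read one global back.
// Only top-level `var` and function declarations end up on the sandbox.
// Note: vm is not a security boundary, so only use this on pages you trust.
function readGlobal(scriptSource, name) {
  const sandbox = {};
  vm.createContext(sandbox);
  vm.runInContext(scriptSource, sandbox, { timeout: 1000 });
  return sandbox[name];
}

// Try each inline script that mentions recipeItem until one yields the object.
function findRecipe(html) {
  for (const source of extractInlineScripts(html)) {
    if (!source.includes('recipeItem')) continue;
    try {
      const food = readGlobal(source, 'food');
      if (food && food.recipeItem) return food.recipeItem;
    } catch (err) {
      // The script probably needed browser globals (window, document); try the next one.
    }
  }
  return null;
}

const recipe = findRecipe(sampleHtml);
console.log(recipe ? recipe.title : 'no recipe found');
```

Against the live page you would pass the result of `await fetchHtml(url)` to `findRecipe` instead of `sampleHtml`. If the real script touches `window` or `document`, evaluation will throw and you would need either stubs in the sandbox or a real JavaScript parser, as suggested in step 3.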

Glorfindel
Boy