
I am trying to scrape a website using Cheerio. Because the app is dynamic, the content I need is not present in the HTML markup but in a JS object inside a script tag, and I am not sure how to access it (I have tried window, document, etc.).

My code:

let axios = require('axios') // HTTP client
let cheerio = require('cheerio') // HTML parsing package

const url = 'https://www.foo.com'

const getWebsiteContent = async (url) => {
    try {
        const response = await axios.get(url)
        const $ = cheerio.load(response.data)
        console.log(response.data)
    } catch (error) {
        console.error(error)
    }
}

getWebsiteContent(url)

The result of the console.log (I am pasting only the part that I need to access):

<!DOCTYPE html>
<html lang='en' ng-app='Test'>
<head>
</head>
<body class='' data-allow-utf8='false'>
<h1>HEADER</h1>
<script>
  var matchData = function () {
    Live.load.main({
      version:           "1.2",
      sports:            [
          {
              title: 'matchone',
              subtitle: 'foo'
          },
          {
              title: 'matchtwo',
              subtitle: 'aaa'
          }
      ],
    })
}


</script>
<!-- More stuff -->
</body>
</html>

The data I want to access is the sports array, which is passed to the Live.load.main call inside the matchData function.

I am not even sure whether Cheerio is the correct tool. I was expecting the data to be in the HTML itself, but apparently it is loaded in such a way that it only appears inside a JS object in the response to the GET request.

Uwe Keim
greatTeacherOnizuka
  • Possible duplicate of [How can I scrape pages with dynamic content using node.js?](https://stackoverflow.com/questions/28739098/how-can-i-scrape-pages-with-dynamic-content-using-node-js) – Caltor Dec 19 '18 at 10:13

1 Answer


First, get the content of the script tag with $('script').text(). You may need to adjust the selector if there are more script tags on the page. Then match the array you want to access with regex:

const script = $('script').text();
const [, arrStr] = script.match(/sports:\s+(\[[\s\S]+\])/);
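
Note that match() returns null when the pattern is not found, in which case the destructuring above throws a TypeError. A minimal sketch of a guarded version follows; the helper name extractSportsLiteral and the sampleScript string (a stand-in for the value of $('script').text(), based on the page in the question) are assumptions made for illustration:

```javascript
// Hypothetical helper: returns the text of the `sports` array literal,
// or null when the script does not contain one.
function extractSportsLiteral(scriptText) {
  // Same pattern as above; match() returns null when nothing matches.
  const match = scriptText.match(/sports:\s+(\[[\s\S]+\])/);
  return match ? match[1] : null;
}

// Stand-in for $('script').text() on the page from the question.
const sampleScript = `
  var matchData = function () {
    Live.load.main({
      version: "1.2",
      sports: [
        { title: 'matchone', subtitle: 'foo' },
        { title: 'matchtwo', subtitle: 'aaa' }
      ],
    })
  }
`;

const literal = extractSportsLiteral(sampleScript);
const missing = extractSportsLiteral('var unrelated = 1;');
```

Because the quantifier is greedy, the capture runs up to the last closing bracket in the script text. That works for the sample page, but it would likely need a tighter pattern if another array followed the sports one.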

Finally, use eval to turn the string into an array:

const arr = eval(arrStr);
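
Since eval runs the scraped text with access to the local scope, one possible alternative is Node's built-in vm module, which evaluates the string in a separate, empty context. This is only a sketch: the helper name evaluateArrayLiteral and the sportsLiteral sample string are assumptions, and the Node documentation states that vm is not a security mechanism, so it should not be relied on for untrusted pages.

```javascript
const vm = require('vm');

// Hypothetical helper: evaluates an array literal in a fresh, empty context
// instead of calling eval in the current scope.
function evaluateArrayLiteral(arrayLiteral) {
  // The parentheses force the text to be parsed as an expression.
  const value = vm.runInNewContext(`(${arrayLiteral})`, {}, { timeout: 100 });
  // Round-trip through JSON so the result is plain data from the current context.
  return JSON.parse(JSON.stringify(value));
}

// Stand-in for the string captured by the regular expression.
const sportsLiteral =
  "[{ title: 'matchone', subtitle: 'foo' }, { title: 'matchtwo', subtitle: 'aaa' }]";

const sports = evaluateArrayLiteral(sportsLiteral);
console.log(sports.map((sport) => sport.title));
```

The array literal on the page uses single quotes and unquoted keys, so it is not valid JSON and JSON.parse cannot read it directly; that is why some form of JavaScript evaluation is used here.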

See demo.

Michał Perłakowski