I'm trying to scrape data from a CDC website.

I'm using axios to fetch the page and cheerio.js to parse it, and copying the HTML selector into my code, like so:

const listItems = $('#tab1_content > div > table > tbody > tr:nth-child(1) > td:nth-child(3)');

However, when I run the program, I just get a blank array. How is this possible? I'm copying the HTML selector verbatim into my code, so why is this not working? Here is a short video showing the issue: https://youtu.be/a3lqnO_D4pM

Here is my full code, along with a link where you can run the code:

const axios = require("axios");
const cheerio = require("cheerio");
const fs = require("fs");

// URL of the page we want to scrape
const url = "https://nccd.cdc.gov/DHDSPAtlas/reports.aspx?geographyType=county&state=CO&themeId=2&filterIds=5,1,3,6,7&filterOptions=1,1,1,1,1";

// Async function which scrapes the data
async function scrapeData() {
  try {
    // Fetch HTML of the page we want to scrape
    const { data } = await axios.get(url);
    // Load HTML we fetched in the previous line
    const $ = cheerio.load(data);
    // Select the table cell using the selector copied from the browser
    const listItems = $('#tab1_content > div > table > tbody > tr:nth-child(1) > td:nth-child(3)');
    // Stores data in array
    const dataArray = [];
    // Use .each method to loop through the elements
    listItems.each((idx, el) => {
      // Object holding data
      const dataObject = { name: ""};
      // Store the textcontent in the above object
      dataObject.name = $(el).text();
      // Populate array with data
      dataArray.push(dataObject);
    });
    // Log array to the console
    console.dir(dataArray);
  } catch (err) {
    console.error(err);
  }
}
// Invoke the above function
scrapeData();

Run the code here: https://replit.com/@STCollier/Web-Scraping#index.js

Thanks for any help.

Scollier
  • Why scrape? Afaik, most publicly available US Government data is accessible directly through one or more APIs (for reasons of transparency and probably also to prevent potential problems with ad hoc scrapers). I'm not sure about this particular dataset but a starting point might be [CDC APIs](https://open.cdc.gov/apis.html) or [socrata API Endpoints](https://dev.socrata.com/docs/endpoints.html). Just a thought. (Another thought, maybe they are blocking automated requests?) – ashleedawg Mar 24 '22 at 02:29
  • @ashleedawg, I've heard from someone that I might need to specify my user agent, and perhaps this is why I'm not getting the data. As for the CDC APIs, I'd rather use traditional web scraping for the simplicity. – Scollier Mar 24 '22 at 14:43
  • Data is probably injected asynchronously by JS. – ggorlen Jan 01 '23 at 02:34

1 Answer

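The data is added dynamically by JavaScript after the page loads, so the HTML returned by axios doesn't contain it. Cheerio only parses that static HTML and never runs the page's scripts, which is why the selector matches nothing.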

One approach that currently works is to use Puppeteer to load the page in a real browser and intercept the network response that carries the data.

const puppeteer = require("puppeteer"); // ^21.0.2

const url =
  "https://nccd.cdc.gov/DHDSPAtlas/reports.aspx?geographyType=county&state=CO&themeId=2&filterIds=5,1,3,6,7&filterOptions=1,1,1,1,1";

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const urlPrefix =
    "https://nccd-proxy.services.cdc.gov/DHDSP_ATLAS/report/state";
  const responseP = page.waitForResponse(res =>
    res.url().startsWith(urlPrefix) &&
    res.request().method() === "POST"
  );
  await page.goto(url);
  const response = await responseP;
  console.log(await response.json());
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

Output:

{
  TitleLong: 'Heart Disease Hospitalization Rate per 1,000 Medicare Beneficiaries, All Races/Ethnicities, All Genders, Ages 65+, 2018-2020',
  TitleShort: 'Heart Disease Hospitalization Rate per 1,000 Medicare Beneficiaries',
  TitleLegend: 'Age-Standardized Rate per 1,000 Beneficiaries',
  ReportText: 'heart disease hospitalization rate for All Races/Ethnicities, All Genders, Ages 65+ for  is ',
  Data: [
    {
      StateValue: 27.6,
      NationalValue: 41.6,
      RaceName: 'All Races/Ethnicities'
    },
    { StateValue: 36, NationalValue: 51.6, RaceName: 'Black' },
    { StateValue: 27.6, NationalValue: 41.5, RaceName: 'White' },
    { StateValue: 24.9, NationalValue: 32.6, RaceName: 'Hispanic' }
  ]
}
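Once the JSON is captured, it may be handy to flatten it into simple rows. The following is only a sketch: `toRows` and the output key names are hypothetical, and the sample object copies a few fields from the output above rather than the full response.

```javascript
// Sample shaped like the response printed above (values copied from that output).
const report = {
  TitleShort:
    "Heart Disease Hospitalization Rate per 1,000 Medicare Beneficiaries",
  Data: [
    { StateValue: 27.6, NationalValue: 41.6, RaceName: "All Races/Ethnicities" },
    { StateValue: 36, NationalValue: 51.6, RaceName: "Black" },
  ],
};

// Hypothetical helper: map each entry in Data to a flat row object.
// Falls back to an empty array if Data is missing.
function toRows(report) {
  return (report.Data ?? []).map(({ RaceName, StateValue, NationalValue }) => ({
    race: RaceName,
    state: StateValue,
    national: NationalValue,
  }));
}

console.table(toRows(report));
```

In the Puppeteer script, the same helper could be called as `toRows(await response.json())`, assuming the response keeps the shape shown in the output.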

If you want to click buttons on the page to adjust filters, Puppeteer can do that as well.

ggorlen