
I am trying to scrape the following website: https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html

to get all of the state statistics on coronavirus.

My code below works:

require 'nokogiri'
require 'open-uri'
require 'httparty'
require 'pry'

url = "https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html"
doc = Nokogiri::HTML.parse(URI.open(url))
total_cases = doc.css("span.count")[0].text
total_deaths = doc.css("span.count")[1].text
new_cases = doc.css("span.new-cases")[0].text
new_deaths = doc.css("span.new-cases")[1].text

However, I am unable to get at the collapsed data/gridcell data.

I have tried searching by the aria-label attribute and by the .rt-tr-group class. Any help would be appreciated. Thank you.

3 Answers


That page is using AJAX to load its data.

In that case you may use Watir to fetch the page using a real browser, as answered here: https://stackoverflow.com/a/13792540/2784833

Another way is to get the data from the API directly. You can find the endpoints by checking the "Network" tab in your browser's developer tools.
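As a sketch of the Watir route (this assumes the watir gem plus a Chrome/chromedriver install, and the selectors are only illustrative):

```ruby
require 'watir'
require 'nokogiri'

# Drive a real browser so the page's JavaScript runs and the
# AJAX-loaded grid ends up in the rendered HTML.
browser = Watir::Browser.new :chrome, headless: true
browser.goto 'https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html'

# Hand the rendered HTML to Nokogiri and query it as usual.
doc = Nokogiri::HTML.parse(browser.html)
total_cases = doc.css('span.count')[0]&.text

browser.close
```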

Layon Ferreira

I replicated your code and found some errors you might have made.

require 'HTTParty'

will not work. You need to use

require 'httparty'

Secondly, there should be quotes around your url variable's value, i.e.

url = "https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html"

Other than that, it just worked fine for me.

Also, if you're trying to get the Covid-19 data, you might want to use these APIs:

  • For US Count
  • For US Daily Count
  • For US Count - States

You could learn more about the APIs here

rkalra
  • Hi! Thank you for that information, however, I was told to use the CDC for the information and to be able to search by state and by county. I will update my requires. Thank you for your help. – SincerelyBrittany May 12 '20 at 16:13
  • Can you share what error you're getting while executing the above script? – rkalra May 12 '20 at 17:06

Although the answer of Layon Ferreira already states the problem, it does not provide the steps needed to load the data.

As already said in the linked answer, the data is loaded asynchronously. This means the data is not present on the initial page, but is loaded afterwards by JavaScript code executed in the browser.

When you open up the browser development tools, go to the "Network" tab. You can clear out all requests, then refresh the page. You'll get to see a list of all requests made. If you're looking for asynchronously loaded data the most interesting requests are often those of type "json" or "xml".

When browsing through the requests you'll find that the data you're looking for is located at:

https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json


Since this is JSON you don't need "nokogiri" to parse it.

require 'httparty'
require 'json'

response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
data = JSON.parse(response.body)

When executing the above you'll get the exception:

JSON::ParserError ...

This is caused by a Byte Order Mark (BOM) that is not removed by HTTParty, most likely because the response doesn't specify a UTF-8 charset.

response.body[0]
#=> "" (the BOM character U+FEFF, which is invisible when printed)
format '%X', response.body[0].ord
#=> "FEFF"

To correctly handle the BOM, Ruby 2.7 added the set_encoding_by_bom method to IO, which is also available on StringIO.

require 'httparty'
require 'json'
require 'stringio'

response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
body = StringIO.new(response.body)
body.set_encoding_by_bom
data = JSON.parse(body.gets(nil))
#=> [{"Jurisdiction"=>"Alabama", "Range"=>"10,001 to 20,000", "Cases Reported"=>10145,  ...

If you're not yet using Ruby 2.7 you can strip the BOM manually instead, although the set_encoding_by_bom approach above is probably the safer option:

data = JSON.parse(response.body.force_encoding('utf-8').sub(/\A\uFEFF/, ''))
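The BOM handling above can be reproduced offline with a synthetic payload (the JSON sample below is made up, not the CDC data):

```ruby
require 'json'
require 'stringio'

# A JSON payload prefixed with the UTF-8 BOM bytes (EF BB BF),
# tagged as binary -- roughly the shape HTTParty hands back when
# the response declares no charset. The payload itself is made up.
raw = "\xEF\xBB\xBF[{\"Jurisdiction\":\"Alabama\"}]".b

# Ruby 2.7+: consume the BOM and set the encoding in one step.
body = StringIO.new(raw)
body.set_encoding_by_bom
data = JSON.parse(body.read)

# Pre-2.7 fallback: tag the bytes as UTF-8, then strip U+FEFF.
fallback = JSON.parse(raw.dup.force_encoding('utf-8').sub(/\A\uFEFF/, ''))

data == fallback
```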
3limin4t0r