
I am quite new to using BeautifulSoup. I am trying to scrape the text from the website using the code below, but find_all returns nothing.

import urllib.request
from bs4 import BeautifulSoup

source = urllib.request.urlopen('https://beta.regulations.gov/document/USCIS-2019-0010-9175').read()
soup = BeautifulSoup(source, 'html.parser')  # parse the fetched bytes (there is no `page` variable)
text = soup.find_all(class_="px-2")
print(text)

HTML for the website:

  • Probably the content is retrieved using JavaScript – GiovaniSalazar Jan 04 '20 at 23:42
  • Most likely that content isn't in the original source, but generated by Javascript. You can check the source of the original page to see if it's there. BeautifulSoup can't execute JS, it just looks at the HTML source. – Robin Zigmond Jan 04 '20 at 23:44
  • 1
    I'm voting to close this. The data is generated dynamically on the website, this exact issue has been covered many times before on Stack Overflow. I don't believe it will bring anything of value. – AMC Jan 05 '20 at 02:10
  • Other questions which cover this exact issue: https://stackoverflow.com/q/51117692/11301900, https://stackoverflow.com/q/55351871/11301900, https://stackoverflow.com/q/51984646/11301900, https://stackoverflow.com/q/38334715/11301900, https://stackoverflow.com/q/44867425/11301900, https://stackoverflow.com/q/48313615/11301900, https://stackoverflow.com/q/36060624/11301900, https://stackoverflow.com/q/57226247/11301900, https://stackoverflow.com/q/36854623/11301900, https://stackoverflow.com/q/52102257/11301900, https://stackoverflow.com/q/59205843/11301900. I'm certain that there are many more. – AMC Jan 05 '20 at 02:21
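The comments' point can be shown concretely with a minimal sketch (the HTML snippet below is made up for illustration): BeautifulSoup only finds elements that exist in the markup it is given, so if `find_all` returns an empty list, the element was never in the downloaded source to begin with.

```python
from bs4 import BeautifulSoup

# Made-up snippet standing in for the raw HTML a server returns.
# A node inserted later by JavaScript would simply not be in this string.
static_html = '<div class="px-2">static text</div>'

soup = BeautifulSoup(static_html, 'html.parser')
print(soup.find_all(class_="px-2"))     # finds the div: it is in the source
print(soup.find_all(class_="js-only"))  # [] -> same empty result the question saw
```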

1 Answer

As stated in the comments, the data is loaded dynamically via JavaScript. But if you open the Firefox/Chrome network tab, you can see where the data comes from:

import requests

url = 'https://beta.regulations.gov/document/USCIS-2019-0010-9175'
ajax_url = 'https://beta.regulations.gov/api/documentdetails/{}'

document_id = url.split('/')[-1]
data = requests.get(ajax_url.format(document_id)).json()

# from pprint import pprint  # <-- uncomment to see all data
# pprint(data)

print(data['data']['attributes']['content'])

Prints:

Rescind the increase in fees. This is draconian. For all intents and purposes, denying access to this information will prevent many Americans from knowing where they came from. This is an outrage. This is not the mark of a democracy. I strongly disagree with this fee increase
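If you rely on this endpoint, note that the nested keys could change or be absent for some documents. A defensive lookup avoids a `KeyError` in that case; a minimal sketch, using a hypothetical sample payload mirroring the response shape above:

```python
# Hypothetical sample payload shaped like the documentdetails response above.
sample = {'data': {'attributes': {'content': 'Rescind the increase in fees.'}}}

def get_content(payload):
    # Chained .get() calls return None instead of raising KeyError
    # when any level of the expected structure is missing.
    return payload.get('data', {}).get('attributes', {}).get('content')

print(get_content(sample))  # Rescind the increase in fees.
print(get_content({}))      # None
```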