I'm trying to scrape this web page for the arguments that are in each of the headers.
My approach is to scroll all the way to the bottom of the page so that all the arguments are loaded (reaching the bottom doesn't take long) and then extract the HTML from there.
Here's what I've done. I got the scrolling code from here by the way.
import time

import urllib3
from bs4 import BeautifulSoup
from selenium import webdriver

SCROLL_PAUSE_TIME = 0.5

# launch url
url = 'https://en.arguman.org/fallacies'

# create chrome session
driver = webdriver.Chrome()
driver.implicitly_wait(30)
driver.get(url)

# get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# fetch the page and parse the <h2> headers
http = urllib3.PoolManager()
response = http.request('GET', url)
soup = BeautifulSoup(response.data, 'html.parser')

claims_h2 = soup('h2')
claims = []
for c in claims_h2:
    claims.append(c.get_text())

for c in claims:
    print(c)
This is what I get: only the arguments that are visible before scrolling loads more of them onto the page.
Plants should have the right to vote.
Plants should have the right to vote.
Plants should have the right to vote.
Postmortem organ donation should be opt-out
Jimmy Kimmel should not bring up inaction on gun policy (now)
A monarchy is the best form of government
A monarchy is the best form of government
El lenguaje inclusivo es innecesario
Society suffers the most when dealing with people having mental disorders
Illegally downloading copyrighted music and other files is morally wrong.
If you look and scroll all the way to the bottom of the page you'll see these arguments as well as many others.
Basically, my code doesn't seem to parse the updated HTML.
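One direction that might be worth checking, sketched here under assumptions: the scrolling happens inside the Selenium browser session, so the updated HTML would presumably have to be read from that same session (Selenium exposes it as `driver.page_source`) rather than fetched again with a separate HTTP request. The helper names `scroll_to_bottom`, `extract_h2_texts`, and `scrape_claims` are hypothetical, and the standard-library `html.parser` is swapped in for BeautifulSoup only to keep the sketch free of extra dependencies.

```python
import time
from html.parser import HTMLParser


class H2Collector(HTMLParser):
    """Collects the text content of every <h2> element."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.parts = []
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
            self.parts = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <h2>.
        if self.in_h2:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self.in_h2:
            self.headings.append("".join(self.parts).strip())
            self.in_h2 = False


def scroll_to_bottom(driver, pause=0.5):
    """Scroll until document.body.scrollHeight stops growing."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more items
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return
        last_height = new_height


def extract_h2_texts(html):
    """Return the stripped text of each <h2> in an HTML string."""
    parser = H2Collector()
    parser.feed(html)
    parser.close()
    return parser.headings


def scrape_claims(driver, pause=0.5):
    """Scroll, then parse the HTML held by the same browser session."""
    scroll_to_bottom(driver, pause)
    # Assumption: page_source reflects the DOM after scrolling.
    return extract_h2_texts(driver.page_source)
```

With the session from the code above, this would be called as `scrape_claims(driver, SCROLL_PAUSE_TIME)`. The same idea should carry over to BeautifulSoup by passing `driver.page_source` to it instead of `response.data`.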