
I'm trying to scrape this web page for the arguments that are in each of the headers.

What I've tried to do is scroll all the way to the bottom of the page so all the arguments are revealed (it doesn't take that long to reach the bottom of the page) and then extract the html code from there.

Here's what I've done. I got the scrolling code from here by the way.

import time

import urllib3
from bs4 import BeautifulSoup
from selenium import webdriver

SCROLL_PAUSE_TIME = 0.5

#launch url
url = 'https://en.arguman.org/fallacies'

#create chrome session
driver = webdriver.Chrome()
driver.implicitly_wait(30)
driver.get(url)

#get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")


while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

http = urllib3.PoolManager()
response = http.request('GET', url)
soup = BeautifulSoup(response.data, 'html.parser')

claims = [c.get_text() for c in soup('h2')]

for c in claims:
    print(c)

This is what I get: only the arguments that are visible before scrolling, without the ones that get added to the page as you scroll.

Plants should have the right to vote.
Plants should have the right to vote.
Plants should have the right to vote.
Postmortem organ donation should be opt-out
Jimmy Kimmel should not bring up inaction on gun policy (now)
A monarchy is the best form of government
A monarchy is the best form of government
El lenguaje inclusivo es innecesario
Society suffers the most when dealing with people having mental disorders
Illegally downloading copyrighted music and other files is morally wrong.

If you look and scroll all the way to the bottom of the page you'll see these arguments as well as many others.

Basically, my code doesn't seem to parse the updated HTML.

AlexT
  • Hello. I am the creator of this web site. You can just increase the offset parameter and crawl it by urllib or something else instead of using selenium. https://en.arguman.org/fallacies?offset=20 – Fatih Erikli Feb 08 '19 at 01:19
  • There is also a way to avoid using Selenium; see [this question](https://stackoverflow.com/questions/56586016/requests-html-and-infinite-scrolling). – Victoria Oct 29 '20 at 18:57
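
A rough sketch of the offset approach suggested in the comment above, using only the standard library. Everything beyond the `offset` query parameter is an assumption: the page size of 20 is inferred from the example URL, the idea that a page past the end contains no `h2` elements is a guess, and `H2Collector`, `extract_h2`, `fetch`, and `crawl` are hypothetical helper names. `max_pages` is there as a safety cap in case the stop condition never triggers.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://en.arguman.org/fallacies"
PAGE_SIZE = 20  # assumption: inferred from the ?offset=20 example


class H2Collector(HTMLParser):
    """Collect the text content of every <h2> element."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._buffer = None  # text chunks gathered while inside an <h2>

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._buffer is not None:
            self.headings.append("".join(self._buffer).strip())
            self._buffer = None


def extract_h2(html):
    """Return the text of all <h2> elements found in an HTML string."""
    parser = H2Collector()
    parser.feed(html)
    parser.close()
    return parser.headings


def fetch(url):
    """Download a URL and return its body as text."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(fetch_page=fetch, page_size=PAGE_SIZE, max_pages=50):
    """Request successive offsets until a page yields no headings."""
    claims = []
    for page in range(max_pages):
        url = BASE_URL + "?" + urlencode({"offset": page * page_size})
        headings = extract_h2(fetch_page(url))
        if not headings:  # assumed end-of-list signal
            break
        claims.extend(headings)
    return claims
```

Calling `crawl()` with no arguments would request the live site; passing a different `fetch_page` function makes it easy to try the logic against saved HTML first. BeautifulSoup's `soup("h2")` could replace `extract_h2` if it is already installed.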

1 Answer


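It doesn't make sense to open the site with Selenium, do all the scrolling, and then make the request again with urllib3. The two processes are completely separate: urllib3 fetches a fresh copy of the initial page and never sees the content that Selenium's browser loaded while scrolling.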

Instead, when the scrolling is complete, pass driver.page_source to BeautifulSoup and extract the content from there:

import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.implicitly_wait(30)

try:
    SCROLL_PAUSE_TIME = 0.5
    driver.get("https://en.arguman.org/fallacies")

    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(SCROLL_PAUSE_TIME)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    soup = BeautifulSoup(driver.page_source, "html.parser")

    for c in soup("h2"):
        print(c.get_text())

finally:
    driver.quit()

Result:

Plants should have the right to vote.
Plants should have the right to vote.
Plants should have the right to vote.
Postmortem organ donation should be opt-out
Jimmy Kimmel should not bring up inaction on gun policy (now)
A monarchy is the best form of government
A monarchy is the best form of government
El lenguaje inclusivo es innecesario
Society suffers the most when dealing with people having mental disorders
Illegally downloading copyrighted music and other files is morally wrong.
Semi-colons are pointless in Javascript
You can't measure how good a programming language is.
You can't measure how good a programming language is.
Semi-colons are pointless in Javascript
Semi-colons are pointless in Javascript
Semi-colons are pointless in Javascript
...
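
If the scroll loop is needed in more than one place, it could be pulled into a small helper. This is only a sketch: `scroll_to_bottom` and its `max_rounds` cap are hypothetical additions, not part of Selenium, and the function relies only on the `execute_script` calls already used above.

```python
import time


def scroll_to_bottom(driver, pause=0.5, max_rounds=200):
    """Scroll until the page height stops growing.

    Returns the number of scrolls performed. max_rounds is a safety cap
    (an assumption, not a Selenium feature) for pages that never stop growing.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more items
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds
        last_height = new_height
    return max_rounds
```

After `driver.get(...)`, calling `scroll_to_bottom(driver)` and then reading `driver.page_source` would follow the same flow as the code above.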
cody
  • I see, so it was as if I was opening the page again, and all the scrolling wasn't being taken into account. Thank you for your answer – AlexT Feb 03 '19 at 03:33