
I want to scrape, e.g., the titles of the first 200 questions under the web page https://www.quora.com/topic/Stack-Overflow-4/all_questions. I tried the following code:

    import requests
    from bs4 import BeautifulSoup

    url = "https://www.quora.com/topic/Stack-Overflow-4/all_questions"
    print("url")
    print(url)
    r = requests.get(url)  # HTTP request
    print("r")
    print(r)
    html_doc = r.text  # Extract the HTML
    print("html_doc")
    print(html_doc)
    soup = BeautifulSoup(html_doc, 'lxml')  # Create a BeautifulSoup object
    print("soup")
    print(soup)

It gave me the text at https://pastebin.com/9dSPzAyX. If we search for `href='/`, we can see that the HTML does contain the titles of some questions. However, the problem is that there are not enough of them; on the actual web page, a user has to manually scroll down to trigger the loading of extra content.

Does anyone know how I could mimic "scrolling down" programmatically to load more of the page's content?

SoftTimur
  • Possible duplicate of [How can I scroll a web page using selenium webdriver in python?](https://stackoverflow.com/questions/20986631/how-can-i-scroll-a-web-page-using-selenium-webdriver-in-python) – gdlmx Mar 05 '19 at 03:04

4 Answers


Infinite scrolling on a web page is based on JavaScript functionality. Therefore, to find out what URL we need to access and what parameters to use, we need to either thoroughly study the JS code working inside the page or, preferably, examine the requests the browser makes when you scroll down the page. We can study the requests using the Developer Tools. See the example for Quora:

The more you scroll down, the more requests are generated. Your requests can then be made to that URL instead of the normal page URL, but keep in mind to send the correct headers and payload.
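
As a rough illustration of the idea, here is a minimal sketch of replaying such a request with `requests`. The endpoint URL, headers, and payload below are placeholders, not Quora's real API; you would copy the actual values from the captured request in the Network tab of the Developer Tools:

    import requests

    # Placeholder endpoint: substitute the URL you observe in the Network tab
    api_url = "https://www.quora.com/some/xhr/endpoint"

    # Placeholder headers and payload: copy these from the captured request
    headers = {
        "User-Agent": "Mozilla/5.0",
        "Content-Type": "application/json",
    }
    payload = {"page": 2}  # hypothetical pagination parameter

    r = requests.post(api_url, headers=headers, json=payload)
    print(r.status_code)
    print(r.text)  # inspect the response to find the question titles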

Another, easier solution is to use Selenium.

Majd Msahel

I recommend using Selenium rather than bs4.
Selenium can control the browser as well as do the parsing: scrolling down, clicking buttons, etc.

This example scrolls down to collect all the users who liked a post on Instagram:
https://stackoverflow.com/a/54882356/5611675
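
For the Quora page from the question, a minimal sketch of the same scroll-down idea (assuming chromedriver is installed and on your PATH) could look like this:

    import time
    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get("https://www.quora.com/topic/Stack-Overflow-4/all_questions")

    for _ in range(5):
        # Scroll to the bottom of the page to trigger the next batch of questions
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(3)  # crude wait for the new content to load

    html_doc = driver.page_source  # now contains the extra questions
    driver.quit()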

linizio

If the content only loads on "scrolling down", this probably means that the page is using JavaScript to load the content dynamically.

You can try using a web client such as PhantomJS to load the page and execute the JavaScript in it, and simulate the scroll by injecting some JS such as `document.body.scrollTop = sY;` (see [Simulate scroll event using Javascript](https://stackoverflow.com/questions/20976392/simulate-scroll-event-using-javascript)).
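
A minimal sketch of that approach using Selenium's PhantomJS driver (assuming the phantomjs binary is on your PATH; PhantomJS support is deprecated in newer Selenium releases):

    import time
    from selenium import webdriver

    driver = webdriver.PhantomJS()  # headless; requires the phantomjs binary
    driver.get("https://www.quora.com/topic/Stack-Overflow-4/all_questions")

    # Inject JS to jump the scroll position to the bottom, as described above
    driver.execute_script("document.body.scrollTop = document.body.scrollHeight;")
    time.sleep(3)  # give the page time to load the next batch

    html_doc = driver.page_source
    driver.quit()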

Grace Fu

I couldn't find a way to do this using requests, but you can use Selenium. The code first prints out the number of questions on the initial load, then sends the End key to mimic scrolling down. You can see the number of questions go from 20 to 40 after sending the End key.

I wait 5 seconds (`time.sleep(5)`) before reading the DOM again, in case the script runs faster than the page loads. You can improve this by using explicit waits (EC) with Selenium.

The page loads 20 questions per scroll. So if you are looking to scrape 100 questions, then you need to send the End key 5 times.

To use the code below, you need to install chromedriver: http://chromedriver.chromium.org/downloads

    import time

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.keys import Keys

    CHROMEDRIVER_PATH = ""
    CHROME_PATH = ""
    WINDOW_SIZE = "1920,1080"

    chrome_options = Options()
    # chrome_options.add_argument("--headless")
    chrome_options.add_argument("--window-size=%s" % WINDOW_SIZE)
    chrome_options.binary_location = CHROME_PATH
    # Disable image loading to speed up page loads
    prefs = {'profile.managed_default_content_settings.images': 2}
    chrome_options.add_experimental_option("prefs", prefs)

    url = "https://www.quora.com/topic/Stack-Overflow-4/all_questions"

    def scrape(url, times):

        if not url.startswith('http'):
            raise Exception('URLs need to start with "http"')

        driver = webdriver.Chrome(
            executable_path=CHROMEDRIVER_PATH,
            chrome_options=chrome_options
        )

        driver.get(url)

        counter = 1
        while counter <= times:

            # Count the questions currently in the DOM
            q_list = driver.find_element_by_class_name('TopicAllQuestionsList')
            questions = q_list.find_elements_by_xpath('//div[@class="pagedlist_item"]')
            print(len(questions))

            # Send the End key to the page to mimic scrolling down
            html = driver.find_element_by_tag_name('html')
            html.send_keys(Keys.END)

            # Crude wait for the newly loaded questions to appear in the DOM
            time.sleep(5)

            questions2 = q_list.find_elements_by_xpath('//div[@class="pagedlist_item"]')
            print(len(questions2))

            counter += 1

        driver.close()

    if __name__ == '__main__':
        scrape(url, 5)
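
If you want to replace the `time.sleep(5)` with an explicit wait, one possible sketch (the helper name is mine, not part of the original answer) is to wait with `WebDriverWait` until the count of `pagedlist_item` divs grows:

    from selenium.webdriver.support.ui import WebDriverWait

    def wait_for_more_questions(driver, previous_count, timeout=10):
        # Any callable taking the driver works as a wait condition
        WebDriverWait(driver, timeout).until(
            lambda d: len(d.find_elements_by_xpath('//div[@class="pagedlist_item"]')) > previous_count
        )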
sammyyy
  • Thank you for the full code... But are you sure `driver.implicitly_wait(5)` works? The browser closes immediately in my test, and we get `questions2` the same as `questions`. – SoftTimur Mar 05 '19 at 03:44
  • Additionally, we need to scroll down for extra load, we don't see `scroll down` in your code. – SoftTimur Mar 05 '19 at 03:46
  • I used sending the End key to mimic scrolling down, and updated the code with a wait and time.sleep. This is probably not the best way to do it, but I couldn't figure out how to use EC to wait for the element to appear in the DOM. – sammyyy Mar 05 '19 at 05:27