Can't fetch the texts from a webpage

Question

I've written a script using Python and Selenium to get all the text available at the following link. The webpage uses lazy loading, so more content becomes visible with each scroll. My script can handle that too.

However, the problem is that when my script exhausts the page's content by reaching the bottom, it gets stuck right there. Once it can break out of the loop, I can fetch the content. How can I break out of the loop?

I know .LoadingDots is always there. And that is the only reason I can't find any logic to break the loop.

Link to that site

Here is what I've tried so far: (couldn't get rid of the loop)

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver,10)
driver.get("https://www.quora.com/topic/American-Football")

while True:

    try:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, ".LoadingDots")))
    except Exception: break

for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".ui_qtext_rendered_qtext .ui_qtext_para"))):
    print(item.text)

driver.quit()

I know I can solve the issue by doing the following:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
wait = WebDriverWait(driver,10)
driver.get("https://www.quora.com/topic/American-Football")

last_len = len(wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".ui_qtext_rendered_qtext .ui_qtext_para"))))

while True:
    for load_more in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "a[id$='_more']"))):
        driver.execute_script("arguments[0].click();",load_more)

    try:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        wait.until(lambda driver: len(wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".ui_qtext_rendered_qtext .ui_qtext_para")))) > last_len)
        items = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".ui_qtext_rendered_qtext .ui_qtext_para")))
        last_len = len(items)
    except TimeoutException: break

for item in items:
    print(item.text)

driver.quit()

My question is: how can I fetch the content from that page, exhausting all the scrolls, the way I tried in my first script using .LoadingDots?

You can watch the screen height and break when it stops going up — pguardiario, Nov 29 '18 at 00:05
I surely won't @ewwink. The data of that site are useless to me. All I wish to know is the technique like the way i tried and failed. — SIM, Dec 06 '18 at 19:11
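The suggestion in the first comment (watch the page height and stop when it no longer grows) could be sketched roughly as below. This is only an illustration, not code from the thread: the function name, the `pause` value and the `max_rounds` cap are assumptions, and `driver` is any object exposing Selenium's `execute_script`.

```python
import time


def scroll_until_height_stable(driver, pause=1.0, max_rounds=200):
    """Scroll to the bottom until document.body.scrollHeight stops growing.

    `pause` and `max_rounds` are illustrative defaults, not tuned values.
    Returns the number of scrolls that made the page taller.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    grew = 0
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height <= last_height:
            break  # height did not increase: assume nothing more will load
        last_height = new_height
        grew += 1
    return grew
```

A fixed pause is a trade-off: too short and a slow load looks like the end of the page, too long and the scrape is slow. The approach does not depend on .LoadingDots at all, so it does not answer the question as asked.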

Guy · Accepted Answer · 2018-12-04T11:15:56.203

2

When the page is scrolled to the bottom, the element with classes .LoadingDots.regular remains the same, but its parent element gains a new class, hidden. You can check whether the class was added using the get_attribute function. You can also locate the parent directly by its class, spinner_display_area.

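while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # The spinner's parent gains the "hidden" class once nothing more will load.
    # find_element(By.CLASS_NAME, ...) also works on Selenium 4, where
    # find_element_by_class_name was removed; By is imported in the question's script.
    loading_dots = driver.find_element(By.CLASS_NAME, 'spinner_display_area')
    if 'hidden' in loading_dots.get_attribute('class').split():
        break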
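As a hedged variation on the loop above, the sketch below adds a short pause between scrolls and a cap on the number of rounds, so the loop cannot spin forever if the hidden class never shows up. The helper names, `pause` and `max_rounds` are assumptions made for illustration. The string "class name" is the value of Selenium's By.CLASS_NAME, used here so the snippet has no imports beyond the standard library.

```python
import time


def has_class(class_attr, name):
    """True if `name` is one of the space-separated tokens in a class attribute."""
    return name in (class_attr or "").split()


def scroll_until_spinner_hidden(driver, pause=0.5, max_rounds=500):
    """Scroll until the spinner container carries the `hidden` class.

    Returns True if the class was seen, False if max_rounds ran out first.
    """
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # let the page react to the scroll
        # "class name" is the string value of selenium's By.CLASS_NAME
        area = driver.find_element("class name", "spinner_display_area")
        if has_class(area.get_attribute("class"), "hidden"):
            return True
    return False
```

Comparing whole class tokens rather than substrings avoids a false match on an unrelated class whose name merely contains "hidden".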

edited Dec 04 '18 at 11:15

answered Dec 04 '18 at 10:11


@asmitu Fixed the typo, thanks. The easiest way is to ask the developer what the locator is :). If you already knew about `.LoadingDots`, as I did: I used `$$(".LoadingDots")` in the console to get a list of all matching elements and went over the list to find the correct one, in this case the visible element that appeared under the posts for a second or two before the page was reloaded. – Guy Dec 04 '18 at 11:26
Also, in this case (it might not always work) you can just delete the posts element (the one with `class="paged_list_wrapper"`); that prevents the refresh, and the dots stay visible. – Guy Dec 04 '18 at 11:28
Sorry for any confusion @Guy. What I intended to ask is: when I don't know the locator, how can I catch it from inspection (using dev tools) while monitoring the scroll? Thanks. – SIM Dec 04 '18 at 11:35
@asmitu There isn't a very easy method. You need to stop the loading. One way is to delete the loading element from the HTML (see my second comment). Another way is to put a breakpoint on the loaded element. If you highlight the row in the HTML on the left side you will see 3 dots. Click on them and choose `break on > subtree modification`. At this point the page will stop before reloading, and you can use `right click > scroll into view` on the last item and look at the HTML tree under it. Hovering the cursor over an element in the HTML will highlight it on the page. – Guy Dec 04 '18 at 12:00
You can't believe how easy it is!! Just check out the answer given by [Joseph Tinoco](https://stackoverflow.com/questions/19422214/how-can-i-inspect-disappearing-element-in-a-browser). Let's wait while the bounty is active in case something more robust comes along. Thanks. – SIM Dec 04 '18 at 12:12

@asmitu That's exactly what I said. – Guy Dec 04 '18 at 12:14

score 0 · Answer 2 · answered Dec 07 '18 at 16:55

Your script doesn't work as expected because the (By.CSS_SELECTOR, ".LoadingDots") selector returns this element, <div class="LoadingDots tiny">, which is always hidden, so the invisibility condition always returns True and the loop never breaks.
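One way to see this ambiguity for yourself is to list every element the selector matches together with its class attribute and visibility, much like running `$$(".LoadingDots")` in the browser console. The helper below is a hypothetical sketch, not part of either answer. The string "css selector" is the value of Selenium's By.CSS_SELECTOR, used so the snippet needs no imports.

```python
def describe_matches(driver, css):
    """Return (class attribute, is_displayed) for every element matching `css`."""
    # "css selector" is the string value of selenium's By.CSS_SELECTOR
    elements = driver.find_elements("css selector", css)
    return [(el.get_attribute("class"), el.is_displayed()) for el in elements]


# Usage with a live driver (assumed to be on the page already):
# for cls, shown in describe_matches(driver, ".LoadingDots"):
#     print(cls, "visible" if shown else "hidden")
```

Conditions such as EC.invisibility_of_element_located look up a single element, the first match, so a selector that matches several elements needs to be narrowed until the first match is the one you care about.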

You need to check another element with the "LoadingDots" class name, <div class="LoadingDots regular">, and the logic should be as follows:

Scroll page down
Wait for loading dots to appear (start loading more content)
Wait for loading dots to disappear (loading more content is done)

If no dots appear after the page is scrolled, break the loop.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 5)
driver.get("https://www.quora.com/topic/American-Football")

while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    try:
        # Dots appear: more content has started loading.
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".LoadingDots.regular")))
    except TimeoutException:
        break  # no dots after scrolling: nothing more to load
    # Dots disappear: this batch of content has finished loading.
    wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, ".LoadingDots.regular")))

for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".ui_qtext_rendered_qtext .ui_qtext_para"))):
    print(item.text)

driver.quit()

BUT! Note that I've posted this script only to point out why your script is not working... It's not really reliable: if content loads too fast (the possibility is quite low, but...), the script might miss the moment when the loading dots appear, and you won't get all the required content.

So @Guy's solution seems to be more reliable (+1)
