
I want to scrape data from a web page that is constantly changing (new posts every couple of seconds). I'm calling driver.get() in a while loop, but after a couple of iterations I stop getting new results: it returns the same posts over and over. I'm sure the page is changing (I checked in the browser).

I tried using time.sleep() and driver.refresh(), but the problem persists.

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    chrome_options = Options()
    chrome_options.add_argument("--headless")
    driver = webdriver.Chrome(chrome_options=chrome_options, executable_path=self.cp.getSeleniumDriverPath())

    while True:
        driver.get(url)
        html = driver.page_source
        soup = BeautifulSoup(html, 'html.parser')
        posts = soup.find_all(class_="some-class")  # placeholder for the actual post class

        # (...)
        # some logic with the result
        # (...)

        driver.refresh()  # tried interchangeably with driver.get() at the beginning of the loop

As far as I know, driver.get() should wait for the page to load before executing the next line of code. Maybe I did something wrong language-wise (I'm pretty new to Python). Should I reset some attribute of the driver on every loop iteration? I've seen solutions that use driver.get() in a loop like this, but it isn't working in my case. How do I force the driver to fully refresh the page before scraping it?

Maciej K
  • Why are you using Selenium with BeautifulSoup? Why not just use the requests module to retrieve the page? (See the sketch after these comments.) – dataviews May 08 '19 at 20:36
  • You also don't need the while loop. Python executes code line by line, so your logic would come after the page is retrieved – dataviews May 08 '19 at 20:39
  • replace driver.refresh() with driver.quit() – dataviews May 08 '19 at 20:44
  • @dataviews I have used both before. `selenium` loads the javascript, but `BeautifulSoup` I find easier to process all my elements with. – Reedinationer May 08 '19 at 20:45
  • Welcome to SO. Are you getting any error message? – supputuri May 08 '19 at 20:58
  • 1
    @dataviews Not sure why the recommendation of `driver.quit()` as that line will close the browser and session. OP has to loop iterate multiple times. However, OP has to make sure to break/exist the loop with some logic. – supputuri May 08 '19 at 21:02
  • @supputuri I'm not getting any errors. It's just that the program doesn't act like I want it to, and I don't know where I told it to act this way. As for the loop: I know it's not the right way, but at this moment I want it to loop like that – Maciej K May 08 '19 at 21:08
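
If the posts are already present in the server-rendered HTML (i.e. not injected by JavaScript), the requests-based approach suggested in the comments might look roughly like this sketch; the URL and class name are placeholders:

    import time

    import requests
    from bs4 import BeautifulSoup

    url = "https://example.com/posts"  # placeholder URL

    while True:
        resp = requests.get(url)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        posts = soup.find_all(class_="some-class")  # placeholder for the actual post class

        # ... some logic with the result ...

        time.sleep(2)  # the page updates every couple of seconds, so poll at that pace

Note this only works if the posts appear in the raw HTML response; if they are rendered by JavaScript, Selenium (or another browser-driven approach) is still needed.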

2 Answers


Selenium will throw errors if the page is still loading when you try to send commands to the window. You should add a time.sleep() or, better, one of Selenium's explicit waits to make sure the page is ready to be processed. Something like:

    import time

    while True:
        driver.get(url)
        html = driver.page_source
        soup = BeautifulSoup(html, 'html.parser')
        posts = soup.find_all(class_="some-class")  # placeholder for the actual post class

        # (...)
        # some logic with the result
        # (...)

        driver.refresh()
        time.sleep(5)  # probably too long, but I usually try to stay on the safe side

The best option would probably be to use something like

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )

This makes sure the element is present without forcing a fixed 5-second wait: if the element you want is there in .0001 seconds, your script continues right after that. This lets you make the timeout arbitrarily large (say, 120 seconds) without impacting your execution speed.
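
Putting that together with the loop from the question, a minimal sketch might look like this; the By.CLASS_NAME locator and the "post" class are placeholders for whatever actually identifies a post on the page, and driver and url are assumed to be set up as in the question:

    from bs4 import BeautifulSoup
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    while True:
        driver.get(url)
        # block until at least one post element is present, up to 10 seconds
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "post"))  # placeholder locator
        )
        soup = BeautifulSoup(driver.page_source, "html.parser")
        posts = soup.find_all(class_="post")  # placeholder for the actual post class

        # ... some logic with the result ...

Because driver.get() reloads the page on every iteration, the separate driver.refresh() call becomes redundant here.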

Reedinationer

I'm guessing your Chrome webdriver is caching. Try adding driver.delete_all_cookies() (the Python equivalent of Java's driver.manage().deleteAllCookies()) before getting the page.
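
As a sketch of where that fits into the loop from the question (note that delete_all_cookies() only clears cookies; it does not empty Chrome's disk cache):

    while True:
        driver.delete_all_cookies()  # clears session cookies only, not the disk cache
        driver.get(url)
        html = driver.page_source
        # ... parse and process as before ...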

Frederik Bode
  • I tried your solution and it worked for a moment. Then it began to get older versions of the page (there are about 50 posts per page and I was getting older posts). When I restart the program it gets the latest posts, but later it breaks like I mentioned – Maciej K May 08 '19 at 21:06
  • maybe if you use `options.add_argument("--incognito")`? Your browser shouldn't cache anymore in incognito mode. – Frederik Bode May 08 '19 at 21:22
  • It works a lot better than before. I'm still getting older versions every 2-4 loop runs, but this is the best result I've got so far – Maciej K May 08 '19 at 21:39
  • And are you sure the page is publishing new posts faster than you can process them? Otherwise it would be logical that you get old posts. This Stack Overflow post (https://stackoverflow.com/questions/8009823/possible-to-disable-firefox-and-chrome-default-caching) suggests switching to a Firefox driver and disabling all its caches; you could try that, but your application might require the Chrome driver. – Frederik Bode May 09 '19 at 08:20
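
For reference, combining the suggestions from this thread: the driver could be launched in incognito mode with the disk cache turned off. --incognito and --disk-cache-size=0 are standard Chromium command-line switches, though there is no guarantee that every caching layer honors them:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--incognito")          # incognito profiles don't reuse the persistent cache
    chrome_options.add_argument("--disk-cache-size=0")  # ask Chromium to keep the disk cache empty

    driver = webdriver.Chrome(chrome_options=chrome_options)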