I'm trying to do some web scraping. So far I have code that extracts values from one page and moves to the next page, but when I loop the process to do the same for all the other pages it returns an error. This is the code I have so far:

import time
import requests
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import json


driver = webdriver.Chrome(r'C:\DRIVERS\chromedriver.exe')  # raw string so the backslashes are not treated as escapes
driver.get('https://www.remax.pt/comprar?searchQueryState={%22regionName%22:%22%22,%22businessType%22:1,%22listingClass%22:1,%22page%22:1,%22sort%22:{%22fieldToSort%22:%22ContractDate%22,%22order%22:1},%22mapIsOpen%22:false}')
driver.maximize_window()
driver.implicitly_wait(5)
wait = WebDriverWait(driver, 10)
cookies = driver.find_element_by_id('rcc-decline-button')
cookies.click()

element_list = []
for j in range(1, 2569):
    try:
        for i in range(1,40,2):
            link = driver.find_element_by_xpath("(//div[@class='listing-search-searchdetails-component'])[{0}]".format(i))
            link.click()
            try:
                detalhes = driver.find_element_by_id('details')
                preço = driver.find_element_by_id('listing-price')
                tipo = driver.find_element_by_id('listing-title')
                freguesia = driver.find_element_by_xpath('//h5[@class="listing-address"]')
                imoveis = [detalhes.text, preço.text, tipo.text, freguesia.text]
                element_list.append(imoveis)
            finally:
                driver.back()
    finally:
        wait.until(EC.element_to_be_clickable((By.XPATH,"//a[@class='page-link'][.//span[.='Next']]"))).click()

All the values are scraped on the first page, but when it changes to the next page this error shows up:

ERROR:

---------------------------------------------------------------------------
StaleElementReferenceException            Traceback (most recent call last)
<ipython-input-7-052f5032275d> in <module>
     12         for i in range(1,40,2):
     13             link = driver.find_element_by_xpath("(//div[@class='listing-search-searchdetails-component'])[{0}]".format(i))
---> 14             link.click()
     15             try:
     16                 detalhes = driver.find_element_by_id('details')

~\anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py in click(self)
     78     def click(self):
     79         """Clicks the element."""
---> 80         self._execute(Command.CLICK_ELEMENT)
     81 
     82     def submit(self):

~\anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py in _execute(self, command, params)
    631             params = {}
    632         params['id'] = self._id
--> 633         return self._parent.execute(command, params)
    634 
    635     def find_element(self, by=By.ID, value=None):

~\anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py in execute(self, driver_command, params)
    319         response = self.command_executor.execute(driver_command, params)
    320         if response:
--> 321             self.error_handler.check_response(response)
    322             response['value'] = self._unwrap_value(
    323                 response.get('value', None))

~\anaconda3\lib\site-packages\selenium\webdriver\remote\errorhandler.py in check_response(self, response)
    240                 alert_text = value['alert'].get('text')
    241             raise exception_class(message, screen, stacktrace, alert_text)
--> 242         raise exception_class(message, screen, stacktrace)
    243 
    244     def _value_or_default(self, obj, key, default):

StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
  (Session info: chrome=90.0.4430.72)

What element is this error referring to?

1 Answer

I'm posting an improved version. However, I cannot say that I am completely satisfied with it. I tried at least three other options, but I could not click the Next button without executing JavaScript. I am leaving the options I tried commented out because I want you to see them.

import time
import requests
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import json

driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver')
driver.get(
    'https://www.remax.pt/comprar?searchQueryState={%22regionName%22:%22%22,%22businessType%22:1,%22listingClass%22:1,%22page%22:1,%22sort%22:{%22fieldToSort%22:%22ContractDate%22,%22order%22:1},%22mapIsOpen%22:false}')
driver.maximize_window()
driver.implicitly_wait(5)
wait = WebDriverWait(driver, 15)
cookies = driver.find_element_by_id('rcc-decline-button')
cookies.click()

element_list = []
for j in range(1, 2569):
    try:
        for i in range(1, 40, 2):
            wait.until(EC.element_to_be_clickable((By.XPATH, "(//div[@class='listing-search-searchdetails-component'])[{0}]".format(i))))
            link = driver.find_element_by_xpath(
                "(//div[@class='listing-search-searchdetails-component'])[{0}]".format(i))
            link.click()
            try:
                detalhes = driver.find_element_by_id('details')
                preco = driver.find_element_by_id('listing-price')
                tipo = driver.find_element_by_id('listing-title')
                freguesia = driver.find_element_by_xpath('//h5[@class="listing-address"]')
                imoveis = [detalhes.text, preco.text, tipo.text, freguesia.text]
                element_list.append(imoveis)
            finally:
                driver.find_element_by_css_selector(".modal-close-icon").click()
    finally:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        next_btn = driver.find_element_by_xpath("//a[@class='page-link'][.//span[.='Next']]")
        # next_btn.send_keys(Keys.PAGE_DOWN)
        # driver.execute_script("arguments[0].scrollIntoView();", next_btn)
        wait.until(EC.element_to_be_clickable((By.XPATH, "//a[@class='page-link'][.//span[.='Next']]/span")))
        # actions = ActionChains(driver)
        # actions.move_to_element(next_btn)
        # actions.click().perform()
        driver.execute_script("arguments[0].click();", next_btn)

Also, note that I have modified some of your code from the inside to make it more stable (I added a few waits and locators). Currently, it clicks the Next button.

You need to take it further. I mean you need to grab all the listings again on the new page and loop through them. For this you need to wait until the next page completely loads. I don't have an answer for that yet; it will require more time.
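One possible approach (just a sketch, not tested against this site) is to keep a reference to an element from the old page and wait for it to go stale after clicking Next, then re-grab the listings:

# sketch: detect the page change by waiting for an old element to detach
old_first = driver.find_element_by_xpath(
    "(//div[@class='listing-search-searchdetails-component'])[1]")
driver.execute_script("arguments[0].click();", next_btn)
wait.until(EC.staleness_of(old_first))  # the old listing goes stale on re-render
wait.until(EC.visibility_of_all_elements_located(
    (By.XPATH, "//div[@class='listing-search-searchdetails-component']")))
listings = driver.find_elements_by_xpath("//div[@class='listing-search-searchdetails-component']")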

Here is a great question about the difference between Selenium's click and the JS click: WebDriver click() vs JavaScript click()
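In short, the two look like this (illustrative only):

next_btn.click()  # native click: simulates a real user; fails if the element is covered or off-screen
driver.execute_script("arguments[0].click();", next_btn)  # JS click: fires the event directly, even if the element is not "clickable"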

The question is not completely answered yet; the script is just more stable now. Your click on the Next button almost never worked.

Update

After a few hours of trying numerous page-load approaches and other things, I found where the REAL problem is.

for i in range(1,40,2) was the biggest problem.

You tried to click on listing indexes that don't exist: range(1, 40, 2) goes up to index 39, but there are only about 21 listings per page. So I've changed it to for i in range(1, 20, 2), added one wait on the new page, and now everything works well. I'm leaving the debugging code in so everything is clear to you. Sorry, I have no more time to check what the list looks like, but it should be easy now.
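If you'd rather not hardcode the bound, here is a small sketch (my suggestion, untested) that derives it from the page itself:

listings = driver.find_elements_by_xpath(
    "//div[@class='listing-search-searchdetails-component']")
for i in range(1, len(listings) + 1, 2):  # XPath positions are 1-based
    wait.until(EC.element_to_be_clickable((By.XPATH,
        "(//div[@class='listing-search-searchdetails-component'])[{0}]".format(i))))

Here is the full script: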

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver')
driver.get(
    'https://www.remax.pt/comprar?searchQueryState={%22regionName%22:%22%22,%22businessType%22:1,%22listingClass%22:1,%22page%22:1,%22sort%22:{%22fieldToSort%22:%22ContractDate%22,%22order%22:1},%22mapIsOpen%22:false}')
driver.maximize_window()
driver.implicitly_wait(15)
wait = WebDriverWait(driver, 15)
cookies = driver.find_element_by_id('rcc-decline-button')
cookies.click()

element_list = []
for j in range(1, 2569):
    try:
        print("Searching Page " + str(j))
        wait.until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='listing-search-searchdetails-component']")))
        for i in range(1, 20, 2):
            wait.until(EC.element_to_be_clickable((By.XPATH, "(//div[@class='listing-search-searchdetails-component'])[{0}]".format(i))))
            el = driver.find_element_by_xpath("(//div[@class='listing-search-searchdetails-component'])[{0}]".format(i))
            print("Listing number " + str(i))
            link = driver.find_element_by_xpath(
                "(//div[@class='listing-search-searchdetails-component'])[{0}]".format(i))
            link.click()
            try:
                detalhes = driver.find_element_by_id('details')
                preco = driver.find_element_by_id('listing-price')
                tipo = driver.find_element_by_id('listing-title')
                freguesia = driver.find_element_by_xpath('//h5[@class="listing-address"]')
                imoveis = [detalhes.text, preco.text, tipo.text, freguesia.text]
                element_list.append(imoveis)
            finally:
                driver.find_element_by_css_selector(".modal-close-icon").click()
    finally:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        next_btn = driver.find_element_by_xpath("//a[@class='page-link'][.//span[.='Next']]")
        wait.until(EC.element_to_be_clickable((By.XPATH, "//a[@class='page-link'][.//span[.='Next']]/span")))
        driver.execute_script("arguments[0].click();", next_btn)
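To inspect the results, a minimal sketch (the column names are my assumption) that turns element_list into a DataFrame using the pandas import from your original script:

import pandas as pd

df = pd.DataFrame(element_list, columns=['detalhes', 'preco', 'tipo', 'freguesia'])
df.to_csv('remax_listings.csv', index=False)  # persist partial results as you go
print(df.head())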

P.S. What you have accomplished so far is already good.

  • Tomorrow I'll try to make more improvements if nobody answers. – vitaliis Apr 21 '21 at 02:12
  • Hello. Checked it today, but I can't see the values of the list before it finishes the scraping process? – jps17183 Apr 21 '21 at 21:27
  • And it seems to me that it only extracts some random houses – jps17183 Apr 21 '21 at 22:21
  • Hi @jps17183, I saw the suggested edit, but I do not observe the same error. My output is: Searching Page 1 1 3 5 7 9 11 13 15 17 19 Searching Page 2 1 3 5 7 9 11 13 15 17 19 Searching Page 3 – vitaliis Apr 21 '21 at 22:22
  • Yes, that's because of the step of 2 in range() – vitaliis Apr 21 '21 at 22:22
  • I'll be away from my PC for some time. Try again and write in the comments if there are issues – vitaliis Apr 21 '21 at 22:23
  • How could we make the time random but also within an interval? I'm afraid of bot detection, but if we used for instance random(range(15, 20)) it might generate a random amount of time at each action – jps17183 Apr 21 '21 at 22:30
  • It's a different question. Please accept my answer if your initial question was solved. – vitaliis May 03 '21 at 18:38