
I want to get the data inside a loop over the divs, so that the values end up in the correct rows. I also want data from the entire page, not only from the visible part of the page.

  1. How can I get Firm_name, Remediation_status, ... from div[@class='sc-kbGplQ bCRLdc']?
  2. The code below gives fewer than 20 rows, while there are 1800+ firms in total. How can I scroll the page and get data from the entire page? Thanks in advance.
ruby = driver.find_elements(By.XPATH, "//div[@class='sc-kbGplQ bCRLdc']")
for i in ruby:    
#    actions.move_to_element(i).perform()
    driver.execute_script("arguments[0].scrollIntoView();", i)
    time.sleep(INTERVAL)    
    

    try:
        Firm_name = [Firm_name.text for Firm_name in i.find_elements(By.XPATH, "//div[1]/h2[@class='sc-idjmjb jDJltL']")]        
        Remediation_status = [Remediation_status.text for Remediation_status in i.find_elements(By.XPATH, "//div[1]/span[2][@class='sc-iKpIOp iKvkEG']")]
        Safety_training = [Safety_training.text for Safety_training in i.find_elements(By.XPATH, "//div[2]/span[2][@class = 'sc-iKpIOp iKvkEG']" )]
        Worker_number = [Worker_number.text for Worker_number in i.find_elements(By.XPATH, "//div[1]/h2[@class='sc-bsVVwV gnfeLF']")]
        Progress_rate = [Progress_rate.text for Progress_rate in i.find_elements(By.XPATH, "//div[2]/h2[@class= 'sc-bsVVwV gnfeLF']")]        
    except:
        print("na")
#driver.execute_script("window.scrollBy(0,500)","")
time.sleep(INTERVAL)
df1 = pd.DataFrame(data=list(zip(Firm_name, Remediation_status, Safety_training, Progress_rate, Worker_number)), columns=['Firm_name', 'Remediation_status', 'Safety_training', 'Progress_rate', 'Worker_number'])
df1.to_csv('namefirm.csv')

Ruby
    Could you add a link to the webpage you're scraping from? – Oxin Mar 04 '23 at 13:08
  • Does this answer your question? [Scrolling to element using webdriver?](https://stackoverflow.com/questions/41744368/scrolling-to-element-using-webdriver) – JeffC Mar 04 '23 at 16:35
  • @JeffC Thank you for the link. The command works for short scrolling but does not go farther down the page. – Ruby Mar 05 '23 at 08:10
  • The `scrollIntoView()` portion of that answer is the better option. – JeffC Mar 05 '23 at 15:44
  • @Oxin Here is the webpage link https://bangladeshaccord.org/factories – Ruby Mar 06 '23 at 01:33
  • @JeffC I edited the code, in case you would like to see it. Neither command can keep the page loading for long. For example, the current code ran error-free but gave me only 36 firms, while the total number of firms on the page is 1823. Sometimes, after running a bit longer, it shows "Debugging connection was closed. Reason: Render process gone." and the webpage disappears. – Ruby Mar 06 '23 at 02:49

1 Answer


When elements might be missing from the page, as is the case for Worker_number here, it is better to use execute_script (i.e. JavaScript) than find_element or find_elements, because the JavaScript call simply returns None when the element is not in the page. By contrast, find_elements returns an empty list, so you have to add code to check whether the list is empty, and find_element raises an exception, so you have to wrap it in a try-except block.
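To make that difference concrete, the document.evaluate call can be wrapped in a small helper. This is a sketch; the function name `get_text_or_none` is mine, not from the original code:

```python
def get_text_or_none(driver, root, xpath):
    """Evaluate `xpath` relative to `root` via JavaScript and return the
    matched node's innerText, or None when nothing matches.

    Unlike find_element (which raises NoSuchElementException) or
    find_elements (which returns an empty list), no error handling or
    emptiness check is needed at the call site.
    """
    js = ('return document.evaluate(arguments[0], arguments[1], null, '
          'XPathResult.FIRST_ORDERED_NODE_TYPE, null)'
          '.singleNodeValue?.innerText;')
    return driver.execute_script(js, xpath, root)
```

A call such as `get_text_or_none(driver, factory, ".//h2")` then yields either the firm name or None, and missing fields never interrupt the loop.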

The problem with scrolling to load new elements is that, with hundreds of elements on the page, the browser consumes a lot of RAM and may freeze. A workaround is to remove already-scraped elements from the HTML instead of scrolling past them. An element can be removed with driver.execute_script('var element = arguments[0]; element.remove();', element).
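That scrape-then-detach pattern can be sketched as a small helper (the function name and the `scrape` callback are placeholders of mine):

```python
def scrape_and_detach(driver, elements, scrape):
    """Apply `scrape` to each element, then remove the element from the
    HTML so the browser never keeps hundreds of rendered nodes alive."""
    results = []
    for element in elements:
        results.append(scrape(element))
        # Detach the node from the DOM once its data has been extracted
        driver.execute_script(
            'var element = arguments[0]; element.remove();', element)
    return results
```

Because scraped cards disappear, the site keeps rendering fresh ones in their place without the DOM ever growing.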

As a final suggestion, use a dictionary instead of a list to store scraped data.

import time
from selenium.webdriver.common.by import By

data = {key: [] for key in ['Firm_name', 'Remediation_status', 'Safety_training', 'Progress_rate', 'Worker_number']}
# XPath lookup via JavaScript: returns the node's innerText, or None if absent
js = ('return document.evaluate(arguments[0], arguments[1], null, '
      'XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue?.innerText;')
max_wait = 9  # seconds

while True:

    # Wait for new factory cards to render (previously scraped ones were removed)
    factories = []
    start = time.time()
    while len(factories) < 2:
        factories = driver.find_elements(By.CSS_SELECTOR, "#factories>div+div+div>div>div>div+div>div+div+div>div")
        if time.time() - start > max_wait:
            print('no new factories')
            start = -1
            break

    if start < 0:
        break

    for factory in factories:

        # Each lookup yields None when the field is missing, so no try-except is needed
        data['Firm_name']          += [driver.execute_script(js, ".//h2", factory)]
        data['Remediation_status'] += [driver.execute_script(js, ".//p[contains(.,'Remediation Status')]/span[2]", factory)]
        data['Safety_training']    += [driver.execute_script(js, ".//p[contains(.,'Safety Training Program')]/span[2]", factory)]
        data['Worker_number']      += [driver.execute_script(js, ".//h2[contains(.,'Workers')]/following-sibling::h2", factory)]
        data['Progress_rate']      += [driver.execute_script(js, ".//h2[contains(.,'Progress Rate')]/following-sibling::h2", factory)]

        # Remove the scraped card from the DOM to keep memory usage low
        driver.execute_script('var element = arguments[0]; element.remove();', factory)
        print(f"{len(data['Firm_name'])} factories scraped", end='\r')

Execution

(screenshot of the console progress output omitted)

Then run pd.DataFrame(data) to get the resulting table.
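The final export step can then be as simple as the following sketch (the values here are dummy stand-ins for the scraped data; `index=False` keeps the row index out of the CSV):

```python
import pandas as pd

# Dummy stand-ins for the scraped values, matching the dict built above
data = {'Firm_name': ['Firm A', 'Firm B'],
        'Remediation_status': ['On Track', 'Behind Schedule'],
        'Safety_training': ['Completed', 'In Progress'],
        'Progress_rate': ['91%', '74%'],
        'Worker_number': ['1200', '850']}

df = pd.DataFrame(data)           # one column per key, one row per factory
df.to_csv('namefirm.csv', index=False)
```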

sound wave
  • Thank you very much. I am sorry for replying late. I remember it worked well before. But I just ran into another issue: a "Your connection is not private" page does not let the code run. I used the following to get rid of this warning: options = webdriver.ChromeOptions(); options.add_argument('--ignore-ssl-errors=yes'); options.add_argument('--ignore-certificate-errors'); options.add_argument("--allow-running-insecure-content"); driver = webdriver.Chrome(options=options). And now it breaks after scraping 20 factories. – Ruby Apr 17 '23 at 05:21
  • @Ruby Try with undetected selenium and let me know – sound wave Apr 18 '23 at 20:28
  • thank you. I used the following without any success 'import undetected_chromedriver as uc options = uc.ChromeOptions() options.add_argument('--ignore-ssl-errors=yes') options.add_argument('--ignore-certificate-errors') options.add_argument("--allow-running-insecure-content") driver = webdriver.Chrome(options=options) driver.maximize_window() URL = "https://bangladeshaccord.org/factories" driver.get(URL) ' – Ruby Apr 19 '23 at 01:36