
I am trying to scrape the results framework information for several projects on the World Bank's site. I am using Scrapy but am open to using Selenium.

Link: (https://projects.worldbank.org/en/projects-operations/project-detail/P153012)

The problems I am facing are:

  1. The tables are dynamically generated, and for some projects they are completely missing or have fewer fields (so I can't use Scrapy, as I don't know how to deal with JavaScript in Scrapy).

  2. With Selenium I am using the code below, but it only lets me extract all the text at once, not individual cell items (can that be done, or am I on a fool's errand?):

from selenium import webdriver

url = "https://projects.worldbank.org/en/projects-operations/project-detail/P153012"
driver = webdriver.Chrome(executable_path = "/Users/thenewcomputer/Downloads/chromedriver")
driver.get(url)
tables = driver.find_elements_by_class_name("ng-tns-c7-3")
for table in tables:
    title = table.find_elements_by_xpath('//*[@id="results"]/div/div/div[2]/div/div[1]/div/div/ul/li/table')
title
for x in title:
    print(x.text) #because i wanted to figure out if this was working correctly

Do let me know if there is an easier way of doing this. Thanks in advance.

undetected Selenium
Anikan

1 Answer


To print the text of a table, for example the one under the Results Framework heading, you need to induce WebDriverWait for visibility_of_element_located(), and you can use the following XPath-based Locator Strategy:

  • Code Block:

    driver.get("https://projects.worldbank.org/en/projects-operations/project-detail/P153012")
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h2[starts-with(., 'Results Framework')]//following::div[1]//ul//table"))).text)
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
  • Console Output:

    Increase of Municipality of Fortaleza own-source revenue capacity through planning and land-value capture instruments Value 0 - 17.75% increase in property tax revenues- 546.2% increase in PMF's revenues through Fortaleza online- 171.90% increase inSEUMA's revenues collected from use of urban instruments - 20% increase in property tax revenues- 100% increase in PMF's revenues through Fortaleza Online- 115% increase in SEUMA'srevenues collected from use of urban instruments
    Date August 1, 2016 April 28, 2021 June 30, 2023
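
To get each cell as a separate value instead of one block of text, one option is to take the table's HTML and split it into rows and cells. The following is only a minimal sketch using Python's standard-library html.parser. The names TableCellParser and table_rows are hypothetical helpers, the sample markup is made up for illustration and is not the real page's HTML, and the sketch assumes a simple table with no nested tables.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_rows(table_html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableCellParser()
    parser.feed(table_html)
    parser.close()
    return parser.rows


# Made-up sample markup (an assumption, not copied from the World Bank page).
sample = """
<table>
  <tr><th>Indicator</th><th>Baseline</th><th>Target</th></tr>
  <tr><td>Value</td><td>0</td><td>20% increase</td></tr>
  <tr><td>Date</td><td>August 1, 2016</td><td>June 30, 2023</td></tr>
</table>
"""

for row in table_rows(sample):
    print(row)
```

With Selenium, the HTML string could come from calling get_attribute("outerHTML") on the element returned by the WebDriverWait call above, instead of reading .text. For projects where the table is missing, WebDriverWait(...).until(...) raises selenium.common.exceptions.TimeoutException, so that call can be wrapped in try/except to skip such pages.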
    


undetected Selenium
  • Hi, thanks, this helped a lot for the other tables, but for the results framework table (the last table on the page) I can't find the h3 tag. Also, given that this is printed output, is there a way to separate each cell's output, or would that only be possible with regex? – Anikan Dec 27 '21 at 16:07
  • @Anikan Can you raise a new question for your new requirement please? – undetected Selenium Dec 27 '21 at 16:08
  • Hi Debanjan, but the original query was also just about the "results framework" information. Thanks for all your help, though. – Anikan Dec 27 '21 at 16:26
  • @Anikan Check out the updated answer and let me know the status. – undetected Selenium Dec 27 '21 at 16:37