
I am new to web scraping, so please pardon any silly mistakes.

I am working on a project for which I need a list of movies as my data. I am trying to collect it from Wikipedia using web scraping.

Here is my code:

def MoviesList(years, driver):
    for year in years:
        driver.implicitly_wait(150)
        year.click()
        table = driver.find_element_by_xpath('/html/body/div[3]/div[3]/div[5]/div[1]/table[2]/tbody')
        movies = table.find_elements_by_xpath('tr/td[1]/i/a')
        for movie in movies:
            print(movie.text)
        driver.back()
years = driver.find_elements_by_partial_link_text('List of Bollywood films of')
del years[:2]
MoviesList(years, driver)

I get the list of years from this page and store it in the years variable. Then I loop through all the years and try to extract the top 10 movies of each year. See this for reference.

Output:

Tanhaji
Baaghi 3
...
...
Panga
# Top movies of the year 2020
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document (from line year.click())

Expected Output:

Tanhaji  
...
...
War  # First movie of the year 2019
Saaho
...
...
Vikram Urvashi  # Last movie of the year 1920
# Top movies of the year from 2020 to 1920

I have already referred to this and this question, but to no avail. I have tried an explicit wait too, but it didn't work.

I understand when this error occurs, but I don't know how to handle it other than by adding an implicit or explicit wait.

What am I doing wrong? How can I improve this code to get the desired output?

Any help would be much appreciated.

undetected Selenium
Harsh Dhamecha

2 Answers


To collect the data from the Wikipedia Lists of Bollywood films page using Selenium, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use the following locator strategies:

Note: As a demonstration, this program is restricted to collecting the movies from the Highest worldwide gross section for the previous three (3) years only.

  • Code Block:

    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
    options = webdriver.ChromeOptions() 
    options.add_argument("start-maximized")
    driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
    driver.get("https://en.wikipedia.org/wiki/Lists_of_Bollywood_films")
    parent_window  = driver.current_window_handle
    years = [my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.PARTIAL_LINK_TEXT, "List of Bollywood films of")))[2:5]]
    print(years)
    for year in years:
        # Open each year page in a new tab so the parent page and its links stay intact
        driver.execute_script("window.open('" + year +"')")
        WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
        windows_after = driver.window_handles
        new_window = [x for x in windows_after if x != parent_window][0]
        # Switch focus to the newly opened tab
        driver.switch_to.window(new_window)
        print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//table/caption//following::tbody[1]//td/i/a")))])
        # Close the tab and return to the parent page before the next year
        driver.close()
        driver.switch_to.window(parent_window)
    driver.quit()
    
  • Console Output:

    ['Tanhaji', 'Baaghi 3', 'Street Dancer 3D', 'Shubh Mangal Zyada Saavdhan', 'Malang', 'Chhapaak', 'Love Aaj Kal', 'Jawaani Jaaneman', 'Thappad', 'Panga']
    ['War', 'Saaho', 'Kabir Singh', 'Uri: The Surgical Strike', 'Bharat', 'Good Newwz', 'Mission Mangal', 'Housefull 4', 'Gully Boy', 'Dabangg 3']
    ['Sanju', 'Padmaavat', 'Andhadhun', 'Simmba', 'Thugs of Hindostan', 'Race 3', 'Baaghi 2', 'Hichki', 'Badhaai Ho', 'Pad Man']
    

References

You can find a couple of relevant detailed discussions in:

undetected Selenium
  • I have gone through the references. One great takeaway for me is to use the new window whenever dealing with multiple links. Big Thanks for it!. How did you figure out the XPath for movies like `//table/caption//following::tbody[1]//td/i/a`? – Harsh Dhamecha Jan 09 '21 at 07:27
  • Your code is running fine for 3 years(from 2020 to 2018), but when I tried to change the index and looked for more years, it just got stuck in the year 2017 page and I got the movies of years from 2020 to 2018 and then `TimeoutException`. How can I solve this problem? – Harsh Dhamecha Jan 09 '21 at 07:31
def MoviesList(linktext, driver):
    count = 0
    while True:
        # Re-locate the year links on every pass; the old references
        # go stale once the browser navigates away and back
        years = driver.find_elements_by_partial_link_text(linktext)
        del years[:2]
        # Stop once every year link has been visited
        if count >= len(years):
            break
        year = years[count]
        count += 1
        driver.implicitly_wait(150)
        year.click()
        table = driver.find_element_by_xpath('/html/body/div[3]/div[3]/div[5]/div[1]/table[2]/tbody')
        movies = table.find_elements_by_xpath('tr/td[1]/i/a')
        for movie in movies:
            print(movie.text)
        driver.back()


MoviesList('List of Bollywood films of', driver)

You should always find the years again: clicking a year navigates to another page, which modifies the DOM. Whenever the DOM (the page's HTML) is modified, you have to find all the elements again, because the previous references are lost. This is why you get the stale element exception.
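As a variation on the same idea, here is a minimal sketch that avoids holding element references across page loads at all: it reads every year link's href up front (plain strings cannot go stale) and then visits each URL with driver.get() instead of click() and back(). The function name collect_movies and its skip parameter are hypothetical, and the two locator strings are assumed to match the values behind Selenium's By.PARTIAL_LINK_TEXT and By.XPATH constants.

```python
# Locator strategy strings; assumed equal to selenium's
# By.PARTIAL_LINK_TEXT and By.XPATH constants.
PARTIAL_LINK_TEXT = "partial link text"
XPATH = "xpath"


def collect_movies(driver, link_text, movie_xpath, skip=2):
    """Return {year_url: [movie titles]} without reusing WebElements
    after a navigation. Hypothetical helper, not part of Selenium."""
    links = driver.find_elements(PARTIAL_LINK_TEXT, link_text)[skip:]
    # Copy the hrefs out while the elements are still attached to the DOM.
    urls = [link.get_attribute("href") for link in links]

    results = {}
    for url in urls:
        # Navigate directly; no click() on an old element, no back().
        driver.get(url)
        # These elements belong to the page that was just loaded.
        cells = driver.find_elements(XPATH, movie_xpath)
        results[url] = [cell.text for cell in cells]
    return results
```

It could be called as collect_movies(driver, 'List of Bollywood films of', '//table[2]/tbody/tr/td[1]/i/a'), where the XPath is only a guess at the table layout and may need adjusting for each year's page.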

PDHide