I'm trying to scrape some data from IMDb (with Selenium in Python), but I have a problem. For each movie I have to fetch the directors and the writers. Both are contained in tables, and the two tables have the same @class. I need to distinguish the two tables when I scrape; otherwise the program could sometimes fetch a writer as a director and vice versa.
I've tried using a relative XPath to find all elements (tables) with that class and then looping over them, trying to distinguish them through the table title (an h4 element) using the preceding-sibling axis. The code runs, but it does not find anything (every time it returns nan).
This is my code:
counter = 1
try:
    driver.get('https://www.imdb.com/title/' + tt + '/fullcredits/?ref_=tt_cl_sm')
    ssleep()
    tables = driver.find_elements(By.XPATH, '//table[@class="simpleTable simpleCreditsTable"]/tbody')
    counter = 1
    for table in tables:
        xpath_table = f'//table[@class="simpleTable simpleCreditsTable"]/tbody[{counter}]'
        xpath_h4 = xpath_table + "/preceding-sibling::h4[1]/text()"
        table_title = driver.find_element(By.XPATH, xpath_h4).text
        if table_title == "Directed by":
            rows_director = table.find_elements(By.CSS_SELECTOR, 'tr')
            for row in rows_director:
                director = row.find_elements(By.CSS_SELECTOR, 'a')
                director = [x.text for x in director]
                if len(director) == 1:
                    director = ''.join(map(str, director))
                else:
                    director = ', '.join(map(str, director))
                director_list.append(director)
        counter += 1
except NoSuchElementException:
    # director = np.nan
    director_list.append(np.nan)
Can any of you tell me why it doesn't work? Perhaps there is a better solution. I hope for your help.
(here you can find an example of the page I need to scrape: https://www.imdb.com/title/tt1877830/fullcredits?ref_=tt_cl_sm)
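
One direction that might be worth trying, sketched here under assumptions rather than verified against the live page: start from the h4 title and walk forward to the table that follows it, instead of starting from the tbody and walking backward. This assumes each credits table is a following sibling of its h4 heading, and that the heading text contains the title (possibly with trailing whitespace). The helper names section_links_xpath and names_for_section are made up for illustration. Note also that an XPath ending in /text() selects a text node, which Selenium cannot return as an element, so the sketch selects elements only.

```python
def section_links_xpath(title):
    """Build an XPath for the <a> elements of the table that follows a given <h4>.

    Assumption: the credits table is the first <table> sibling after the
    <h4> whose text contains `title` (for example "Directed by").
    """
    return (
        f'//h4[contains(normalize-space(.), "{title}")]'
        '/following-sibling::table[1]//a'
    )


def names_for_section(driver, title):
    """Return the link texts of one credits section as a comma-separated string.

    `driver` is expected to be a Selenium WebDriver that has already loaded
    the full-credits page. An empty string means nothing was found.
    """
    # "xpath" is the string value of Selenium's By.XPATH constant.
    links = driver.find_elements("xpath", section_links_xpath(title))
    names = [link.text.strip() for link in links]
    return ", ".join(name for name in names if name)
```

With this, names_for_section(driver, "Directed by") and names_for_section(driver, "Writing Credits") would each look only at their own table, so a writer should not end up in the director list. Unlike the loop above, this joins every name in a section into a single string instead of appending one entry per row.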