I have some code that goes through the cast list of a show or movie on Wikipedia. Scraping all the actor's names and storing them. The current code I have finds all the <a>
in the list and stores their title tags. It currently goes:
from bs4 import BeautifulSoup
URL = input()
website_url = requests.get(URL).text
section = soup.find('span', id='Cast').parent
Stars = []
for x in section.find_next('ul').find_all('a'):
title = x.get('title')
print (title)
if title is not None:
Stars.append(title)
else:
continue
While this partially works there are two downsides:
- It doesn't work if the actor doesn't have a Wikipedia page hyperlink.
- It also scrapes any other hyperlink title it finds. e.g. https://en.wikipedia.org/wiki/Indiana_Jones_and_the_Kingdom_of_the_Crystal_Skull returns
['Harrison Ford', 'Indiana Jones (character)', 'Bullwhip', 'Cate Blanchett', 'Irina Spalko', 'Bob cut', 'Rosa Klebb', 'From Russia with Love (film)', 'Karen Allen', 'Marion Ravenwood', 'Ray Winstone', 'Sallah', 'List of characters in the Indiana Jones series', 'Sexy Beast', 'Hamstring', 'Double agent', 'John Hurt', 'Ben Gunn (Treasure Island)', 'Treasure Island', 'Courier', 'Jim Broadbent', 'Marcus Brody', 'Denholm Elliott', 'Shia LaBeouf', 'List of Indiana Jones characters', 'The Young Indiana Jones Chronicles', 'Frank Darabont', 'The Lost World: Jurassic Park', 'Jeff Nathanson', 'Marlon Brando', 'The Wild One', 'Holes (film)', 'Blackboard Jungle', 'Rebel Without a Cause', 'Switchblade', 'American Graffiti', 'Rotator cuff']
Is there a way I can get BeautifulSoup to scrape the first two Words after each <li>
? Or even a better solution for what I am trying to do?