
I have some code that goes through the cast list of a show or movie on Wikipedia, scraping all the actors' names and storing them. The current code finds every <a> tag in the list and stores its title attribute. It currently goes:

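import requests
from bs4 import BeautifulSoup

URL = input()
website_url = requests.get(URL).text
soup = BeautifulSoup(website_url, 'lxml')
# The "Cast" heading is a <span id="Cast"> inside the section's heading element
section = soup.find('span', id='Cast').parent

Stars = []
# Collect the title attribute of every link in the first list after the heading
for x in section.find_next('ul').find_all('a'):
    title = x.get('title')
    print(title)
    if title is not None:
        Stars.append(title)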

While this partially works there are two downsides:

  1. It doesn't work if the actor doesn't have a Wikipedia page hyperlink.
  2. It also scrapes any other hyperlink title it finds. e.g. https://en.wikipedia.org/wiki/Indiana_Jones_and_the_Kingdom_of_the_Crystal_Skull returns ['Harrison Ford', 'Indiana Jones (character)', 'Bullwhip', 'Cate Blanchett', 'Irina Spalko', 'Bob cut', 'Rosa Klebb', 'From Russia with Love (film)', 'Karen Allen', 'Marion Ravenwood', 'Ray Winstone', 'Sallah', 'List of characters in the Indiana Jones series', 'Sexy Beast', 'Hamstring', 'Double agent', 'John Hurt', 'Ben Gunn (Treasure Island)', 'Treasure Island', 'Courier', 'Jim Broadbent', 'Marcus Brody', 'Denholm Elliott', 'Shia LaBeouf', 'List of Indiana Jones characters', 'The Young Indiana Jones Chronicles', 'Frank Darabont', 'The Lost World: Jurassic Park', 'Jeff Nathanson', 'Marlon Brando', 'The Wild One', 'Holes (film)', 'Blackboard Jungle', 'Rebel Without a Cause', 'Switchblade', 'American Graffiti', 'Rotator cuff']

Is there a way I can get BeautifulSoup to scrape only the first two words after each <li>? Or is there a better solution for what I am trying to do?

  • `x.get('title')` returns a string when the attribute is present, so you can just split() it, pick only the first two "words", then join() them. E.g., `title = ' '.join(title.split(' ')[:2])`. (Feb 06 '21 at 18:58)
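
A minimal sketch of the split/join idea from the comment above, with a guard for links that have no title attribute. The helper name `first_two_words` is made up for illustration. Note that this only shortens each title; it does not filter out links that are not actors.

```python
def first_two_words(title):
    """Return the first two whitespace-separated words of title, or None."""
    # x.get('title') returns None when the <a> tag has no title attribute
    if title is None:
        return None
    return ' '.join(title.split()[:2])


# Sample titles taken from the output shown in the question, plus a missing one
titles = ['Harrison Ford', 'Indiana Jones (character)',
          'From Russia with Love (film)', None]
shortened = [first_two_words(t) for t in titles]
print(shortened)
# ['Harrison Ford', 'Indiana Jones', 'From Russia', None]
```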

3 Answers


You can use CSS selectors to grab only the first <a> that is a direct child of each <li>:

for x in section.find_next('ul').select('li > a:nth-of-type(1)'):

Example

import requests
from bs4 import BeautifulSoup

URL = 'https://en.wikipedia.org/wiki/Indiana_Jones_and_the_Kingdom_of_the_Crystal_Skull#Cast'
website_url = requests.get(URL).text
soup = BeautifulSoup(website_url, 'lxml')
section = soup.find('span', id='Cast').parent

Stars = []
# Only the first <a> directly inside each <li> is selected
for x in section.find_next('ul').select('li > a:nth-of-type(1)'):
    Stars.append(x.get('title'))
Stars

Output

['Harrison Ford',
 'Cate Blanchett',
 'Karen Allen',
 'Ray Winstone',
 'John Hurt',
 'Jim Broadbent',
 'Shia LaBeouf']
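
To see what the selector does without hitting the network, here is a small sketch on a hand-written HTML fragment. The markup is an assumption that only loosely imitates a Wikipedia cast list, and the unlinked "Jane Doe" entry is invented for illustration. It uses `html.parser` so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two linked actors and one actor without a link
html = """
<ul>
  <li><a title="Harrison Ford">Harrison Ford</a> as
      <a title="Indiana Jones (character)">Indiana Jones</a>,
      who carries a <a title="Bullwhip">bullwhip</a>.</li>
  <li><a title="Cate Blanchett">Cate Blanchett</a> as
      <a title="Irina Spalko">Irina Spalko</a></li>
  <li>Jane Doe as Extra</li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')
cast_list = soup.find('ul')

# First <a> that is a direct child of each <li>; later links are ignored
first_links = [a.get('title') for a in cast_list.select('li > a:nth-of-type(1)')]
print(first_links)
# ['Harrison Ford', 'Cate Blanchett']
```

The unlinked entry is skipped here, so on markup like this the selector removes the extra hyperlinks but still misses actors who have no Wikipedia page.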
HedgeHog

You can use a regex to fetch all the capitalized word pairs from the text content of each <li> and keep only the first match. Because it works on the text rather than the links, it also handles the case where the actor doesn't have a Wikipedia page hyperlink.

import re
# li_text is a placeholder for the text content of one <li>, e.g. li.get_text()
re.findall(r"([A-Z][a-z]+) ([A-Z][a-z]+)", li_text)

Example:

text = "Cate Blanchett as Irina Spalko, a villainous Soviet agent. Screenwriter David Koepp created the character."
re.findall("([A-Z]{1}[a-z]+) ([A-Z]{1}[a-z]+)",text)

Output:
[('Cate', 'Blanchett'), ('Irina', 'Spalko'), ('Screenwriter', 'David')]
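
Since only the first pair is wanted, `re.search` can stand in for `re.findall`. The sketch below applies that to a few sample strings; the helper name `leading_name` and the sample texts are assumptions for illustration, not scraped data. It also shows a limitation of the pattern: a name with an internal capital letter is cut short.

```python
import re

# Two consecutive capitalized words, as in the answer above
NAME_PATTERN = re.compile(r"([A-Z][a-z]+) ([A-Z][a-z]+)")


def leading_name(li_text):
    """Return the first 'Word Word' pair in li_text, or None if there is none."""
    match = NAME_PATTERN.search(li_text)
    return ' '.join(match.groups()) if match else None


items = [
    "Cate Blanchett as Irina Spalko, a villainous Soviet agent.",
    "Shia LaBeouf as Mutt Williams",
    "uncredited extras",
]
names = [leading_name(text) for text in items]
print(names)
# ['Cate Blanchett', 'Shia La', None]
```

"Shia LaBeouf" comes back as "Shia La" because `[a-z]+` stops at the second capital letter, so the pattern would need loosening for names like that, as well as for single-word or three-word names.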

Rahul Arora

There is considerable variation in the HTML of the cast sections across film articles on Wikipedia. Perhaps look to an API to get this info instead?

E.g. imdb8 allows a reasonable number of calls, which you could use with the following endpoint:

https://imdb8.p.rapidapi.com/title/get-top-cast

There also seems to be a Python IMDb API.

Or choose a source with more regular HTML. For example, if you put the IMDb film ids in a list, you can extract both the full cast and the main actors from IMDb as follows. To get the shorter cast list, I filter out the row labelled "Rest of cast listed alphabetically:" and every row after it.


import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

movie_ids = ['tt0367882', 'tt7126948']
base = 'https://www.imdb.com'

# Reuse one session (and its connection) for all requests
with requests.Session() as s:

    for movie_id in movie_ids:
        link = f'{base}/title/{movie_id}/fullcredits?ref_=tt_cl_sm'
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        print(soup.select_one('title').text)
        # Full cast: every name link in the cast table that contains a headshot <img>
        full_cast = [(i.img['title'], base + i['href']) for i in soup.select('.cast_list [href*=name]:has(img)')]
        # Main cast: skip the "Rest of cast..." label row and all rows after it.
        # Newer soupsieve versions prefer :-soup-contains() over :contains().
        main_cast = [(i.img['title'], base + i['href']) for i in soup.select('.cast_list tr:not(:has(.castlist_label:contains(cast)) ~ tr, :has(.castlist_label:contains(cast))) [href*=name]:has(img)')]
        df_full = pd.DataFrame(full_cast, columns=['Actor', 'Link'])
        df_main = pd.DataFrame(main_cast, columns=['Actor', 'Link'])
        # print(df_full)
        print(df_main)
QHarr