I'm working on a web scraper function that gets movie data from IMDb. I can get the data I need into a dictionary, but because of how I've written the function, each dictionary value ends up as a list of lists. I would like a single flat list instead.

So right now I'm getting dict_key: [[A,B,C,D],[E,F,G,H],...] and I want dict_key: [A,B,C,D,E,F,G,H].

Eventually, I want to take this dictionary and convert it to a pandas DataFrame with column names corresponding to the dictionary keys.

Here is my function:

It takes a list of URLs, a list of CSS selectors, and a list of column names, and gets each movie's categories, year, and length.

def web_scraper(urls, class_list, col_names):
    import requests             # Import necessary modules
    from bs4 import BeautifulSoup
    import pandas as pd
    
    class_dict = {}
    for col in col_names:
        class_dict[col] = []
    
    for url in urls:
        page = requests.get(url) # Link to the page
        soup = BeautifulSoup(page.content, 'html.parser')   # Create a soup object
    
        for i in range(len(class_list)):      # Loop through class_list and col_names
            names = soup.select(class_list[i])  # Get text
            names = [name.getText(strip=True) for name in names]   # append text to dataframe
            class_dict[col_names[i]].append(names)
    
    for class_ in class_dict:        # Here is my attempt to flatten the list
        class_ = [item for sublist in class_ for item in sublist]
        
    return class_dict

1 Answer

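Use list.extend instead of list.append. append adds the whole names list as a single element, which is what produces the list of lists, while extend adds each item of names individually: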

def web_scraper(urls, class_list, col_names):
    import requests             # Import necessary modules
    from bs4 import BeautifulSoup
    import pandas as pd
    
    class_dict = {}
    for col in col_names:
        class_dict[col] = []
    
    for url in urls:
        page = requests.get(url) # Link to the page
        soup = BeautifulSoup(page.content, 'html.parser')   # Create a soup object
    
        for i in range(len(class_list)):      # Loop through class_list and col_names
            names = soup.select(class_list[i])  # Select the elements matching this CSS selector
            names = [name.getText(strip=True) for name in names]   # Extract the stripped text of each element
            class_dict[col_names[i]].extend(names)   # <--- use list.extend
               
    return class_dict
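
To see the difference in isolation, here is a minimal sketch that needs no network access. The `page_results` data and the `"info"` key are made up for illustration and stand in for the text scraped from each page. It also shows why the flattening loop in the question had no effect: assigning to the loop variable only rebinds a local name, so the result has to be written back through the dictionary key.

```python
from itertools import chain

# Hypothetical per-page results standing in for the scraped text
page_results = [["Action", "2010"], ["Drama", "1994"]]

appended = []
extended = []
for names in page_results:
    appended.append(names)   # adds each list as one element, so the result is nested
    extended.extend(names)   # adds each item individually, so the result stays flat

# Flattening an already nested dictionary: assign back through the key,
# because rebinding the loop variable leaves the dictionary unchanged.
class_dict = {"info": appended}
for key in class_dict:
    class_dict[key] = list(chain.from_iterable(class_dict[key]))
```

After this runs, `appended` is still nested, while `extended` and `class_dict["info"]` hold the same flat list.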
Andrej Kesely
    Awesome! Thank you so much! I knew it was going to be something like that, but couldn't get out of the rut I was in, so much appreciated. – j.c.hayes82 Apr 20 '21 at 22:06