I'm working on a web scraper function to get movie data from IMDb. I can get the data I need into a dictionary, but because of how I've written the function, each dictionary value ends up as a list of lists. I would like each value to be a single flat list.
Right now I'm getting
    dict_key: [[A,B,C,D],[E,F,G,H],...]
and I want
    dict_key: [A,B,C,D,E,F,G,H]
Eventually, I want to convert this dictionary to a pandas DataFrame whose column names correspond to the dictionary keys.
Here is my function. It takes a list of URLs, a list of HTML tags, and a list of column names, and gets each movie's category (or categories), year, and length.
def web_scraper(urls, class_list, col_names):
    import requests  # Import necessary modules
    from bs4 import BeautifulSoup
    import pandas as pd
    class_dict = {}
    for col in col_names:
        class_dict[col] = []
    for url in urls:
        page = requests.get(url)  # Fetch the page
        soup = BeautifulSoup(page.content, 'html.parser')  # Create a soup object
        for i in range(len(class_list)):  # Loop through class_list and col_names
            names = soup.select(class_list[i])  # Select the matching elements
            names = [name.getText(strip=True) for name in names]  # Extract the text of each element
            class_dict[col_names[i]].append(names)
    for class_ in class_dict:  # Here is my attempt to flatten the list
        class_ = [item for sublist in class_ for item in sublist]
    return class_dict
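
For reference, here is a minimal sketch of one possible way to get the flat shape. It uses made-up lists in place of the scraped text, so the `pages` data and the `nested`/`flat` names are assumptions for illustration only. Two things seem to be going on in the function: `append` adds each page's list as a single element, and the flattening loop iterates over the dictionary's keys (strings) and only rebinds the local name `class_`, so the dictionary itself is never changed.

```python
from itertools import chain

# Hypothetical per-page results standing in for the lists built from soup.select(...)
pages = [["A", "B", "C", "D"], ["E", "F", "G", "H"]]

nested = {"dict_key": []}
flat = {"dict_key": []}
for names in pages:
    nested["dict_key"].append(names)  # adds the list itself, giving a list of lists
    flat["dict_key"].extend(names)    # adds each item, giving one flat list

# Flattening after the fact: loop over the values and assign the result
# back into the dictionary by key, rather than rebinding the loop variable.
for key, sublists in nested.items():
    nested[key] = list(chain.from_iterable(sublists))

print(flat)
print(nested)
```

In the function above, that would mean either swapping `append` for `extend` inside the URL loop, or keeping `append` and writing the flattened list back with `class_dict[key] = ...`. Once every value is a flat list of the same length, `pd.DataFrame(class_dict)` should give one column per dictionary key.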