Here is a web-scraping script that uses the "requests" and "BeautifulSoup" modules to extract the movie names and ratings from the "https://www.imdb.com/chart/top/" page. For the third column, I also extract a short description of each movie by following the link found in "td.posterColumn a" for each row. To do that, I create a second soup object per row and pull the summary text from it. The method works and I'm able to build the table, but the runtime is very long, which is understandable because every iteration of the loop sends a new request and builds a new soup object. Could anyone suggest a faster, more efficient way to do this? Also, how do I make all the rows and columns appear in their entirety in the DataFrame output? Thanks!
import pandas as pd
from bs4 import BeautifulSoup
import requests
import time
import pdb
start_time=time.time()
response=requests.get("https://www.imdb.com/chart/top/")
soup = BeautifulSoup(response.content,"lxml")
body=soup.select("tbody.lister-list")[0]
titles=[]
ratings=[]
summ=[]
for row in body.select("tr"):
    title=row.select("td.titleColumn a")[0].get_text().strip()
    titles.append(title)
    rating=row.select("td.ratingColumn.imdbRating")[0].get_text().strip()
    ratings.append(rating)
    innerlink=row.select("td.posterColumn a")[0]["href"]
    link="https://imdb.com"+innerlink
    #pdb.set_trace()
    response2=requests.get(link).content
    soup2=BeautifulSoup(response2,"lxml")
    summary=soup2.select("div.summary_text")[0].get_text().strip()
    summ.append(summary)
df=pd.DataFrame({"Title":titles,"IMDB Rating":ratings, "Movie Summary":summ})
df.to_csv("imdbmovies.csv")
end_time=time.time()
finish=end_time-start_time
print("Runtime is {f:1.4f} secs".format(f=finish))
print(df)
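Since most of the runtime is presumably spent waiting on one request per row, one possible direction is to collect the links first and fetch the movie pages concurrently. The following is only a minimal sketch of that idea using the standard library's thread pool. The names `fetch_all` and `fake_fetch` and the example links are hypothetical, and `fake_fetch` merely stands in for the real `requests.get` plus `BeautifulSoup` step so that the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(links, fetch_one, max_workers=8):
    """Call fetch_one(link) for every link using a pool of threads.

    pool.map returns results in the same order as `links`, so the
    list can be used directly as a DataFrame column.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, links))


def fake_fetch(link):
    # Hypothetical stand-in: the real version would download `link`
    # and return the summary text parsed from the page.
    return "summary for " + link


if __name__ == "__main__":
    # Hypothetical links, used only to show the call.
    links = ["https://imdb.com/title/example-%d/" % i for i in range(1, 4)]
    print(fetch_all(links, fake_fetch))
```

In the script above, this would mean building the `titles`, `ratings`, and a list of links in the loop, then filling `summ` with a single `fetch_all` call. The worker count of 8 is an arbitrary assumption, and a small value is probably kinder to the site.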
Pandas DataFrame output:
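For the display part of the question, the truncation is presumably pandas' default print limits. A minimal sketch of two ways around it follows, assuming pandas 1.0 or newer (where `None` is accepted for `display.max_colwidth`); the small `df` here is hypothetical stand-in data, not the scraped table.

```python
import pandas as pd

# Hypothetical stand-in for the scraped DataFrame.
df = pd.DataFrame({
    "Title": ["Movie %d" % i for i in range(100)],
    "Movie Summary": ["A long summary sentence. " * 5 for _ in range(100)],
})

# Option 1: temporarily lift the display limits (None means "no limit")
# and stop pandas from wrapping wide frames across several blocks.
with pd.option_context(
    "display.max_rows", None,
    "display.max_columns", None,
    "display.max_colwidth", None,
    "display.expand_frame_repr", False,
):
    print(df)

# Option 2: render the whole frame to a string, independent of the
# row and column display limits.
full_text = df.to_string()
```

`pd.set_option` with the same option names would make the change permanent for the session instead of limiting it to the `with` block.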