I have a web-scraping script that uses the "requests" and "BeautifulSoup" modules to extract movie names and ratings from "https://www.imdb.com/chart/top/". I also extract a short description of each movie by following the link in "td.posterColumn a" for each row, which forms the third column. To do so, I create a second soup object for each row and extract the summary text from it. The method works and I can build the table, but the runtime is too long, which is understandable because a new page is requested and a new soup object is created on every iteration over the rows. Could anyone suggest a faster, more efficient way to do this? Also, how do I make all the rows and columns appear in their entirety in the DataFrame output? Thanks!

import pandas as pd
from bs4 import BeautifulSoup
import requests
import time
import pdb
start_time=time.time()
response=requests.get("https://www.imdb.com/chart/top/")
soup = BeautifulSoup(response.content,"lxml")
body=soup.select("tbody.lister-list")[0]
titles=[]
ratings=[]
summ=[]
for row in body.select("tr"):
    title=row.select("td.titleColumn a")[0].get_text().strip()
    titles.append(title)
    rating=row.select("td.ratingColumn.imdbRating")[0].get_text().strip()
    ratings.append(rating)
    innerlink=row.select("td.posterColumn a")[0]["href"]
    link="https://imdb.com"+innerlink
    #pdb.set_trace()
    response2=requests.get(link).content
    soup2=BeautifulSoup(response2,"lxml")
    summary=soup2.select("div.summary_text")[0].get_text().strip()
    summ.append(summary)
df=pd.DataFrame({"Title":titles,"IMDB Rating":ratings, "Movie Summary":summ})
df.to_csv("imdbmovies.csv")
end_time=time.time()
finish=end_time-start_time
print("Runtime is {f:1.4f} secs".format(f=finish))
print(df)
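
The runtime here is mostly spent waiting on one detail-page request per row, so one option is to collect the links first and fetch them concurrently. The following is only a minimal sketch, assuming a thread pool is acceptable for this workload: `fetch_all` and `make_summary_fetcher` are hypothetical helper names, the worker count and timeout are arbitrary, and the `div.summary_text` selector is copied from the script above and may not match the current page layout.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(links, fetch_one, max_workers=8):
    """Apply fetch_one to every link concurrently, keeping the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order of `links`,
        # even when the underlying requests finish out of order.
        return list(pool.map(fetch_one, links))


def make_summary_fetcher(timeout=10):
    """Build a fetch_one function that downloads a page and extracts its summary."""
    # Imported here so fetch_all itself has no third-party dependencies.
    import requests
    from bs4 import BeautifulSoup

    def fetch_summary(link):
        html = requests.get(link, timeout=timeout).content
        soup = BeautifulSoup(html, "lxml")
        # Selector taken from the script above (assumption: it still matches).
        node = soup.select_one("div.summary_text")
        return node.get_text().strip() if node else ""

    return fetch_summary
```

In the loop, the idea would be to append each `link` to a list instead of requesting it immediately, and then build the third column afterwards with `summ = fetch_all(links, make_summary_fetcher())`. Keeping `max_workers` small is a deliberate choice to avoid flooding the site with requests.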

Pandas DataFrame output: (screenshot not included)

  • What you are doing already produces the DataFrame in its `entirety`; you are not constructing a new DF for every output. – bigbounty Jul 20 '20 at 01:58
  • Actually, what I meant is how do I display all the rows and all the columns in a single table format? In the output as you can see the "Movie Summary" column is situated below the other 2 columns. I want all the columns side-by-side without the last column having any "......" at the end such that I can scroll horizontally to view the full DataFrame. Any way I can do that? Thanks! – Sujith Jul 20 '20 at 02:41
  • Refer to this: https://stackoverflow.com/questions/19124601/pretty-print-an-entire-pandas-series-dataframe – bigbounty Jul 20 '20 at 02:43
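
For the display question raised in the comments, one approach that should give a single side-by-side table without "..." is `DataFrame.to_string()`, which renders every row and column and does not truncate cell contents by default. A minimal sketch follows; `render_full` is a hypothetical helper name and the sample rows are placeholders, not scraped data.

```python
import pandas as pd


def render_full(df):
    """Return the whole DataFrame as text: all rows, all columns, full cell contents."""
    # With its default arguments, to_string does not truncate rows, columns,
    # or cell text, and it does not wrap wide frames onto several blocks.
    return df.to_string()


# Placeholder data standing in for the scraped table.
sample = pd.DataFrame(
    {
        "Title": ["Movie A", "Movie B"],
        "IMDB Rating": ["9.2", "9.1"],
        "Movie Summary": ["A long summary " * 10, "Another long summary " * 10],
    }
)
print(render_full(sample))
```

Because the output is one wide block of text, whether it can be scrolled horizontally depends on the terminal or notebook used to view it.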
