I am new to Python and am trying to scrape IMDb. I am scraping the list of the top 250 IMDb movies and want to get information from each movie's own page, for example the length of each movie.

I already have a list of the unique URLs. I want to loop over this list and, for every URL, retrieve the 'length' of that movie. Is it possible to do this in a single piece of code?

for URL in urlofmovie:
    htmlsource = requests.get(URL)
    tree_url = html.fromstring(htmlsource)
    lengthofmovie = tree_url.xpath('//*[@class="subtext"]')

I expect that lengthofmovie will become a list of all the lengths of the movies. However, it already goes wrong at line 2: the htmlsource.

Tom de Geus
Marieke
  • What is in urlofmovie? Can you post the full code? What error are you getting? – Andrew Daly May 13 '19 at 10:58
  • Possible duplicate of [Does IMDB provide an API?](https://stackoverflow.com/questions/1966503/does-imdb-provide-an-api) – Sayse May 13 '19 at 11:03
  • "I expect that 'lengthofmovie' will become a list of all the lengths of the movies" => it will not - no language has mind-reading abilities, so if you want a list you have to use a list. – bruno desthuilliers May 13 '19 at 11:08
  • "However, it already goes wrong at line 2: the htmlsource." => that's a different question. Please post one question per problem. Also, when you have an error in your code, you're supposed to post the exact error message and the full traceback - but in this case, the error is very probably due to the fact that `requests.get` returns a `Response` object, not a string. You want the response's `.text` attribute instead (cf. the `requests` docs). – bruno desthuilliers May 13 '19 at 11:12
  • May I suggest a better way to do this? Try IMDbPY: https://pypi.org/project/IMDbPY/ – Underoos May 13 '19 at 11:16
  • Be sure to check the terms and conditions of a site before you decide to scrape it. IMDb's Terms of Use state, `"Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below."` – chitown88 May 13 '19 at 15:35

1 Answer

To get a list, first create an empty list and then append each movie's result to it inside the loop.

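import requests
from lxml import html

length_list = []
for URL in urlofmovie:
    htmlsource = requests.get(URL)
    # fromstring() needs the page body, not the Response object itself
    tree_url = html.fromstring(htmlsource.content)
    # xpath() returns a list of matching elements for this page
    length_list.append(tree_url.xpath('//*[@class="subtext"]'))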

Small tip: since you are new to Python, I suggest reading the PEP 8 conventions. Consistent variable naming makes life easier for you and for other developers (for example, urlofmovie -> urls_of_movies).

However, it already goes wrong at line 2: the htmlsource.

Please provide the exception you are receiving.
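
Without the traceback this is only a guess, but a common cause is passing the Response object itself to html.fromstring, which expects the page body. Below is a minimal sketch of the whole loop under that assumption. The helper names extract_subtext and fetch_subtexts and the sample markup are made up for illustration, and the class="subtext" selector is taken from the question, so it may not match what IMDb serves today.

```python
import requests
from lxml import html


def extract_subtext(page_body):
    """Return the normalised text of the first class="subtext" element, or None."""
    tree = html.fromstring(page_body)
    nodes = tree.xpath('//*[@class="subtext"]')
    if not nodes:
        return None
    # text_content() flattens nested tags; split/join collapses whitespace
    return " ".join(nodes[0].text_content().split())


def fetch_subtexts(urls):
    """Download every URL and collect one subtext string (or None) per page."""
    results = []
    for url in urls:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # fail loudly on HTTP errors
        # Pass the body (response.content), not the Response object
        results.append(extract_subtext(response.content))
    return results


# Offline check with a hypothetical snippet (not real IMDb markup)
sample = (
    b'<html><body><div class="subtext"> PG-13 | '
    b"<time> 2h 22min </time> | Drama</div></body></html>"
)
print(extract_subtext(sample))  # prints: PG-13 | 2h 22min | Drama
```

Keeping the parsing in its own function means it can be tried on a saved page before any requests are sent. With a real list you would call fetch_subtexts(urlofmovie) and then pick the running time out of each returned string.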

andreygold