Scraping IMDb Review Page with lxml and requests package

Question

I want to extract the user reviews of a particular movie with the help of lxml. Before that, I need to find out the number of reviews.

An example review page is the one for Interstellar (http://www.imdb.com/title/tt0816692/reviews?start=0).

With the help of Firebug, I found the XPath of the element that contains the number of user reviews:

/html/body/div[1]/div/layer/div[4]/div[3]/div[3]/div[3]/table[2]/tbody/tr/td[2]

I have this code to extract that element:

import lxml.html
import requests

reviewPage = lxml.html.document_fromstring(requests.get("http://www.imdb.com/title/tt0816692/reviews?start=0").content)
number_of_reviews = reviewPage.xpath("/html/body/div[1]/div/layer/div[4]/div[3]/div[3]/div[3]/table[2]/tbody/tr/td[2]")[0]

However, when I print the number of reviews, I get nothing. What is the problem?

halex · Accepted Answer · 2015-03-05T09:24:47.877

2

You can use the following line to extract the number of reviews:

number_of_reviews = int(reviewPage.xpath("//div[@id = 'tn15content']/table[2]/tr/td[2]")[0].text_content().split()[0])

You can even use your own code with a small modification. The problem lies in your XPath: remove the tbody step and it works.

number_of_reviews = reviewPage.xpath("/html/body/div[1]/div/layer/div[4]/div[3]/div[3]/div[3]/table[2]/tr/td[2]")[0]

You probably got the structure of the HTML from your browser's developer tools. The browser adds a tbody element to the tree it displays even though that element does not exist in the HTML source. If you look at the HTML file directly through View Source (Ctrl+U), you will see that there is no tbody in the file.

See Why does firebug add <tbody> to <table>?
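The tbody behaviour described above can be checked offline. The following is a minimal sketch, not the real IMDb markup: SAMPLE_HTML is a hypothetical, heavily simplified page, and the count 2306 is made up for illustration. It assumes only that lxml is installed. It runs the same XPath with and without the tbody step, then applies the text_content().split()[0] chain from the answer.

```python
import lxml.html

# Hypothetical stand-in for the review page; the real markup is more complex
# and the number 2306 is invented for this example.
SAMPLE_HTML = """
<html><body>
<div id="tn15content">
  <table><tr><td>navigation</td></tr></table>
  <table><tr><td>Index</td><td>2306 reviews in total</td></tr></table>
</div>
</body></html>
"""

page = lxml.html.document_fromstring(SAMPLE_HTML)

# XPath as copied from the browser: it contains a tbody step that is
# not present in the HTML source, so lxml finds nothing.
with_tbody = page.xpath("//div[@id='tn15content']/table[2]/tbody/tr/td[2]")

# Same path without tbody, matching the markup as it was written.
without_tbody = page.xpath("//div[@id='tn15content']/table[2]/tr/td[2]")

# Take the cell text, keep the first whitespace-separated token, convert it.
cell_text = without_tbody[0].text_content()
number_of_reviews = int(cell_text.split()[0])

print(len(with_tbody), len(without_tbody), number_of_reviews)
```

Because xpath() returns a list, an XPath that matches nothing gives an empty list, and indexing it with [0] raises an IndexError. Checking the length first makes this kind of mismatch easier to spot.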

edited Mar 05 '15 at 09:24

answered Mar 05 '15 at 09:11


Why can't I access it using my xpath? – GokuShanth Mar 05 '15 at 09:22
@GokuShanth Because your XPath explicitly searches for the `tbody` element, which does not exist in the HTML file; your browser added it as an element. Your original XPath therefore matches nothing in the tree that lxml builds. Remove the `tbody` and your XPath works – halex Mar 05 '15 at 09:24
I was also trying this on another page, http://stackoverflow.com/questions/18366211/ssrs-look-up-field-in-dataset-that-is-not-part-of-report, to get the question title using XPath. The XPath is /html/body/div[5]/div[2]/div/div[1]/h1/a, but it still didn't give me the text. Why? – GokuShanth Mar 05 '15 at 10:29
@GokuShanth That's the XPath if you are logged in. Your Python script is not logged in to Stack Overflow, so you have to use a different one: `/html/body/div[5]/div[2]/div/div[2]/h1/a`. The shorter and more meaningful XPath `"//div[@id='question-header']/h1/a"` works in both cases – halex Mar 05 '15 at 12:02
Thanks a million! I think Firebug is not that great. The XPath Finder on Chrome is better. – GokuShanth Mar 06 '15 at 06:51
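The point made in the comments, that an id-based XPath survives layout differences where a positional one does not, can be sketched as follows. The two HTML snippets are hypothetical simplifications, not Stack Overflow's actual markup, and the helper name titles is made up for this example. Only the id question-header and the id-based XPath come from the thread.

```python
import lxml.html

# Hypothetical page variant without an extra block before the header.
LOGGED_OUT = """<html><body>
<div id="content">
  <div id="question-header"><h1><a href="/q/1">Sample question title</a></h1></div>
</div>
</body></html>"""

# Hypothetical variant with one extra div, which shifts every position after it.
LOGGED_IN = """<html><body>
<div id="content">
  <div class="notice">Welcome back</div>
  <div id="question-header"><h1><a href="/q/1">Sample question title</a></h1></div>
</div>
</body></html>"""

POSITIONAL = "/html/body/div[1]/div[1]/h1/a"  # depends on the exact layout
BY_ID = "//div[@id='question-header']/h1/a"   # depends only on the id


def titles(html, xpath):
    """Return the text of every link matched by xpath in the given HTML."""
    doc = lxml.html.document_fromstring(html)
    return [a.text_content() for a in doc.xpath(xpath)]


for name, html in (("logged out", LOGGED_OUT), ("logged in", LOGGED_IN)):
    print(name, titles(html, POSITIONAL), titles(html, BY_ID))
```

In this sketch the positional path matches only the variant it was written for, while the id-based path matches both.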
