
I am web scraping this website (in Portuguese).

In Google Chrome, the XPath expression //div[@class='result-ofertas']//span[@class='location']/a[1] correctly returns the neighborhood of the apartments for sale. You can verify this yourself with Chrome's XPath Helper extension.

So I try to download the page with R, using the XML package, to automate the data extraction:

library(XML)
site <- "http://www.zap.com.br/imoveis/sao-paulo+sao-paulo/apartamento-padrao/aluguel/?rn=104123456&pag=1"
html.raw <- htmlTreeParse(site, useInternalNodes = TRUE, encoding = "UTF-8")

But when I download the page with R, the page source is not the same anymore.

The previous XPath expression returns NULL:

xpathApply(html.raw, "//div[@class='result-ofertas']//span[@class='location']/a[1]", xmlValue)

But if you manually download the page to your computer instead of fetching it with R, the XPath above works just fine.

It seems that R is downloading a different webpage (a "mobile" version, this one instead of the correct one), and not the one shown in Chrome.

My problem is not how to extract the information from this "different" page that R downloads; I can handle that with the XPath expression below:

xpathApply(html.raw, "//p[@class='local']", xmlValue)

But I really would like to understand why and how this is happening.

More specifically:

  1. What is happening here?
  2. Why are the two webpages (Chrome's and R's) different, even though the address is the same?
  3. Is there a way to force R to download the exact webpage I see in Chrome? (This would be useful, because I usually test the XPath expressions with the XPath Helper extension.)
Carlos Cinelli

2 Answers


The site is most likely serving different content based on the request's User-Agent. Try setting the user agent of R's request to match your Chrome user agent (you can find it on the Network tab of Chrome's developer tools: select a request and view its headers).
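
For example, here is a minimal sketch (my own, not part of the original answer) that sends a Chrome-like User-Agent with RCurl and then parses the result with the XML package. The UA string is just illustrative; any recent desktop Chrome string should do:

library(RCurl)
library(XML)

site <- "http://www.zap.com.br/imoveis/sao-paulo+sao-paulo/apartamento-padrao/aluguel/?rn=104123456&pag=1"

# Illustrative desktop Chrome User-Agent string
ua <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36"

# Fetch the page while impersonating Chrome, then parse the returned HTML text
page <- getURL(site, useragent = ua, .encoding = "UTF-8")
html.raw <- htmlTreeParse(page, asText = TRUE, useInternalNodes = TRUE, encoding = "UTF-8")

# If the server keys on the User-Agent, the XPath from the question should now match
xpathApply(html.raw, "//div[@class='result-ofertas']//span[@class='location']/a[1]", xmlValue)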

joemfb
  • If it is not too much trouble, could you be more specific about where in Chrome I can find that information and what command in R I have to write to make the user agent match? Thanks! – Carlos Cinelli Feb 02 '14 at 11:04
  • I have found the following: `User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36` – Carlos Cinelli Feb 02 '14 at 11:09

I have solved the problem with the download.file() function from the utils package: I first download the file to disk and then parse it. It takes a long time, though, so this is not an optimal solution, and I am still not sure why this is happening. So if anyone else has another solution/answer...
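
A minimal sketch of that workaround, assuming a local file name of my choosing (zap_pag1.html is hypothetical):

library(XML)

site <- "http://www.zap.com.br/imoveis/sao-paulo+sao-paulo/apartamento-padrao/aluguel/?rn=104123456&pag=1"
local.file <- "zap_pag1.html"  # hypothetical destination on disk

# Save the page to disk first, then parse the local copy
download.file(site, destfile = local.file)
html.raw <- htmlTreeParse(local.file, useInternalNodes = TRUE, encoding = "UTF-8")

# With the locally saved page, the XPath from the question should work, as described above
xpathApply(html.raw, "//div[@class='result-ofertas']//span[@class='location']/a[1]", xmlValue)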

Carlos Cinelli