I am scraping this website (it is in Portuguese).
In Google Chrome, the XPath expression //div[@class='result-ofertas']//span[@class='location']/a[1]
correctly returns the neighborhood of each apartment for sale. You can try this yourself with Chrome's XPath Helper extension.
OK, so I try to download the website with R to automate the data extraction, using the XML package:
library(XML)
site <- "http://www.zap.com.br/imoveis/sao-paulo+sao-paulo/apartamento-padrao/aluguel/?rn=104123456&pag=1"
# Download and parse the page into an internal XML document
html.raw <- htmlTreeParse(site, useInternalNodes = TRUE, encoding = "UTF-8")
But when I download the website in R, the page source is not the same anymore, and the previous XPath expression returns null:
xpathApply(html.raw, "//div[@class='result-ofertas']//span[@class='location']/a[1]", xmlValue)
But if you manually download the website to your computer instead of fetching it with R, the XPath above works just fine.
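For example, parsing a manually saved copy (assuming it was saved as zap.html in the working directory) returns the expected neighborhoods:
html.local <- htmlTreeParse("zap.html", useInternalNodes = TRUE, encoding = "UTF-8")
xpathApply(html.local, "//div[@class='result-ofertas']//span[@class='location']/a[1]", xmlValue)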
It seems that R is downloading a different webpage (a "mobile" version; it is downloading this one instead of the correct one), not the page shown in Chrome.
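One quick way to check what R actually received (using the html.raw object parsed above) is to write it back to disk and open it in a browser:
# Dump the document R received, then open it in a browser to compare with Chrome's version
saveXML(html.raw, file = "r_version.html")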
My problem is not how to extract the information from this "different" page that R downloads. I can actually deal with that with the XPath expression below:
xpathApply(html.raw, "//p[@class='local']", xmlValue)
But I would really like to understand why and how this is happening.
More specifically:
- What is happening here?
- Why do Chrome and R get two different webpages, even though the address is the same?
- Is there a way to force R to download the exact webpage I see in Chrome? (This would be useful, because I usually test XPath expressions with the XPath Helper extension.) See the sketch after this list for what I suspect and would try.
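My guess (unconfirmed) is that the server inspects the User-Agent header and serves the mobile page to clients it does not recognize as desktop browsers, which would include R's default. If that is the cause, sending a desktop browser's User-Agent string should return the same page Chrome sees. A sketch of what I would try with RCurl (the UA string below is just an example copied from Chrome):
library(RCurl)
library(XML)
# Any recent desktop browser User-Agent string should do here
ua <- "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.116 Safari/537.36"
site <- "http://www.zap.com.br/imoveis/sao-paulo+sao-paulo/apartamento-padrao/aluguel/?rn=104123456&pag=1"
# Fetch the page while identifying as a desktop browser
html.txt <- getURL(site, useragent = ua, .encoding = "UTF-8")
# Parse the returned HTML and retry the original XPath
html.desktop <- htmlParse(html.txt, asText = TRUE, encoding = "UTF-8")
xpathApply(html.desktop, "//div[@class='result-ofertas']//span[@class='location']/a[1]", xmlValue)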