
As an intern in an economic research team, I was given the task of finding a way to automatically collect specific data from a real estate ad website, using R.

I assume that the relevant packages are XML and RCurl, but my understanding of how they work is very limited.

Here is the main page of the website: http://www.leboncoin.fr/ventes_immobilieres/offres/nord_pas_de_calais/?f=a&th=1&zz=59000

Ideally, I'd like to construct my database so that each row corresponds to an ad.

Here is the detail of an ad: http://www.leboncoin.fr/ventes_immobilieres/197284216.htm?ca=17_s

My variables are: the price ("Prix"), the city ("Ville"), the surface ("Surface"), the "GES", the "Classe énergie" and the number of rooms ("Pièces"), as well as the number of pictures shown in the ad. I would also like to export the text into a character vector over which I would perform a text mining analysis later on.

I'm looking for any help, or a link to a tutorial or how-to, that would point me in the right direction.

Naveen
Alexis Matelin
  • http://stackoverflow.com/questions/1395528/scraping-html-tables-into-r-data-frames-using-the-xml-package – Chase May 13 '11 at 11:30

2 Answers


You can use the XML package in R to scrape this data. Here is a piece of code that should help.

# DEFINE UTILITY FUNCTIONS

# Function to Get Links to Ads by Page
get_ad_links = function(page){
  require(XML)
  # construct url to the results page
  url_base = "http://www.leboncoin.fr/ventes_immobilieres/offres/nord_pas_de_calais/"
  url      = paste(url_base, "?o=", page, "&zz=", 59000, sep = "")
  doc      = htmlTreeParse(url, useInternalNodes = TRUE)

  # extract links to the ads on the page
  xp_exp   = "//td/a[contains(@href, 'ventes_immobilieres')]"
  ad_links = xpathSApply(doc, xp_exp, xmlGetAttr, "href")
  return(ad_links)
}

# Function to Get Ad Details by Ad URL
get_ad_details = function(ad_url){
  require(XML)
  # parse the ad url into an html tree
  doc = htmlTreeParse(ad_url, useInternalNodes = TRUE)

  # extract labels and values using xpath expressions
  labels  = xpathSApply(doc, "//span[contains(@class, 'ad')]/label", xmlValue)
  values1 = xpathSApply(doc, "//span[contains(@class, 'ad')]/strong", xmlValue)
  values2 = xpathSApply(doc, "//span[contains(@class, 'ad')]//a", xmlValue)
  values  = c(values1, values2)

  # convert to a one-row data frame and add the labels as column names
  mydf        = as.data.frame(t(values))
  names(mydf) = labels
  return(mydf)
}

Here is how you would use these functions to extract information into a data frame.

# grab ad links from page 1
ad_links = get_ad_links(page = 1)

# grab ad details for first 5 links from page 1
require(plyr)
ad_details = ldply(ad_links[1:5], get_ad_details, .progress = 'text')

This returns the following output:

Prix :     Ville :  Frais d'agence inclus :  Type de bien :  Pièces :  Surface :  Classe énergie :          GES : 
469 000 € 59000 Lille                      Oui          Maison         8     250 m2  F (de 331 à 450)           <NA>
469 000 € 59000 Lille                      Oui          Maison         8     250 m2  F (de 331 à 450)           <NA>
140 000 € 59000 Lille                     <NA>     Appartement         2      50 m2  D (de 151 à 230) E (de 36 à 55)
140 000 € 59000 Lille                     <NA>     Appartement         2      50 m2  D (de 151 à 230) E (de 36 à 55)
170 000 € 59000 Lille                     <NA>     Appartement      <NA>      50 m2  D (de 151 à 230) D (de 21 à 35)

You can easily use the apply family of functions to loop over multiple pages and collect the details of all ads. Two things to be mindful of: one is the legality of scraping the website, and the other is to call Sys.sleep in your loop so that the servers are not bombarded with requests.
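For example, a polite loop over the first few pages could be sketched like this (the page range 1:3 and the two-second pause are arbitrary choices; it reuses the `get_ad_links` and `get_ad_details` functions defined above):

```r
require(plyr)

# collect ad links from the first 3 result pages,
# pausing between requests to be gentle with the server
all_links = unlist(lapply(1:3, function(p) {
  Sys.sleep(2)
  get_ad_links(page = p)
}))

# fetch the details of every ad, again pausing between requests
all_ads = ldply(all_links, function(link) {
  Sys.sleep(2)
  get_ad_details(link)
}, .progress = 'text')
```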

Let me know how this works.

Ramnath
  • I don't know how to thank you. The function get_ad_details works just fine, but the function get_ad_links gets stuck at the `paste(url_base, "?o=", page, "&zz=", 59000, sep = "")` part, with this error message: "Error in paste(url_base, "?o=", page, "&zz=", 59000, sep = "") : cannot coerce type 'closure' to vector of type 'character'". My guess was that the function calls for the "page" object, which is only created afterward. – Alexis Matelin May 18 '11 at 12:32
  • 1
    @alexis The `get_ad_links` function contained an unbalanced `(`, which I have now fixed. Try it again and it should work. Let me know :) – Ramnath May 18 '11 at 13:40

That's quite a big question, so you need to break it down into smaller ones, and see which bits you get stuck on.

Is the problem with retrieving a web page? (Watch out for proxy server issues.) Or is the tricky bit accessing the useful pieces of data within it? (You'll probably need to use XPath for this.)

Take a look at the web-scraping example on Rosetta Code and browse these SO questions for more information.
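As a minimal illustration of those two steps with the XML package (the XPath expression and the `class = 'price'` attribute are placeholder assumptions, not the actual structure of the page):

```r
require(XML)

# Step 1: retrieve a web page and parse it into an HTML tree
# (this is the step that may fail behind a proxy server)
doc = htmlTreeParse("http://www.leboncoin.fr/ventes_immobilieres/197284216.htm",
                    useInternalNodes = TRUE)

# Step 2: pull the useful bits out with an XPath expression
# (hypothetical node and class names -- inspect the page source
#  to find the real ones)
prices = xpathSApply(doc, "//span[@class = 'price']", xmlValue)
```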

Richie Cotton
  • Well, I am dealing with two issues: first, how to access and extract the useful bits of data from the ad so that they can be exploited, and then how to automate the process in order to use it on the whole website. I should mention that I am new to these procedures, as well as to R. – Alexis Matelin May 13 '11 at 14:50