
I'm able to scrape data off of basic HTML pages, but I'm having trouble scraping the site below. It looks like the data is rendered via JavaScript, and I'm not sure how to approach that. I'd prefer to use R to scrape, if possible, but could also use Python.

Any ideas/suggestions?

Edit: I need to grab the Year/Manufacturer/Model, the S/N, the Price, the Location, and the short description (starts with "Auction:") for each listing.

http://www.machinerytrader.com/list/list.aspx?bcatid=4&DidSearch=1&EID=1&LP=MAT&ETID=5&catid=1015&mdlx=Contains&Cond=All&SO=26&btnSearch=Search&units=imperial

  • Look into Selenium. There are a few examples of its use via R here on SO, but not many. – Thomas Mar 05 '14 at 17:11
  • Use [CasperJS](http://casperjs.org/); it lets you connect to the page and wait for elements to load. You can also inject JavaScript directly into the page context. – Andrei Nemes Mar 05 '14 at 17:17
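
For reference, a minimal sketch of the Selenium-via-R route suggested in the comments, using the RSelenium package. This assumes a Selenium server is already running on localhost:4444, and the fixed sleep is a crude stand-in for a proper wait:

library(RSelenium)
library(XML)

remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444L,
                      browserName = "firefox")
remDr$open()
remDr$navigate("http://www.machinerytrader.com/list/list.aspx?bcatid=4&DidSearch=1&EID=1&LP=MAT&ETID=5&catid=1015&mdlx=Contains&Cond=All&SO=26&btnSearch=Search&units=imperial")
Sys.sleep(5)                                  # crude wait for the JavaScript to render
doc <- htmlParse(remDr$getPageSource()[[1]])  # getPageSource() returns a one-element list
remDr$close()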

2 Answers

library(XML)
library(relenium)

## download the rendered page source via a relenium Firefox session
website <- firefoxClass$new()
website$get("http://www.machinerytrader.com/list/list.aspx?pg=1&bcatid=4&DidSearch=1&EID=1&LP=MAT&ETID=5&catid=1015&mdlx=Contains&Cond=All&SO=26&btnSearch=Search&units=imperial")
doc <- htmlParse(website$getPageSource())

## read the tables and bind the information: the listings alternate between
## header tables (even indices 8-56) and detail tables (odd indices 9-57)
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
data <- do.call("rbind", tables[seq(from = 8, to = 56, by = 2)])
data <- cbind(data, sapply(lapply(tables[seq(from = 9, to = 57, by = 2)], '[[', i = 2), '[', 1))
rownames(data) <- NULL
names(data) <- c("year.man.model", "s.n", "price", "location", "auction")

This will give you what you want for the first page (showing just the first two lines here):

head(data,2)
      year.man.model      s.n      price location                                               auction
1 1972 AMERICAN 5530 GS14745W US $50,100       MI                   Auction: 1/9/2013; 4,796 Hours;  ..
2 AUSTIN-WESTERN 307      307  US $3,400       MT Auction: 12/18/2013;  AUSTIN-WESTERN track excavator.

To get all pages, just loop over them, pasting `pg=i` into the address, as in the sketch below.
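
A minimal sketch of that loop, reusing the `website` session from above. The page count (here 5) and the table indices 8-56 are assumptions; check both against the live site:

all_data <- NULL
for (pg in 1:5) {  # assumed page count; adjust to the actual number of pages
  url <- paste0("http://www.machinerytrader.com/list/list.aspx?pg=", pg,
                "&bcatid=4&DidSearch=1&EID=1&LP=MAT&ETID=5&catid=1015",
                "&mdlx=Contains&Cond=All&SO=26&btnSearch=Search&units=imperial")
  website$get(url)
  doc <- htmlParse(website$getPageSource())
  tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
  ## same even/odd pairing as above; assumes the layout is identical on every page
  page <- do.call("rbind", tables[seq(from = 8, to = 56, by = 2)])
  page <- cbind(page, sapply(lapply(tables[seq(from = 9, to = 57, by = 2)], '[[', i = 2), '[', 1))
  all_data <- rbind(all_data, page)
}
rownames(all_data) <- NULL
names(all_data) <- c("year.man.model", "s.n", "price", "location", "auction")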

  • Thanks for the quick response. When I run this code, however, I get null results. The readHTMLTable command doesn't seem to actually read anything. It just produces a null list. Any idea? – Don S Mar 05 '14 at 23:06
  • Also - I'm using Windows 7, if that makes a difference. – Don S Mar 06 '14 at 00:14
  • Thanks for pointing that out, you are right, I was indeed using a different setup that allowed direct download. I updated the answer, first downloading the source with `relenium` and then using `readHTMLTable`, it should work now! – Carlos Cinelli Mar 06 '14 at 02:33

Using `relenium`:

require(relenium) # More info: https://github.com/LluisRamon/relenium
require(XML)

firefox <- firefoxClass$new() # init browser
res <- NULL
pages <- 1:2
for (page in pages) {
  url <- sprintf("http://www.machinerytrader.com/list/list.aspx?pg=%d&bcatid=4&DidSearch=1&EID=1&LP=MAT&ETID=5&catid=1015&mdlx=Contains&Cond=All&SO=26&btnSearch=Search&units=imperial", page)
  firefox$get(url)
  doc <- htmlParse(firefox$getPageSource())
  ## each listing is split across two tables: one with an id ending in
  ## "tblListingHeader" (year/model, s/n, price, location) and one with an id
  ## ending in "tblContent" (the auction text); match them by id suffix
  res <- rbind(res,
               cbind(year_manu_model = xpathSApply(doc, '//table[substring(@id, string-length(@id)-15) = "tblListingHeader"]/tbody/tr/td[1]', xmlValue),
                     sn = xpathSApply(doc, '//table[substring(@id, string-length(@id)-15) = "tblListingHeader"]/tbody/tr/td[2]', xmlValue),
                     price = xpathSApply(doc, '//table[substring(@id, string-length(@id)-15) = "tblListingHeader"]/tbody/tr/td[3]', xmlValue),
                     loc = xpathSApply(doc, '//table[substring(@id, string-length(@id)-15) = "tblListingHeader"]/tbody/tr/td[4]', xmlValue),
                     auc = xpathSApply(doc, '//table[substring(@id, string-length(@id)-9) = "tblContent"]/tbody/tr/td[2]', xmlValue))
  )
}
sapply(as.data.frame(res), substr, 1, 30)
#      year_manu_model                  sn               price         loc   auc                               
# [1,] " 1972 AMERICAN 5530"            "GS14745W"       "US $50,100"  "MI " "\n\t\t\t\t\tAuction: 1/9/2013; 4,796" 
# [2,] " AUSTIN-WESTERN 307"            "307"            "US $3,400"   "MT " "\n\t\t\t\t\tDetails & Photo(s)Video(" 
# ...
  • Installed relenium, but I get "Error: WebDriverException" when I run your exact code above. Any idea on what might be causing this? – Don S Mar 06 '14 at 00:10
  • @lukeA - the error is gone, but the "auc" field has two issues: 1) it's not pulling the full text, and 2) it alternately pulls the "Details & Photos" text for some reason (example: the 1st record pulls the auction data, the 2nd record pulls Details & Photos, the 3rd record pulls auction data...). Any idea? – Don S Mar 06 '14 at 00:54
  • Figured out the first issue - just set the sapply argument from 30 to 300. Also seeing that the "auc" field is pulling in `\n\t\t\t\t\t` for some reason. – Don S Mar 06 '14 at 01:02
  • @user3384596 `sapply` shortened the output a bit, which is stored in `res`. You should be able to strip trailing control characters easily using e.g. `stringr::str_trim()` or `tm::stripWhitespace()` or just `gsub`. To the other issue: adapt the xpath to fit your needs. – lukeA Mar 06 '14 at 01:57
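
For reference, a minimal cleanup sketch along the lines lukeA suggests, using plain `gsub` on the `res` matrix from the answer above:

res_df <- as.data.frame(res, stringsAsFactors = FALSE)
res_df$auc <- gsub("[\r\n\t]+", " ", res_df$auc)   # collapse control-character runs
res_df$auc <- gsub("^\\s+|\\s+$", "", res_df$auc)  # trim leading/trailing whitespace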