I'm trying to scrape data off a website in R using the XML package, but I'm not getting any results. The problem starts at the first line of my code: the readHTMLTable call isn't finding any tables, so everything after it ends up NULL.

url <- "http://www.machinerytrader.com/list/list.aspx?pg=1&ETID=5&catid=1015&SO=26&mdlx=contains&bcatid=4&Pref=0&Thumbs=1&scf=false&units=imperial"

Code:

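library(XML)

# Read every <table> on the page into a list of data frames
tables <- readHTMLTable(url, stringsAsFactors = FALSE)
# Stack tables 8, 10, ..., 56 into one data frame
data <- do.call("rbind", tables[seq(from = 8, to = 56, by = 2)])
# Append the first entry of column 2 from tables 9, 11, ..., 57
data <- cbind(data, sapply(lapply(tables[seq(from = 9, to = 57, by = 2)], '[[', i = 2), '[', 1))
rownames(data) <- NULL
names(data) <- c("year.man.model", "s.n", "price", "location", "auction")
head(data)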

Any help would be greatly appreciated!

Don

Don S
  • your first line gives me a list of 0. – JeremyS Mar 06 '14 at 00:49
  • Yeah that's definitely where the issue stems from, but I can't figure out why. I'll edit original question to make that clear. – Don S Mar 06 '14 at 00:54
  • It seems the table is generated by JavaScript, which makes it a bit more challenging, but have a search and you might find some useful code. – Ben Mar 06 '14 at 01:18

2 Answers

It looks like the problem is the wretchedly built site rather than your code. Doing the following "manually":

library(RCurl)
library(XML)

url <- "http://www.machinerytrader.com/list/list.aspx?pg=1&ETID=5&catid=1015&SO=26&mdlx=contains&bcatid=4&Pref=0&Thumbs=1&scf=false&units=imperial"
pg <- getURL(url)
conn <- textConnection(pg)
pg <- readLines(conn)
close(conn)

shows that element [33] of pg (in this particular call) is a JavaScript placeholder rather than any table markup:

pg[33]
[1] "<noscript>Please enable JavaScript to view the page content.</noscript>" 
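
The same check can be done without knowing the line number in advance. Here is a minimal sketch in base R, assuming pg is a character vector of page lines like the one produced above; the sample lines are a made-up stand-in for the real response, not the actual page:

```r
# Hypothetical stand-in for the lines returned by readLines() above;
# the real response is longer and its exact content may differ.
pg <- c(
  "<html>",
  "<body>",
  "<noscript>Please enable JavaScript to view the page content.</noscript>",
  "</body>",
  "</html>"
)

# Indices of lines that contain a <noscript> placeholder
noscript_lines <- grep("<noscript", pg, fixed = TRUE)

# TRUE if any line of the raw HTML contains a <table> tag
has_table <- any(grepl("<table", pg, fixed = TRUE))

if (length(noscript_lines) > 0 && !has_table) {
  message("No <table> in the raw HTML; the content is probably built by JavaScript.")
}
```

If noscript_lines is non-empty and has_table is FALSE, readHTMLTable has nothing to parse, which matches the empty list seen in the question.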

I usually do a quick debug in Google Spreadsheets via the IMPORTHTML function (I actually prefer letting Google handle the data import and transformation in general) and it couldn't even scrape the page.

I tried it with both command-line curl and wget and (unsurprisingly) got the same result.

You may need to go the route described in "Scraping websites with Javascript enabled?" to get what you need. I might be missing something obvious, though.

hrbrmstr

Got an answer on a different thread. Basically, you need to use the relenium package in R.

Solution: Scraping javascript website
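
For anyone landing here, the general idea is to let a real browser run the page's JavaScript and then hand the rendered HTML to the XML package. The sketch below is not the relenium code from the linked thread: it swaps in RSelenium, a different package, and assumes a Selenium server is already running on localhost port 4444 with Firefox available. The helper names render_page_source and tables_from_source and the fixed five-second wait are made up for illustration.

```r
library(XML)

# Assumption: RSelenium is installed and a Selenium server is listening on
# localhost:4444. Neither detail comes from the thread above.
render_page_source <- function(url, wait_seconds = 5) {
  if (!requireNamespace("RSelenium", quietly = TRUE)) {
    stop("The RSelenium package is required for this sketch.")
  }
  remDr <- RSelenium::remoteDriver(
    remoteServerAddr = "localhost",
    port = 4444L,
    browserName = "firefox"
  )
  remDr$open(silent = TRUE)
  on.exit(remDr$close(), add = TRUE)  # close the browser even on error
  remDr$navigate(url)
  Sys.sleep(wait_seconds)             # crude fixed wait for scripts to finish
  remDr$getPageSource()[[1]]          # rendered HTML as a single string
}

# Parse an HTML string and return its tables as a list of data frames
tables_from_source <- function(src) {
  doc <- htmlParse(src, asText = TRUE)
  readHTMLTable(doc, stringsAsFactors = FALSE)
}

# Usage (not run here; needs the Selenium server described above):
# src <- render_page_source(url)
# tables <- tables_from_source(src)
# length(tables)
```

Once tables is no longer an empty list, the rest of the code in the question can be applied to it, though the table indices may need adjusting to whatever the rendered page actually contains.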

Don S