I want to scrape the home page of one of the new Stack Exchange websites: https://webapps.stackexchange.com/ (just once, and only a few pages, nothing that should bother the servers). If I wanted this data from Stack Overflow, I know there is a database dump, but for the new Stack Exchange sites no dump exists yet.

Here is what I want to do.

Step 1: choose URL

URL <- "https://webapps.stackexchange.com/"

Step 2: read the table

library(XML)
readHTMLTable(URL)  # oops, doesn't work - gives NULL

Step 3: this time, let's try parsing the page with htmlTreeParse from the XML package

htmlTreeParse(URL) # o.k, this reads the data - but it is all in <div> - now what?

So I was able to read the page, but the content is structured in divs rather than in a table. How can I turn the parsed page into the kind of data frame that readHTMLTable would return?
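
For illustration, here is a minimal sketch of one way div-based markup could be turned into a data frame using XPath queries from the XML package. The HTML snippet and its class names ("question-summary", "votes", "title") are hypothetical stand-ins, not the real structure of webapps.stackexchange.com, so the XPath expressions would need to be adapted after inspecting the actual page.

```r
library(XML)

# Hypothetical markup standing in for a page that lists questions in <div>s.
# The class names below are assumptions made for this sketch only.
html <- '
<html><body>
  <div class="question-summary">
    <div class="votes">3</div>
    <a class="title" href="/q/1">First question</a>
  </div>
  <div class="question-summary">
    <div class="votes">8</div>
    <a class="title" href="/q/2">Second question</a>
  </div>
</body></html>'

# Parse the string into an internal document that supports XPath queries.
doc <- htmlParse(html, asText = TRUE)

# One XPath query per column; each call returns a character vector.
votes  <- xpathSApply(doc, "//div[@class='question-summary']/div[@class='votes']", xmlValue)
titles <- xpathSApply(doc, "//div[@class='question-summary']/a[@class='title']", xmlValue)
links  <- xpathSApply(doc, "//div[@class='question-summary']/a[@class='title']", xmlGetAttr, "href")

# Combine the columns into a data frame, one row per question.
questions <- data.frame(
  votes = as.integer(votes),
  title = titles,
  link  = links,
  stringsAsFactors = FALSE
)
```

Querying each column separately only lines up if every question div contains every field; if some fields can be missing, it is probably safer to loop over the question nodes and extract the fields from each node in turn.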

Tal Galili
  • Duplicate? http://stackoverflow.com/questions/1395528/scraping-html-tables-into-r-data-frames-using-the-xml-package – Shane Aug 20 '10 at 17:32
  • See also http://stackoverflow.com/questions/2998655/how-to-isolate-a-single-element-from-a-scraped-web-page-in-r/ – Shane Aug 20 '10 at 17:55

2 Answers

8

You can do this with the overflowr package, which uses the StackExchange API. Just call the get.questions() function and supply the site prefix. The package isn't on CRAN because it isn't complete, but you can download the source and build it.

library(overflowr)
questions <- get.questions(50)

To get the 5 most recent questions from the statistics site:

questions <- get.questions(top.n=5, site="stats.stackexchange")

Incidentally, I'm happy to have more people work on this project, because I don't have any more time to spend on it. Three of the moderators from Stats.Exchange are currently working on it.

Shane
  • This looks great, Shane! Any chance there is a download link for a prebuilt Windows version? – Tal Galili Aug 20 '10 at 18:12
  • Nope, sorry. You will have to check it out from svn and build it. I don't see much point in providing a download version until there's more to it. The core infrastructure is there, but you can't do basic things (like pull answers). – Shane Aug 20 '10 at 18:16
  • Great, and if you make any enhancements, please feel free to submit them back. – Shane Aug 20 '10 at 18:28
0

What are you writing this in? I wrote an application that parses data out of a web scrape (link). I would be more than happy to share the logic.

Josh K