
I am quite new to R. I want to compile a 1-million-word corpus of newspaper articles, so I am trying to write a web scraper that retrieves newspaper articles from, e.g., the Guardian website: http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs.

The scraper is meant to start on one page, retrieve the article's body text, remove all tags and save it to a text file. Then it should go to the next article via the links on this page, get the article and so on until the file contains about 1 million words.

Unfortunately, I did not get very far with my scraper.

I used readLines() to read the page's source and would now like to get hold of the relevant lines in the code.

The relevant section in the Guardian uses this id to mark the body text of the article:

<div id="article-body-blocks">         
  <p>
    <a href="http://www.guardian.co.uk/politics/boris"
       title="More from guardian.co.uk on Boris Johnson">Boris Johnson</a>,
       the...a different approach."
  </p>
</div>

I tried to get hold of this section using various expressions with grep and lookbehind, trying to capture the line after this id, but they do not seem to match across multiple lines. At least I cannot get it to work.
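For what it's worth, I suspect the reason is that readLines() returns a character vector with one element per line, so no pattern can ever span a line break; collapsing the vector into a single string first seems to be necessary (a rough check, based on the markup shown above):

html <- readLines('http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs')
# each element of 'html' is one line, so the id and the <p> never co-occur
any(grepl('article-body-blocks">[[:space:]]*<p>', html))   # FALSE
# collapsed into a single string, the pattern can match across the break
page <- paste(html, collapse = '\n')
grepl('article-body-blocks">[[:space:]]*<p>', page)        # TRUE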

Could anybody help out? It would be great if somebody could provide me with some code I can continue working on!

Thanks.

  • Any particular reason why you don't make use of the R packages that support web scraping, e.g. RCurl, XML or scrapeR? See for example http://stackoverflow.com/q/5830705/602276, http://stackoverflow.com/q/7501148/602276, http://stackoverflow.com/q/3746256/602276 and http://stackoverflow.com/q/4882123/602276 – Andrie Oct 31 '11 at 18:45
  • Thanks! Well, yes, I have read up on that too; it just seems more difficult to me, as I am not very good with R yet. So I found this: http://www.programmingr.com/content/webscraping-using-readlines-and-rcurl, which looked easy to use. It's just that the HTML of my website is a bit more complicated. – Kat Oct 31 '11 at 19:04
  • You HAVE read the Guardian's T+Cs, haven't you? "Except as expressly authorised by the Guardian, you are not allowed to create a database in electronic or paper form comprising all or part of the material appearing on the Guardian Site". – Spacedman Nov 01 '11 at 12:46
  • @Spacedman, yes, I read their T+Cs. They allow personal, non-commercial use, though: "You may download and print extracts from the Guardian Content for your own personal and non-commercial use only [...]". Anyway, I just use the site as a "template" to build the scraper. – Kat Nov 01 '11 at 21:54
  • I reckon the 'no creating databases' rule overrides the personal and non-commercial download and print rule, but hey, the lawyers won't be chasing you :) – Spacedman Nov 01 '11 at 21:58

1 Answer


If you really insist on using grep and readLines, you will face the problem of cleaning up the scraped page, but it can of course be done. E.g.:

Load the page:

html <- readLines('http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs')

With the help of str_extract from the stringr package and a simple regular expression, you are done:

library(stringr)
# (?s) lets '.' match across the newlines added by collapse
body <- str_extract(paste(html, collapse='\n'), '(?s)<div id="article-body-blocks">.*</div>')

Well, body looks ugly; you will have to clean out the <p> tags and scripts as well. This can be done with gsub and friends (nice regular expressions). For example:

gsub('<script(.*?)script>|<span(.*?)>|<div(.*?)>|</div>|</p>|<p(.*?)>|<a(.*?)>|\n|\t', '', body)

As @Andrie suggested, you should rather use one of the packages built for this purpose. Small demo:

library(XML)
library(RCurl)
# fetch the page and split it into lines
webpage <- getURL('http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs')
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
# parse into an HTML tree and pull the article paragraphs out by id
pagetree <- htmlTreeParse(webpage, useInternalNodes = TRUE, encoding='UTF-8')
body <- xpathSApply(pagetree, "//div[@id='article-body-blocks']/p", xmlValue)

Where body results in clean text:

> str(body)
 chr [1:33] "The deputy prime minister, Nick Clegg, has said the government's regional growth fund will provide a \"snowball effect that cre"| __truncated__ ...
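
To match the question's goal of saving the text to a file, something like this would do (the filename is just an example):

writeLines(body, 'guardian-article.txt')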

Update: the above as a one-liner (thanks to @Martin Morgan for the suggestion):

xpathSApply(htmlTreeParse('http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs', useInternalNodes = TRUE, encoding='UTF-8'), "//div[@id='article-body-blocks']/p", xmlValue)
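
For completeness, here is a rough, untested sketch of the accumulation loop described in the question, built on the XPath approach above. The XPath used to harvest follow-up links, the URL filter and the whitespace-based word count are all assumptions you will need to adapt to the actual pages:

library(XML)

scrape_corpus <- function(start_url, target_words = 1e6, outfile = 'corpus.txt') {
  queue <- start_url
  seen  <- character(0)
  words <- 0
  while (words < target_words && length(queue) > 0) {
    url   <- queue[1]
    queue <- queue[-1]
    if (url %in% seen) next
    seen  <- c(seen, url)
    tree  <- htmlTreeParse(url, useInternalNodes = TRUE, encoding = 'UTF-8')
    paras <- xpathSApply(tree, "//div[@id='article-body-blocks']/p", xmlValue)
    if (length(paras) > 0) {
      text  <- paste(paras, collapse = '\n')
      cat(text, '\n', file = outfile, append = TRUE)
      # crude word count: split on whitespace
      words <- words + length(strsplit(text, '[[:space:]]+')[[1]])
      # follow in-article links to other Guardian pages (assumed XPath)
      links <- xpathSApply(tree, "//div[@id='article-body-blocks']//a/@href")
      queue <- c(queue, links[grepl('^http://www\\.guardian\\.co\\.uk/', links)])
    }
    free(tree)
  }
  words
}

scrape_corpus('http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs')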
  • +1 Nice illustration of the power of `htmlTreeParse` and `xpathSApply` – Andrie Oct 31 '11 at 22:08
  • No need for RCurl / getURL / readLines / textConnection; `htmlTreeParse` reads URLs. – Martin Morgan Nov 01 '11 at 00:23
  • Thank you very much for your help, daroczig! All of you, really. This is great! I will get to work on it and hopefully manage the last steps of my scraper on my own. – Kat Nov 01 '11 at 08:37
  • Thank you @MartinMorgan, this makes things a lot simpler; I will have to update my scripts :) I've edited my answer now based on your suggestion. – daroczig Nov 01 '11 at 11:44
  • This [question](http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not) gets to the point: regex shouldn't be used to parse HTML documents. – marbel Mar 25 '14 at 15:02