I am quite new to R and want to compile a 1-million-word corpus of newspaper articles. To do that, I am trying to write a web scraper that retrieves articles from, e.g., the Guardian website: http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs.
The scraper is meant to start on one page, retrieve the article's body text, remove all tags and save the result to a text file. It should then follow the links on that page to the next article, retrieve that article, and so on until the file contains about 1 million words.
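To make the intended workflow concrete, here is a rough outline of the loop in base R. It is only a sketch of what I have in mind, not a working scraper: the names `crawl`, `strip_tags`, `extract_links` and `count_words` are made up for illustration, the `fetch` argument stands in for whatever downloads a page, and the in-memory `pages` list replaces the real website so the example runs without a network connection.

```r
# Hypothetical helpers; all names here are illustrative only.

# Count whitespace-separated words in a string.
count_words <- function(text) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  sum(nzchar(words))
}

# Pull the values of all href="..." attributes out of a page (crude regex).
extract_links <- function(html) {
  hrefs <- regmatches(html, gregexpr('href="[^"]+"', html))[[1]]
  unique(sub('^href="([^"]+)"$', "\\1", hrefs))
}

# Replace tags with spaces, then squeeze runs of whitespace.
strip_tags <- function(html) {
  trimws(gsub("\\s+", " ", gsub("<[^>]+>", " ", html)))
}

# Visit pages in queue order until the word target is reached
# or no unvisited links remain. Returns the number of words written.
crawl <- function(start_url, fetch, target_words, out_file) {
  queue <- start_url
  seen <- character(0)
  total <- 0
  while (length(queue) > 0 && total < target_words) {
    url <- queue[1]
    queue <- queue[-1]
    if (url %in% seen) next
    seen <- c(seen, url)
    html <- fetch(url)
    if (is.null(html)) next            # page could not be retrieved
    text <- strip_tags(html)
    cat(text, "\n", file = out_file, append = TRUE, sep = "")
    total <- total + count_words(text)
    queue <- c(queue, setdiff(extract_links(html), seen))
  }
  total
}

# Fake three-page "website" so the sketch runs offline.
pages <- list(
  "a" = '<p>one two three <a href="b">next</a></p>',
  "b" = '<p>four five <a href="a">back</a> <a href="c">more</a></p>',
  "c" = '<p>six</p>'
)
fetch_fake <- function(url) pages[[url]]

out <- tempfile(fileext = ".txt")
total <- crawl("a", fetch_fake, target_words = 6, out_file = out)
```

For real pages, `fetch` could presumably be something like `function(url) paste(readLines(url, warn = FALSE), collapse = "\n")`. The sketch strips tags from the whole page rather than only the article body, does not resolve relative links, and stops only after the page that pushes the total past the target, so it overshoots slightly.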
Unfortunately, I did not get very far with my scraper.
So far I have used readLines() to fetch the page's source, and I would now like to extract the relevant part of it.
The Guardian marks the body text of the article with this id:
<div id="article-body-blocks">
<p>
<a href="http://www.guardian.co.uk/politics/boris"
title="More from guardian.co.uk on Boris Johnson">Boris Johnson</a>,
the...a different approach."
</p>
</div>
I tried to extract this section using various regular expressions with grep and lookbehind - trying to get the line after this id - but I think they do not match across multiple lines. At least I cannot get it to work.
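As far as I understand it, readLines() returns a character vector with one element per line, and grep() tests each element separately, so a pattern can never match across lines. Below is a minimal sketch of one direction that might work: collapse the lines into a single string first, then match with a Perl-style regex. The vector `page_lines` is a stand-in that mirrors the fragment quoted above (not the real page), and the variable names are my own.

```r
# Stand-in for readLines(url): one element per line of page source.
# This sample mirrors the quoted fragment; it is not the real page.
page_lines <- c(
  '<html><body>',
  '<div id="article-body-blocks">',
  '<p>',
  '<a href="http://www.guardian.co.uk/politics/boris"',
  'title="More from guardian.co.uk on Boris Johnson">Boris Johnson</a>,',
  'the...a different approach."',
  '</p>',
  '</div>',
  '</body></html>'
)

# Collapse the vector into one string so a pattern can span line breaks.
page <- paste(page_lines, collapse = "\n")

# (?s) lets "." match newlines; (.*?) is non-greedy, so the match
# stops at the first closing </div> after the opening tag.
pattern <- '(?s)<div id="article-body-blocks">(.*?)</div>'
m <- regmatches(page, regexec(pattern, page, perl = TRUE))[[1]]
body <- m[2]   # first capture group: the inside of the div

# Remove tags (a tag may itself contain a line break), then tidy whitespace.
text <- gsub("<[^>]+>", "", body)
text <- trimws(gsub("\\s+", " ", text))
print(text)
```

This assumes the article div contains no nested div elements; if it does, the non-greedy match would stop too early. A proper HTML parser (for example htmlParse() and xpathSApply() from the XML package) would probably be more robust than regular expressions here.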
Could anybody help out? It would be great if somebody could provide me with some code I can continue working on!
Thanks.