Create a corpus with a single file (webpage)

Question

I want to read a single file (an HTML document) from my computer and store it in a Corpus (I'm using the tm package).

Do you have any solution to do that?

Here is what I tried:

data<-read.csv(fileName)
c2<-Corpus(VectorSource(data))

It mostly works, but I sometimes get the error: more columns than column names

I guess I'm not supposed to use read.csv for a webpage, but I didn't find a better solution.

Thanks for your help =)

Check out [this previous question on extracting text from HTML](http://stackoverflow.com/questions/3195522/is-there-a-simple-way-in-r-to-extract-only-the-text-elements-of-an-html-page). – Matt Parker Mar 22 '12 at 14:45

Dason · Accepted Answer · 2012-04-11T22:19:02.270

8

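A webpage definitely does not conform to the CSV format, so read.csv is the wrong tool here. If it is the tables on the page you are after, you probably want the readHTMLTable function from the XML package.

To get the raw page, tags included, readLines is enough. This example reads from an actual webpage, but the same idea applies to a local file: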

library(tm)

file <- "http://xkcd.com/"
dat <- readLines(file)            # one character string per line of HTML
c2 <- Corpus(VectorSource(dat))   # each element of dat becomes a document
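
Note that readLines returns one string per line, so a corpus built directly from its result holds one document per line of HTML. If the goal is a single document containing the whole page, a minimal sketch could collapse the lines first. The file path and its contents below are made up so the example runs on its own; substitute your own `fileName`.

```r
library(tm)

# Hypothetical local file, created here only so the sketch is self-contained
fileName <- tempfile(fileext = ".html")
writeLines(c("<html>", "<body><p>Hello corpus</p></body>", "</html>"), fileName)

# Read the file line by line, then join the lines into a single string
lines <- readLines(fileName, warn = FALSE)
page <- paste(lines, collapse = "\n")

# A length-one character vector gives a corpus with exactly one document
c2 <- Corpus(VectorSource(page))
length(c2)
```

Because the HTML is kept as-is, the tags end up in the corpus too, which matches what was asked for in the comments.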

edited Apr 11 '12 at 22:19

answered Mar 22 '12 at 14:13

Thanks for your answer, but I don't see how I am supposed to create a corpus with the result of readHTMLTable. Would you mind giving me an example, please? – Simon-Okp Mar 22 '12 at 16:31
@user1278743 Just to clarify, you're specifically looking for the text of the HTML page, right? You don't really need to extract the page's tables for your corpus, correct? – Matt Parker Mar 22 '12 at 17:08
@user1278743 Ah, I see. What would you like returned? Just the text displayed? The HTML tags as well? Can you provide more details about what you expect to be returned? – Dason Mar 22 '12 at 17:21
Yes, I want the text, with the HTML tags and everything. I want to put the entire text in a corpus. – Simon-Okp Mar 23 '12 at 09:54
