I want to read a single file (an HTML document) from my computer and store it in a Corpus (I'm using the tm package).

Is there a way to do that?

Here is what I tried:

data<-read.csv(fileName)
c2<-Corpus(VectorSource(data))

It mostly works, but I sometimes get the error: more columns than column names

I guess I'm not supposed to use read.csv for a web page, but I didn't find a better solution.

Thanks for your help =)

HoldOffHunger
Simon-Okp
  • Check out [this previous question on extracting text from HTML](http://stackoverflow.com/questions/3195522/is-there-a-simple-way-in-r-to-extract-only-the-text-elements-of-an-html-page). – Matt Parker Mar 22 '12 at 14:45

1 Answer

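A web page definitely does not conform to the format that a CSV file should follow, so read.csv is the wrong tool here. Instead, you probably want to use the readHTMLTable function from the XML package.

If you just want the raw content of the page, readLines is enough. This example grabs an actual web page, but the same idea applies to a local file: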

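library(tm)  # provides Corpus() and VectorSource()
file <- "http://xkcd.com/"
dat <- readLines(file)  # one character string per line of HTML
c2 <- Corpus(VectorSource(dat))  # each line becomes a separate document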
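
Note that readLines returns one string per line, and VectorSource treats each element of its input as a separate document, so the snippet above gives a corpus with one document per line of HTML. If the goal is to store the whole page, tags included, as a single document, one option is to collapse the lines first. The following is only a minimal sketch under some assumptions: fileName is a hypothetical path, and a small temporary file stands in for the real HTML document so that the example runs on its own.

```r
library(tm)

# Hypothetical local file: a temporary file stands in for your HTML document.
fileName <- tempfile(fileext = ".html")
writeLines(c("<html>", "<body><p>Hello, corpus</p></body>", "</html>"), fileName)

# readLines returns one character string per line of the file.
lines <- readLines(fileName)

# Collapse the lines into one string so the whole page is a single document.
page <- paste(lines, collapse = "\n")

# A length-one character vector yields a corpus with exactly one document.
c2 <- Corpus(VectorSource(page))
length(c2)
```

With your own file, drop the tempfile and writeLines lines and set fileName to the path of your document. The HTML tags are kept as they are, since nothing here parses or strips the markup.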
Dason
  • Thanks for your answer, but I don't see how I am supposed to create a corpus with the result of readHTMLTable. Would you mind giving me an example, please? – Simon-Okp Mar 22 '12 at 16:31
  • @user1278743 Just to clarify, you're specifically looking for the text of the HTML page, right? You don't really need to extract the page's tables for your corpus, correct? – Matt Parker Mar 22 '12 at 17:08
  • @user1278743 Ah, I see. What would you like returned? Just the text displayed? The HTML tags as well? Can you provide more details about what you expect to be returned? – Dason Mar 22 '12 at 17:21
  • Yes, I want the text, with the HTML tags and everything. I want to put the entire text in a corpus. – Simon-Okp Mar 23 '12 at 09:54