Remove characters in text corpus

Question

I'm analyzing a corpus of emails, some of which contain URLs. When I apply the removePunctuation function from the tm library, a URL collapses into a token such as httpwww, and the information in the web address is lost. What I would like to do is replace "://" with " " across the whole corpus. I tried gsub, but then the data type of the corpus changes and I can't continue processing it with the tm package.

Here is an example. As you can see, gsub changes the class of the corpus to a character vector, causing tm_map to fail:

> corpus
# A corpus with 4257 text documents
> corpus1 <- gsub("http://","http ",corpus)
> class(corpus1)
# [1] "character"
> class(corpus)
# [1] "VCorpus" "Corpus"  "list"   
> cleanSW <- tm_map(corpus1,removeWords, stopwords("english"))
# Error in UseMethod("tm_map", x) : 
# no applicable method for 'tm_map' applied to an object of class "character"
> cleanSW <- tm_map(corpus,removeWords, stopwords("english"))
> cleanSW
# A corpus with 4257 text documents

How can I work around this? Maybe there's a way to convert the character vector back to a corpus?

what about the other `/` and the possible `.` or `:` in the web address? — James, May 28 '14 at 09:42
Same issue, I just gave the :// as an example, but as you mentioned it applies to some more characters as well. — Yoav, May 29 '14 at 06:44
You haven't gotten any help because you haven't provided data. A minimal working example is almost always required. — Tyler Rinker, Jun 10 '14 at 20:39

score 2 · Accepted Answer · edited May 23 '17 at 12:29

I found a solution to this problem in the question "Removing non-English text from Corpus in R using tm()": rebuilding the corpus from the character vector with Corpus(VectorSource(dat1)) worked for me.
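For illustration, here is a minimal sketch of that approach, plus an alternative that keeps the corpus class intact. The `emails` vector is made-up sample data, `toSpace` is a hypothetical helper name, and `content_transformer()` assumes tm 0.6 or later. I use VCorpus() rather than Corpus() here only to match the class shown in the question.

```r
library(tm)

# Hypothetical sample data standing in for the email texts
emails <- c("see http://www.example.com for details",
            "no link in this one")

# Option 1: clean the raw character vector first, then (re)build the corpus
cleaned <- gsub("://", " ", emails, fixed = TRUE)
corpus1 <- VCorpus(VectorSource(cleaned))

# Option 2: transform an existing corpus; content_transformer() wraps a
# plain character function so that tm_map() still returns a corpus
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus2 <- tm_map(VCorpus(VectorSource(emails)), toSpace, "://")

# Both results are still corpora, so later tm_map() steps keep working
cleanSW <- tm_map(corpus1, removeWords, stopwords("english"))
class(corpus1)
class(corpus2)
```

Since `toSpace` passes the pattern to gsub as a regular expression, a character class such as "[:/.]" should also cover the other characters mentioned in the comments.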

answered Jul 31 '14 at 15:19 by Lot · edited May 23 '17 at 12:29 by Community
