I have a set of text files in a directory, each of which is a document with multiple paragraphs. I want to read those documents in and run sentiment analysis on them.

For example, I have one text document, data/hello.txt, with text like this:

"Hello world.  
 This is an apple.

 That is an orange"

I read the document in as follows (there can also be multiple documents):

library("tm")
# DirSource() expects a directory, not a single file path
docs <- VCorpus(DirSource('./data'))

When I look at the document content with docs[[1]]$content, it appears to be a character vector with one element per line:

[1] "hello  world"        "this is apple."      ""                   
[4] "That is an orange. " ""  

My question: how can I read in those documents (using VCorpus from the tm package) so that, within each document, the paragraphs are concatenated into one single character string that I can use for sentiment analysis?

Thanks a lot.

zesla
  • As a starting point, this post shows how to go from a directory of raw text files to analysis in R: https://stackoverflow.com/a/60321956/1839959 From there you could modify the mutate statement to add paragraph indicators if that is what you need. – Stan Feb 20 '20 at 16:11

1 Answer

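You could use the readtext package to read in the texts, and then construct the VCorpus using VectorSource(). readtext() returns a data.frame with one row per file, with the whole file held as a single string in the text column, so the paragraphs are never split into separate elements.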

txt <- "Hello world.\nThis is an apple.\n\nThat is an orange"

tf <- tempfile("temp", fileext = ".txt")
cat(txt, file = tf)

library("readtext")
rtxt <- readtext(tf)

cat(rtxt$text)
## Hello world.
## This is an apple.
## 
## That is an orange

library("tm")
## Loading required package: NLP
docs <- VCorpus(VectorSource(rtxt$text))
cat(docs[[1]]$content)
## Hello world.
## This is an apple.
## 
## That is an orange
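
If you would rather not add a dependency, the same idea can be sketched in base R: read each file line by line and paste the lines back together before handing the result to VectorSource(). This is only a minimal sketch. read_whole_file() is a hypothetical helper name, not a function from tm or readtext, and the temporary data_demo directory merely stands in for your ./data folder.

```r
# Hypothetical helper (not part of tm or readtext): read one file and
# collapse its lines into a single string.
read_whole_file <- function(path, sep = "\n") {
  lines <- readLines(path, warn = FALSE)  # one element per line
  paste(lines, collapse = sep)            # one string per file
}

# Demo directory standing in for ./data (an assumption for illustration)
data_dir <- file.path(tempdir(), "data_demo")
dir.create(data_dir, showWarnings = FALSE)
writeLines(c("Hello world.", "This is an apple.", "", "That is an orange"),
           file.path(data_dir, "hello.txt"))

# One string per .txt file, named by file name
paths <- list.files(data_dir, pattern = "\\.txt$", full.names = TRUE)
texts_by_file <- vapply(paths, read_whole_file, character(1))
names(texts_by_file) <- basename(paths)

length(texts_by_file)  # one element per file

# With tm installed, the vector can go straight into VectorSource()
if (requireNamespace("tm", quietly = TRUE)) {
  docs <- tm::VCorpus(tm::VectorSource(texts_by_file))
}
```

Collapsing with "\n" keeps the paragraph breaks inside the string. Pass sep = " " instead if your sentiment tool prefers plain running text.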

The data.frame created by readtext() can also be used directly in the quanteda package (a more full-featured tm alternative).

# alternative
library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

corp <- corpus(rtxt)  # works directly
cat(texts(corp))      # simpler?
## Hello world.
## This is an apple.
## 
## That is an orange

VCorpus(VectorSource(texts(corp))) # if you must...
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1
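
If the corpus has already been built with DirSource(), so that each document's content is a character vector of lines as in the question, another option is to collapse the lines in place with tm_map() and content_transformer(). The sketch below assumes that setup. collapse_lines() is a hypothetical helper, and the temporary directory again stands in for ./data.

```r
# Hypothetical helper: join a document's lines into one string,
# skipping the empty lines that separate paragraphs.
collapse_lines <- function(x) paste(x[nzchar(trimws(x))], collapse = " ")

lines <- c("Hello world.", "This is an apple.", "", "That is an orange")
one_string <- collapse_lines(lines)
one_string

# Applied to a tm corpus (skipped when tm is not installed)
if (requireNamespace("tm", quietly = TRUE)) {
  demo_dir <- file.path(tempdir(), "collapse_demo")  # stand-in for ./data
  dir.create(demo_dir, showWarnings = FALSE)
  writeLines(lines, file.path(demo_dir, "hello.txt"))

  docs <- tm::VCorpus(tm::DirSource(demo_dir))
  # content_transformer() wraps a plain function so that it rewrites
  # each document's content and leaves the metadata alone
  docs <- tm::tm_map(docs, tm::content_transformer(collapse_lines))
  length(docs[[1]]$content)  # one string per document after collapsing
}
```

Dropping the blank lines is a design choice: it gives running text, but loses the paragraph boundaries. Keep them by collapsing with "\n" if they matter for your analysis.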
Ken Benoit