Until about a month ago, the code shown below let me import a series of .txt documents stored in a local folder into R, create a Corpus, pre-process it, and finally convert it into a Document Term Matrix. The issue I am having now is that the document names are no longer imported; instead, each document is listed as 'character(0)'.

One of my aims is to conduct topic modelling on the corpus and so it is important that I can relate the document names to the topics that the model produces.

Does anyone have any suggestions as to what has changed? Or how I can fix this?

library("tm")
library("SnowballC")

setwd("C:/Users/Documents/Dataset/")
corpus <-Corpus(DirSource("blog"))


#pre_processing
myStopwords <- c(stopwords("english"))
your_corpus <- tm_map(corpus, tolower)
your_corpus <- tm_map(your_corpus, removeNumbers)
your_corpus <- tm_map(your_corpus, removeWords, myStopwords) 
your_corpus <- tm_map(your_corpus, stripWhitespace)
your_corpus <- tm_map(your_corpus, removePunctuation)
your_corpus <- tm_map(your_corpus, stemDocument)
your_corpus <- tm_map(your_corpus, PlainTextDocument)

#creating a document term matrix
myDtm <- DocumentTermMatrix(your_corpus, control=list(wordLengths=c(3,Inf)))

dim(myDtm)
inspect(myDtm)
user3587152
  • I previously had this problem, but don't remember the issue / resolution. If you examine your_corpus after every operation, you can see when the id is dropped. Then you can search SO for that operation. Also, check this answer http://stackoverflow.com/questions/24501514/keep-document-id-with-r-corpus –  Oct 08 '14 at 14:25

2 Answers

Here's a debugging session to identify and correct the loss of the file name. The tolower line is wrapped in content_transformer() and the PlainTextDocument line is commented out, since both of these lines remove the file information. Calling meta() after each step shows where the id is dropped. Also, if you check ds$reader, you can see that the baseline reader already creates a plain text document.

library("tm")
library("SnowballC")

# corpus <-Corpus(DirSource("blog"))

sf <- system.file("texts", "txt", package = "tm")
ds <- DirSource(sf)
your_corpus <- Corpus(ds)

# Check status with the following line
meta(your_corpus[[1]])

#pre_processing
myStopwords <- c(stopwords("english"))
# your_corpus <- tm_map(your_corpus, tolower)
your_corpus <- tm_map(your_corpus, content_transformer(tolower))
meta(your_corpus[[1]])
your_corpus <- tm_map(your_corpus, removeNumbers)
meta(your_corpus[[1]])
your_corpus <- tm_map(your_corpus, removeWords, myStopwords) 
meta(your_corpus[[1]])
your_corpus <- tm_map(your_corpus, stripWhitespace)
meta(your_corpus[[1]])
your_corpus <- tm_map(your_corpus, removePunctuation)
meta(your_corpus[[1]])
your_corpus <- tm_map(your_corpus, stemDocument)
meta(your_corpus[[1]])
#your_corpus <- tm_map(your_corpus, PlainTextDocument)
#meta(your_corpus[[1]])

#creating a document term matrix
myDtm <- DocumentTermMatrix(your_corpus, control=list(wordLengths=c(3,Inf)))

dim(myDtm)
inspect(myDtm)
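
The effect of wrapping tolower can also be checked in isolation. The following is a minimal sketch, not part of the session above: it assumes a tm version of 0.6 or later and builds the corpus with VCorpus() explicitly (an assumption; the code above calls Corpus()). The names lowered_raw and lowered_ok are made up for illustration. It compares what a bare tolower and a wrapped tolower leave behind in the first document.

```r
library("tm")

# Sample texts shipped with tm, the same folder used in the session above
ds <- DirSource(system.file("texts", "txt", package = "tm"))
corpus <- VCorpus(ds)  # assumption: VCorpus instead of Corpus

# Bare tolower: each document is replaced by whatever tolower() returns
lowered_raw <- tm_map(corpus, tolower)
class(lowered_raw[[1]])
inherits(lowered_raw[[1]], "TextDocument")

# Wrapped tolower: only the content is transformed, the document object is kept
lowered_ok <- tm_map(corpus, content_transformer(tolower))
inherits(lowered_ok[[1]], "TextDocument")
meta(lowered_ok[[1]], "id")
```

If the first inherits() call reports that the element is no longer a TextDocument, that would be consistent with the `inherits(doc, "TextDocument") is not TRUE` error mentioned in the comments.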
  • @user3969377: if I comment out the PlainText Document line, I get an `Error: inherits(doc, "TextDocument") is not TRUE`. it was only to get rid of this, that I introduced the PlainText Document transform. File names still missing though. – Pradeep Jun 30 '16 at 10:08
  • You can access the 'id' field in the corpus and replace it in a loop with your file names. The id is accessible from here. Replace names easily like this: your_corpus[[2]]$meta$id <- "2ndfileName". Put this in a loop and you are good to go. – Espanta Sep 10 '16 at 07:02
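
The loop suggested in the last comment could look roughly like the sketch below. It is only an illustration under a few assumptions: the corpus is built with VCorpus() rather than Corpus(), the file names are taken from ds$filelist, and the variable doc_ids is a made-up name. It uses meta(doc, "id") <- value rather than assigning to $meta$id directly; both target the same field.

```r
library("tm")

# Sample texts shipped with tm stand in for the "blog" folder
ds <- DirSource(system.file("texts", "txt", package = "tm"))
your_corpus <- VCorpus(ds)  # assumption: VCorpus instead of Corpus

# Keep the file names before any transformation runs (doc_ids is hypothetical)
doc_ids <- basename(ds$filelist)

# The transformation from the question that leaves the id empty
your_corpus <- tm_map(your_corpus, PlainTextDocument)

# Write the saved file names back, one document at a time
for (i in seq_along(your_corpus)) {
  meta(your_corpus[[i]], "id") <- doc_ids[i]
}

myDtm <- DocumentTermMatrix(your_corpus, control = list(wordLengths = c(3, Inf)))
Docs(myDtm)  # document names as they appear in the matrix
```

This relies on the documents staying in the same order as the files in ds$filelist, so it is worth spot-checking one or two documents against their names.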
Here's an approach using qdap where I make a function to read in a directory of files and convert them to a data.frame:

library(qdap)
sf <- system.file("texts", "txt", package = "tm")

read_in <- function(sf) {
    list2df(setNames(lapply(file.path(sf, dir(sf)), function(x) {
        clean(unbag(readLines(x)))}), dir(sf)), "text", "source")[, 2:1]
}

mydtm <- with(read_in(sf), as.dtm(text, source, stem=TRUE, 
    stopwords=tm::stopwords("english")))
mydtm <- Filter(mydtm, min=3)
inspect(mydtm)
Tyler Rinker