Creating a corpus from multiple txt files

Question

I have multiple txt files and I want to turn them into tidy data. To do that, I first create a corpus (I am not sure whether this is the right way to do it). I wrote the following code to build the corpus.

folder<-"C:\\Users\\user\\Desktop\\text analysis\\doc"
list.files(path=folder) 
filelist<- list.files(path=folder, pattern="*.txt")
paste(folder, "\\", filelist)
filelist<-paste(folder, "\\", filelist, sep="")
typeof(filelist)
a<- lapply(filelist,FUN=readLines)
corpus <- lapply(a ,FUN=paste, collapse=" ")

When I check class(corpus), it returns "list". From that point, how can I create tidy data?

phiver · Answer 1 · 2019-02-24T10:48:13.880

Looking at your other question as well, you would benefit from reading up on text mining and on how to read in files. Your result is currently a list object. That is not a bad object in itself, but it is not what you need here. Instead of lapply, use sapply in your last line, like this:

corpus <- sapply(a , FUN = paste, collapse = " ")

This will return a character vector. Next you need to turn this into a data.frame. I added the filelist to the data.frame to keep track of which text belongs to which document.

my_data <- data.frame(files = filelist, text = corpus, stringsAsFactors = FALSE)

and then use tidytext to continue:

library(tidytext)
tidy_text <- unnest_tokens(my_data, words, text)
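
To see why the switch from lapply to sapply matters, here is a minimal base-R sketch. It writes two small throwaway files to a temporary directory (the file names and contents are made up for illustration) and compares the two results. Using full.names = TRUE in list.files is a suggestion added here, not part of the code above; it returns complete paths so the paste() step is not needed.

```r
# Throwaway input: two small text files in a fresh temporary directory.
# File names and contents are invented for this illustration.
folder <- tempfile("corpus_demo_")
dir.create(folder)
writeLines(c("first line", "second line"), file.path(folder, "doc1.txt"))
writeLines("only one line", file.path(folder, "doc2.txt"))

# full.names = TRUE returns complete paths, so no paste() is needed.
filelist <- list.files(path = folder, pattern = "\\.txt$", full.names = TRUE)
a <- lapply(filelist, FUN = readLines)

as_list   <- lapply(a, FUN = paste, collapse = " ")  # one list element per file
as_vector <- sapply(a, FUN = paste, collapse = " ")  # one string per file

class(as_list)    # "list"
class(as_vector)  # "character"

# A character vector fits directly into a data.frame column.
my_data <- data.frame(files = basename(filelist), text = as_vector,
                      stringsAsFactors = FALSE)
str(my_data)
```

Each file becomes one row of my_data, which is the shape unnest_tokens expects as input.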

Using the tm and tidytext packages

If you use the tm package, you can read everything in like this:

library(tm)
folder <- getwd() # <-- here goes your folder

# pattern is a regular expression, not a glob, so match the ".txt" suffix explicitly
corpus <- VCorpus(DirSource(directory = folder,
                            pattern = "\\.txt$"))

which you could then convert to a tidy format like this:

library(tidytext)
tidy_corpus <- tidy(corpus)
tidy_text <- unnest_tokens(tidy_corpus, words, text)

It worked very well, thanks. One point is not clear to me: when I create it, I get two columns (understandable), but one of them is named `files....filelist` and the other is words. I tried to rename `files....filelist` to filename with `colnames(tidy_text[colnames(tidy_text)=="files....filelist"] <- "filenames" ` but this does not work. Why? — FGH, Feb 24 '19 at 10:32
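
On the renaming question in the comment above: as typed, the opening parenthesis of colnames( is never closed, and the indexing is applied to the data frame rather than to its vector of names. A minimal sketch of the usual base-R pattern follows, using a toy data frame with the same column names (the values are made up for illustration, and the target name filename is just one possible choice):

```r
# Toy stand-in for tidy_text; the values are invented for illustration.
tidy_text <- data.frame(
  `files....filelist` = c("a.txt", "a.txt"),
  words = c("hello", "world"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Index the names vector itself and assign the new name to the matching element.
colnames(tidy_text)[colnames(tidy_text) == "files....filelist"] <- "filename"

colnames(tidy_text)
```

The difference from the attempt in the comment is the position of the brackets: colnames(x)[...] <- value replaces one entry of the names, whereas colnames(x[...]) would take the names of a subset of x.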

Julia Silge · Accepted Answer · 2019-03-04T03:20:14.887

If you have text files and you want tidy data, I would go straight from one to the other and not bother with the tm package in between.

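To find all the text files within a working directory, you can use list.files() with a pattern argument. The pattern is a regular expression, so the dot is escaped to match a literal ".txt" suffix:

all_txts <- list.files(pattern = "\\.txt$")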

The all_txts object will then be a character vector that contains all your filenames.

Then, you can set up a pipe to read in all the text files and unnest them using tidytext with a map function from purrr. You can use a mutate() within the map() to annotate each line with the filename, if you'd like.

library(tidyverse)
library(tidytext)

# tibble() replaces the deprecated data_frame()
map_df(all_txts, ~ tibble(txt = read_file(.x)) %>%
        mutate(filename = basename(.x)) %>%
        unnest_tokens(word, txt))
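
A self-contained sketch of the same pipeline, run on two throwaway files so the shape of the result is visible. The file names and contents are made up, and full.names = TRUE is an addition here so that the files can live outside the working directory:

```r
library(tidyverse)
library(tidytext)

# Throwaway input; names and contents are invented for illustration.
folder <- tempfile("tidy_demo_")
dir.create(folder)
write_lines("The cat sat.", file.path(folder, "a.txt"))
write_lines("The dog ran away.", file.path(folder, "b.txt"))

all_txts <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)

# One row per token, tagged with the file it came from.
tidy_text <- map_df(all_txts, ~ tibble(txt = read_file(.x)) %>%
                      mutate(filename = basename(.x)) %>%
                      unnest_tokens(word, txt))

# Token counts per file, as a quick sanity check.
count(tidy_text, filename)
```

By default unnest_tokens lowercases the text and drops punctuation, so the result has one lowercase word per row alongside the filename column.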

Great answer. I elaborated on it here: https://stackoverflow.com/a/60321956/1839959 — Stan, Feb 20 '20 at 14:26
