I have data with the following folder structure:
- Main_Text
  - Sub1_text
  - Sub2_text
  - Etc. (I have several hundred subfolders)
Each subfolder contains multiple .txt files.
I want to read all of the files into R, to create a data frame that looks like this:
Filename | Text
Name of file | Content of .txt file
I've tried the following two approaches, and neither quite works. Any help would be appreciated.
1) Using the readtext package: although this package supposedly loops through subfolders, I cannot get it to do so. The code to loop through the files in the readtext vignette should work like this:
dir <- "/Users/Main_Folder"
text = readtext(paste0(dir, "/Main_Text/*.txt"))
This only produces an error:
Error in listMatchingFiles(i, ignoreMissing = ignoreMissing, lastRound = T) : File '' does not exist.
It works, however, if I specify the subfolder, i.e.
text = readtext(paste0(dir, "/Main_Text/Sub1_text/*.txt"))
but given that I have several hundred subfolders, I need a more recursive solution.
2) I've also tried a two-step approach: first build a list of the files, then read in the text. The second step results in an error.
The following generates an accurate list of all my file names, but obviously doesn't include a step that reads the content:
setwd("/Users/Main_Folder")
dat = basename(list.files(pattern = ".txt$", recursive = TRUE, full.names=TRUE, include.dirs=TRUE))
So I also tried:
mypath="/Users/Main_Folder/"
txt_files_ls = list.files(path=mypath, recursive=T, pattern="*.txt")
Which works, however:
txt_files_df <- lapply(txt_files_ls, function(x) {read.table(file = x, header = F, fill=T, sep =",")})
Throws an error:
Error in read.table(file = x, header = F, fill = T, sep = ",") : no lines available in input
In addition: There were 42 warnings (use warnings() to see them)
If I specify
header=T
I get a different error:
Error in read.table(file = x, header = T, fill = T, sep = ",") : more columns than column names
In addition: Warning message: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
So I can't even get to the final step of combining them using something like
combined_df <- do.call("rbind", lapply(txt_files_df, as.data.frame))
I have a sense of why this is, given that the text files themselves don't have headers, and have random formatting (they're press releases). Here's a sample of one of my .txt files:
cat(readLines("Aderholt_text/Aderholt1-28-11.txt"), sep = "\n")
Friday January 28, 2011 Contact: Darrell "DJ" Jordan 202-225-4876 CONGRESSMAN ROBERT ADERHOLT STATEMENT ON THE VIOLENCE IN ALBANIA Washington, DC - Congressman Robert Aderholt (R-Alabama) today issued th
I'm sure I'm missing something small, but can anyone help illuminate how to correctly read in the filenames + text, either using one of the half-working solutions I've tried, or something else?
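To make the target concrete, here is a minimal base R sketch of the result I'm after. It is only an illustration under my own assumptions: `read_txt_tree` is a hypothetical helper name (not part of readtext), the demo files are made up, and it treats each file as free text via readLines rather than as a table.

```r
# Hypothetical helper: read every .txt file under `root`, including
# subfolders, into a data frame with one row per file.
read_txt_tree <- function(root) {
  # full.names = TRUE keeps the subfolder in each path so the file can be opened
  paths <- list.files(root, pattern = "\\.txt$", recursive = TRUE,
                      full.names = TRUE)
  # Read each file as raw lines and collapse them into a single string
  texts <- vapply(paths, function(p) {
    paste(readLines(p, warn = FALSE), collapse = "\n")
  }, character(1), USE.NAMES = FALSE)
  data.frame(Filename = basename(paths), Text = texts,
             stringsAsFactors = FALSE)
}

# Self-contained demo on a temporary folder that mimics the layout above
demo_root <- file.path(tempfile("demo_"), "Main_Text")
dir.create(file.path(demo_root, "Sub1_text"), recursive = TRUE)
dir.create(file.path(demo_root, "Sub2_text"), recursive = TRUE)
writeLines(c("First line", "Second line"),
           file.path(demo_root, "Sub1_text", "a.txt"))
writeLines("Only line", file.path(demo_root, "Sub2_text", "b.txt"))

combined_df <- read_txt_tree(demo_root)
str(combined_df)
```

The file paths are kept in full for reading, and basename() is applied only when filling the Filename column, so files in different subfolders can still be found on disk.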