I have data of the structure:

  • Main_Text
    • Sub1_text
    • Sub2_text
    • Etc (I have several hundred subfolders)

Each subfolder contains multiple .txt files.

I want to read all of the files into R, to create a data frame that looks like this:

Filename | Text

Name of file | Content of .txt file

I've tried the following two approaches, and neither quite works. Any help would be appreciated.

1) Using the readtext package: although this package supposedly loops through subfolders, I cannot get it to do so. The code to loop through the files in the readtext vignette should work like this:

dir <- "/Users/Main_Folder"
text = readtext(paste0(dir, "/Main_Text/*.txt"))

This only produces an error:

Error in listMatchingFiles(i, ignoreMissing = ignoreMissing, lastRound = T) : File '' does not exist.

It works, however, if I specify the subfolder, i.e.

text = readtext(paste0(dir, "/Main_Text/Sub1_text/*.txt"))

but given that I have several hundred subfolders, I need a more recursive solution.
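Because the subfolders in this layout sit exactly one level below Main_Text, one possible workaround is to expand the wildcard in base R with `Sys.glob()` and work from the resulting vector of paths. This is only a minimal sketch: the folder and file names below are hypothetical stand-ins created in a temporary directory, not the real data.

```r
# Hypothetical stand-in for /Users/Main_Folder/Main_Text, built in a temp dir
main_text <- file.path(tempdir(), "glob_demo", "Main_Text")
for (sub in c("Sub1_text", "Sub2_text")) {
  dir.create(file.path(main_text, sub), recursive = TRUE, showWarnings = FALSE)
  writeLines("example text", file.path(main_text, sub, paste0(sub, "_a.txt")))
}

# "*" in the directory position matches every subfolder one level down
txt_paths <- Sys.glob(file.path(main_text, "*", "*.txt"))
basename(txt_paths)
```

This only covers a single level of nesting. For folders nested to arbitrary depth, `list.files()` with `recursive = TRUE` is the more general tool.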

2) I've also tried a two-step solution, where I first create a list of the files and then attempt to read in the text. This also results in an error.

The first step generates an accurate list of all my files, but obviously doesn't include a step that reads the content:

setwd("/Users/Main_Folder")
dat = basename(list.files(pattern = ".txt$", recursive = TRUE, full.names=TRUE, include.dirs=TRUE))

So I also tried:

mypath="/Users/Main_Folder/"
txt_files_ls = list.files(path=mypath, recursive=T, pattern="*.txt")

Which works, however:

txt_files_df <- lapply(txt_files_ls, function(x) {read.table(file = x, header = F, fill=T, sep =",")})

Throws an error:

Error in read.table(file = x, header = F, fill = T, sep = ",") : no lines available in input
In addition: There were 42 warnings (use warnings() to see them)

If I specify

header=T

I get a different error:

Error in read.table(file = x, header = T, fill = T, sep = ",") : more columns than column names
In addition: Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :

So I can't even get to the final step of combining them using something like

combined_df <- do.call("rbind", lapply(txt_files_df, as.data.frame))

I have a sense of why this is, given that the text files themselves don't have headers, and have random formatting (they're press releases). Here's a sample of one of my .txt files:

cat(readLines("Aderholt_text/Aderholt1-28-11.txt"), sep = "\n")

Friday January 28, 2011 Contact: Darrell "DJ" Jordan 202-225-4876 CONGRESSMAN ROBERT ADERHOLT STATEMENT ON THE VIOLENCE IN ALBANIA Washington, DC - Congressman Robert Aderholt (R-Alabama) today issued th
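As for the first `read.table` error above, "no lines available in input" is the message `read.table()` gives when a file has no lines at all, so it may be worth checking whether some of the files are empty. The sketch below is one way to do that with `file.size()`. The demo folder and file names are hypothetical and are created in a temporary directory purely for illustration.

```r
# Hypothetical demo folder with one empty and one non-empty file
demo_dir <- file.path(tempdir(), "empty_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, "empty.txt"))
writeLines("Friday January 28, 2011", file.path(demo_dir, "release.txt"))

paths <- list.files(demo_dir, pattern = "\\.txt$", recursive = TRUE, full.names = TRUE)

# file.size() returns bytes; zero-byte files are the ones read.table() rejects
empty_files <- paths[file.size(paths) == 0]
basename(empty_files)
```

Any paths that show up in `empty_files` could then be dropped from the list before reading, or inspected by hand.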

I'm sure I'm missing something small, but can anyone help illuminate how to correctly read in the filenames + text, either using one of the half-working solutions I've tried, or something else?

Rachel B.
  • Instead of using `read.table`, can't you go straight for `readLines`, as in your last statement? Something like `txt_files_df <- lapply(txt_files_ls, readLines)` – Felipe Alvarenga Feb 28 '18 at 14:25
  • That reads them in! Thank you! It does, however, read each line break in the .txt files as another row instead of reading the whole file in as one row. Suggestion for how to change that? It looks like this: – Rachel B. Feb 28 '18 at 14:38
  • `[1] "Friday January 28, 2011" [2] "Contact: Darrell \"DJ\" Jordan" ` – Rachel B. Feb 28 '18 at 14:40
  • @RachelB. see https://stackoverflow.com/questions/9068397/import-text-file-as-single-character-string. you can use e.g. `paste(readLines(x), collapse = "\n")` – ozacha Feb 28 '18 at 14:41
  • Thank you--I'm missing where to integrate that into my code, though, specifically in the line where I create txt_files_df as in Felipe's comment. – Rachel B. Feb 28 '18 at 14:51
  • That would be inside the `lapply`: `txt_files_df <- lapply(txt_files_ls, function(x) paste(readLines(x), collapse = "\n"))` – ozacha Feb 28 '18 at 15:00
  • @RachelB. Hi, did you find a solution to your deleted question? If not, here is one: `a<-crossprod(table(df$vote.choice,df$rep)) %>% as.data.frame()` then `a %>% rownames_to_column() %>% gather(key,value,-rowname) %>% filter(rowname!=key) %>% arrange(rowname,key)`. Here is some reproducible data: `df <- read.table(text = " rep vote.choice RepA billa.1 RepB billa.1 RepA billa.2 RepB billa.2 RepC billa.3 RepA billa.3 RepC billa.1 ",header=T)` – A. Suliman Aug 25 '18 at 16:36
  • The readtext error could be caused by having an empty folder under your /Main_folder. – Irix3537106 Aug 10 '20 at 19:50
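Putting the suggestions from the comments together, the whole pipeline might look roughly like the sketch below: list the files recursively with full paths, collapse each file's lines into a single string, and build the two-column data frame directly. The folder tree and file contents here are hypothetical and are created in a temporary directory so the example is self-contained. `read_one` is an illustrative helper name, not something from the original code.

```r
# Hypothetical stand-in for the real folder tree, built in a temp dir
root <- file.path(tempdir(), "read_demo", "Main_Text")
dir.create(file.path(root, "Sub1_text"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "Sub2_text"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("Friday January 28, 2011", "Contact: DJ Jordan"),
           file.path(root, "Sub1_text", "a.txt"))
writeLines("Single line release", file.path(root, "Sub2_text", "b.txt"))

# full.names = TRUE keeps the folder in each path so readLines() can find the file
txt_files_ls <- list.files(root, pattern = "\\.txt$", recursive = TRUE, full.names = TRUE)

# Collapse each file's lines into one string so each file becomes one row
read_one <- function(path) paste(readLines(path, warn = FALSE), collapse = "\n")

combined_df <- data.frame(
  Filename = basename(txt_files_ls),
  Text = vapply(txt_files_ls, read_one, character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
```

With the real data, `root` would be replaced by the actual Main_Text path. Building the data frame in one step from two vectors also avoids the separate `do.call("rbind", ...)` step.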
