
I have several thousand .txt files in a directory and would like to read them all in for a tidytext analysis, where I would then add columns of metadata. The filenames themselves contain all of the metadata, and I have been successful in using substr to parse the different pieces of metadata (location, time, date, etc.) from a single filename, but I cannot find an example of how I might do this for all of the files in the directory.

For example, I have the .txt files:

FFTJan141138

FFTJan151136

FFTJan161151

FFTJan171144

I have managed to read the files from my working directory into a tibble using:

library(tidyverse)  # map_chr(), read_file(), and data_frame() come from purrr, readr, and tibble

tbl <- list.files(pattern = "*.txt") %>% 
  map_chr(~ read_file(.)) %>% 
  data_frame(text = .)

What I need help with is inserting some data columns that correspond to the metadata in the file names.

For example, for the first file, FFTJan141138, the tibble's row currently has a single column containing the contents of that file. I would like to add four additional columns to this row containing FFT, Jan, 14, and 1138. I can parse the text in the file names with substr, but I don't know how to do this as the data is read in. Any help would be appreciated.
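
For reference, this is roughly the kind of substr() parsing I have working for a single filename. The character positions assume the fixed-width FFTJan141138-style names shown above, and the variable names are just placeholders:

# Sketch of parsing one fixed-width filename with substr();
# positions assume the FFTJan141138-style names above.
fname  <- "FFTJan141138"
prefix <- substr(fname, 1, 3)   # "FFT"
month  <- substr(fname, 4, 6)   # "Jan"
day    <- substr(fname, 7, 8)   # "14"
time   <- substr(fname, 9, 12)  # "1138"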

Thanks.

AlanS

1 Answer


I'd adjust your workflow just a little bit to get the info you want. To find all the text files within a working directory, you can use list.files() with a pattern argument:

all_txts <- list.files(pattern = ".txt$")

The all_txts object will then be a character vector that contains all your filenames.

Then, you can set up a pipe to read in all the text files and use a mutate() within the map() to annotate each row with its filename, if you'd like.

library(tidyverse)

# read each file into one row of a data frame and record which file it came from
map_df(all_txts, ~ data_frame(txt = read_file(.x)) %>%
        mutate(filename = basename(.x)))
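
A sketch of how that filename column could then be split into the metadata columns described in the question might look like the following; the substr() positions assume the fixed-width FFTJan141138-style names, and the column names (prefix, month, day, time) are only illustrative:

# Sketch only: same pipeline as above, with extra mutate() calls that
# parse the fixed-width FFTJan141138-style filenames into metadata columns.
map_df(all_txts, ~ data_frame(txt = read_file(.x)) %>%
        mutate(filename = basename(.x))) %>%
  mutate(
    stem   = str_remove(filename, "\\.txt$"),  # drop the .txt extension
    prefix = substr(stem, 1, 3),    # e.g. "FFT"
    month  = substr(stem, 4, 6),    # e.g. "Jan"
    day    = substr(stem, 7, 8),    # e.g. "14"
    time   = substr(stem, 9, 12)    # e.g. "1138"
  ) %>%
  select(-stem)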
Julia Silge
  • This worked wonderfully - Thank you. To your code above, I added some additional mutate statements to parse the filename string to extract the metadata and to then include it in the data frame. Using my initial approach, I did find a less elegant, more "piston-driven" solution by first reading in the text files, then reading and parsing the file names, and then finally zipping the text and metadata up with an add_column. Your solution above is way more streamlined. P.S. I love your book and am going to use tidytext for my major projects. – AlanS Mar 03 '19 at 17:53
  • One other thing: once I implemented the solution, I ran into trouble trying to unnest_tokens(word, text). The error raised is: Error in check_input(x) : Input must be a character vector of any length or a list of character vectors, each of which has a length of 1. Thoughts? – AlanS Mar 03 '19 at 18:33
  • Ah, if you want to use unnest_tokens() too, check out the answer to this question: https://stackoverflow.com/questions/54850258/creating-corpus-from-multiple-txt-files/54964027#54964027 – Julia Silge Mar 04 '19 at 03:18