
I have 1,500+ .txt files named data_{date} (dates running from 2015070918 to today), all with 7 columns of data and a variable number of rows. I have managed to use the following code to extract and merge the data into one table:

files <- list.files(pattern = "\\.txt$")
myData <- lapply(files, function(x) {
  tryCatch(read.table(x, header = FALSE, sep = ","), error = function(e) NULL)
})

Note: there are no headers on the columns; currently I don't even know which variable is which!

At the moment the date appears only in the file name, so it isn't possible to distinguish between each subset of daily data. I want to add a column containing the date, which I can extract from the file name if I can get the file name into an additional column.

I searched on Stack Exchange and came across this possible solution: Importing multiple .csv files into R and adding a new column with file name

df <- do.call(rbind, lapply(files, function(x) {
  cbind(read.csv(x, header = FALSE, sep = ","), name = strsplit(x, "\\.")[[1]][1])
}))

However I get the following error:

 Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
 no lines available in input 

I have used read.csv on individual files and they have imported without any issues. Any ideas to resolve this would be greatly appreciated!

Aesler
2 Answers


This should work, provided your read.table call is correct:

myData_list <- lapply(files, function(x) {
  out <- tryCatch(read.table(x, header = F, sep = ','), error = function(e) NULL)
  if (!is.null(out)) {
    out$source_file <- x
  }
  return(out)
})

myData <- data.table::rbindlist(myData_list)

In the past I have found that you can spare yourself a lot of headaches by using data.table::fread instead of read.table. So you could consider this:

myData_list <- lapply(files, function(x) {
  out <- data.table::fread(x, header = FALSE)
  out$source_file <- x
  return(out)
})

myData <- data.table::rbindlist(myData_list)

You can add the tryCatch part back in if necessary. Depending on what the files vector looks like, it may also be worth applying basename() to the source_file column, as in the sketch below.
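For example, a minimal sketch putting both suggestions together (assuming, as described in the question, file names like data_2015070918.txt; the date-extraction pattern is an assumption about that naming scheme):

library(data.table)

files <- list.files(pattern = "\\.txt$", full.names = TRUE)

myData_list <- lapply(files, function(x) {
  out <- tryCatch(fread(x, header = FALSE), error = function(e) NULL)
  if (!is.null(out)) {
    # e.g. "data_2015070918.txt" -> "2015070918" (assumed naming scheme)
    out$date <- sub("^data_", "", sub("\\.txt$", "", basename(x)))
  }
  out
})

# rbindlist() skips the NULL entries left by failed reads
myData <- rbindlist(myData_list)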

JBGruber
  • Many thanks for your reply. I have one further issue, if you happen to have the time. The names are all fetched correctly until one is reached that has " ' " as its first character, which causes all subsequent rows to become part of that cell. Any ideas how to overcome this? – Aesler Nov 20 '19 at 19:17
  • What names do you mean? In the table? File names? I think you can set a custom quote character in both `read.table` and `fread`. Or you can read the file in via `readLines()`, remove the values that cause an error and then feed the character object to `fread` as `text` argument instead of `file`. If this doesn't help I would recommend you open a new question with a reproducible example. – JBGruber Nov 21 '19 at 10:55
  • It is in the name column, where one of the entries is " 't name " (so with an apostrophe as the first character). All the rows after the point where this occurs are compiled into the cell with the " 't name ", and the remaining column values of that row change to NA. – Aesler Nov 21 '19 at 12:06
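A minimal sketch of the quote fix suggested in the comment above (the file name here is hypothetical):

# Hypothetical file containing a field that starts with an apostrophe
f <- "data_2015070918.txt"

# read.table: an empty quote string disables quote handling entirely,
# so a leading ' is read as a literal character
out1 <- read.table(f, header = FALSE, sep = ",", quote = "")

# fread equivalent: quote = "" likewise turns quoting off
out2 <- data.table::fread(f, header = FALSE, quote = "")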

You could try using lapply with an index corresponding to each of the files:

files <- list.files(pattern = "\\.txt$")
myData <- lapply(seq_along(files), function(x) {
    tryCatch(
        {
            dt <- read.table(files[x], header = F, sep = ',')
            dt$index <- x   # or files[x] if you want to use the file name instead
            dt
        },
        error=function(e) { NULL }
    )
})
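
To end up with a single table rather than a list, you can then stack the pieces; a minimal sketch:

# Drop the NULL entries left by failed reads, then stack the rest
myData <- do.call(rbind, Filter(Negate(is.null), myData))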
Tim Biegeleisen