
I have numerous csv files in multiple directories that I want to read into an R tibble or data.table. I use `list.files()` with the recursive argument set to TRUE to create a vector of file names and paths, then use `lapply()` to read in the csv files, and then `bind_rows()` to stick them all together:

filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
tbl <- lapply(filenames, read_csv) %>% 
  bind_rows()

This approach works fine. However, I need to extract a substring from each file name and add it as a column to the final table. I can get the substring I need with `str_extract()` like this:

sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}")

I am stuck, however, on how to add the extracted substring as a column while `lapply()` runs `read_csv()` for each file.

zx8754
kray

6 Answers


I generally use the following approach, based on dplyr/tidyr:

library(dplyr)
library(tidyr)
library(readr)

data = tibble(File = files) %>%
    extract(File, "Site", "([A-Z]{2}-[A-Za-z0-9]{3})", remove = FALSE) %>%
    mutate(Data = lapply(File, read_csv)) %>%
    unnest(Data) %>%
    select(-File)
Konrad Rudolph
  • Very slick. Thank you. I find mutate() and extract() and unnest() a little difficult to follow, but it worked great! Also, what is "select(-File)" doing? – kray Sep 19 '17 at 13:26
  • @kray It’s just removing the `File` column, after we’re done with it. As for finding the workflow difficult to follow: it’s temporary. Once you get used to the dplyr pipeline flow, this will become effortless. That said, I’ve now rearranged the expressions to put the data reading code together. – Konrad Rudolph Sep 19 '17 at 13:34
  • How might this be done if we are reading plain text files -- in my case, output reports from a batch process? I'd like to stack my batch output files, then read each to determine if an error is present. – Ben Mar 03 '22 at 14:26
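Regarding the last comment: the same pipeline works for plain text by swapping `read_csv` for `read_lines` and flagging files whose output contains an error marker. A sketch only: the report files and the "ERROR" marker string below are hypothetical stand-ins, not from the original question.

```r
library(tibble)
library(dplyr)
library(purrr)
library(readr)
library(stringr)

# stand-in batch reports, created here purely for illustration
rpt_dir <- file.path(tempdir(), "reports")
dir.create(rpt_dir, showWarnings = FALSE)
writeLines(c("step 1 ok", "ERROR: step 2 failed"), file.path(rpt_dir, "a.txt"))
writeLines(c("step 1 ok", "all done"),             file.path(rpt_dir, "b.txt"))
files <- sort(list.files(rpt_dir, full.names = TRUE))

reports <- tibble(File = files) %>%
  mutate(Lines = map(File, read_lines)) %>%                          # one character vector per file
  mutate(has_error = map_lgl(Lines, ~ any(str_detect(.x, "ERROR"))))
```

The `Lines` list-column keeps the raw text around, so further checks (extracting the error message, counting warnings) can be added as more `mutate()` steps.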

tidyverse approach:

Update:

read_csv from readr 2.0 (and later) has built-in support for reading a vector of files with the same columns into one output table in a single command: just pass the vector of file names to the reading function. For example, reading in csv files:

(files <- fs::dir_ls("D:/data", glob = "*.csv"))  # glob syntax, so no "$" anchor
dat <- read_csv(files, id="path")
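Since the question needs the site code rather than the full path, the `id` column can be post-processed with `str_extract`. A sketch using the pattern from the question; the sample csv files below are stand-ins created only for illustration:

```r
library(fs)
library(readr)   # readr >= 2.0 for the `id` argument
library(dplyr)
library(stringr)

# stand-in csv files, created here purely for illustration
csv_dir <- file.path(tempdir(), "sites")
dir.create(csv_dir, showWarnings = FALSE)
write_csv(tibble(x = 1:2), file.path(csv_dir, "AB-001.csv"))
write_csv(tibble(x = 3:4), file.path(csv_dir, "CD-x99.csv"))

files <- fs::dir_ls(csv_dir, glob = "*.csv")
dat <- read_csv(files, id = "path") %>%
  mutate(site = str_extract(path, "[A-Z]{2}-[A-Za-z0-9]{3}"))
```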

Alternatively, using purrr: add the filename with the `.id = "source"` argument of `purrr::map_dfr()`. An example loading .csv files:

library(here)
library(fs)
library(purrr)
library(readr)
library(dplyr)
library(stringr)

# specify the directory, then list the csv files in it
data_dir <- here("file/path")
data_list <- fs::dir_ls(data_dir, regexp = "\\.csv$")

# return a single data frame with purrr::map_dfr()
my_data <- data_list %>%
  purrr::map_dfr(read_csv, .id = "source")

# alternatively, strip the directory from source, leaving just the file name
my_data <- data_list %>%
  purrr::map_dfr(read_csv, .id = "source") %>%
  dplyr::mutate(source = stringr::str_replace(source, "file/path", ""))
  
derelict

You could use `purrr::map2` here, which works similarly to `mapply`:

library(purrr)
library(dplyr)
library(readr)
library(stringr)

filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}")  # same length as filenames

stopifnot(length(filenames) == length(sites))  # returns an error if not the same length
ans <- map2(filenames, sites, ~ read_csv(.x) %>% mutate(id = .y))  # .x is an element of filenames, .y the matching element of sites

The output of map2 is a list, similar to lapply; combine it into a single table with `bind_rows(ans)`.

Released versions of purrr now include imap, a wrapper around map2 that supplies the names (or indices) of the input as the second argument.
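For example, if the `filenames` vector carries the site codes as names, `imap()` passes each name as `.y`. A sketch of that variant; the sample csv files below are stand-ins created only for illustration:

```r
library(purrr)
library(readr)
library(dplyr)
library(stringr)

# stand-in csv files, created here purely for illustration
site_dir <- file.path(tempdir(), "imap_demo")
dir.create(site_dir, showWarnings = FALSE)
write_csv(tibble(v = 1), file.path(site_dir, "AB-001_data.csv"))
write_csv(tibble(v = 2), file.path(site_dir, "CD-x99_data.csv"))

filenames <- sort(list.files(site_dir, full.names = TRUE, pattern = "\\.csv$"))
sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}")

# name the vector, then imap supplies each name as .y
names(filenames) <- sites
ans <- imap(filenames, ~ read_csv(.x) %>% mutate(id = .y))
tbl <- bind_rows(ans)
```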

CPak

data.table approach:

If you name the list, you can use the names to fill an id column in the data.table when binding the list together.

workflow

library(data.table)

files <- list.files( whatever... )
# read the files from the list
l <- lapply( files, fread )
# name the list elements using the basenames of `files`
# this is also the step to manipulate the file names however you like
names( l ) <- basename( files )
# bind the rows from the list together, putting the file names into the column "id"
dt <- rbindlist( l, idcol = "id" )
Wimpel

You just need to write your own function that reads the csv and adds the column you want, before combining them.

library(readr)
library(stringr)
library(dplyr)

my_read_csv <- function(x) {
  out <- read_csv(x)
  site <- str_extract(x, "[A-Z]{2}-[A-Za-z0-9]{3}")
  cbind(Site = site, out)
}

filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
tbl <- lapply(filenames, my_read_csv) %>% bind_rows()
Aaron left Stack Overflow

You can build a vector of site names with exactly the same length as `tbl` (one entry per row) and then combine the two using `cbind`:

### Get file names
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}")

### Read each csv once, keeping the list
csv_list <- lapply(filenames, read_csv)

### Get length of each csv
file_lengths <- unlist(lapply(csv_list, nrow))

### Repeat sites using lengths
file_names <- rep(sites, file_lengths)

### Create table
tbl <- bind_rows(csv_list)

### Combine file_names and tbl
tbl <- cbind(tbl, filename = file_names)
waskuf