Read all csv files in a directory and add the name of each file in a new column

Question

I have this code that reads all CSV files in a directory.

nm <- list.files()

df <- do.call(rbind, lapply(nm, function(x) read_delim(x,';',col_names = T)))

I want to modify it in a way that appends the filename to the data. The result would be a single data frame that has all the CSV files, and inside the data frame, there is a column that specifies from which file the data came. How to do it?

Adding a `dplyr::mutate` after your `read_delim` should do the job assuming you have installed the `dplyr` package. `do.call(rbind, lapply(nm, function(x) read_delim(x,';',col_names = T) %>% dplyr::mutate(filename = x)))` — benson23, Jan 09 '23 at 04:15

score 5 · Accepted Answer · answered Jan 09 '23 at 04:20

Instead of do.call(rbind, lapply(...)), you can use purrr::map_dfr() with the .id argument:

library(readr)
library(purrr)

df <- list.files() |>
  set_names() |>
  map_dfr(read_delim, .id = "file")

df

# A tibble: 9 × 3
  file    col1  col2
  <chr>  <dbl> <dbl>
1 f1.csv     1     4
2 f1.csv     2     5
3 f1.csv     3     6
4 f2.csv     1     4
5 f2.csv     2     5
6 f2.csv     3     6
7 f3.csv     1     4
8 f3.csv     2     5
9 f3.csv     3     6

Example data:

for (f in c("f1.csv", "f2.csv", "f3.csv")) {
  readr::write_delim(data.frame(col1 = 1:3, col2 = 4:6), f, ";")
}

Notice [`purrr::map_dfr()` was superceded](https://stackoverflow.com/a/75062136/1082435) recently by the developers. — wibeasley, Jan 09 '23 at 19:35

wibeasley · Answer 2 · 2023-01-09T04:42:45.610

readr::read_csv() can accept a vector of file names. The id parameter is "the name of a column in which to store the file path. This is useful when reading multiple input files and there is data in the file paths, such as the data collection date."

nm |> 
  readr::read_csv(
    id = "file_path"
  )

I see other answers use file name without the directory. If that's desired, consider using functions built for file manipulation, instead of regexes, unless you're sure the file names & paths are always well-behaved.

nm |> 
  readr::read_csv(
    id = "file_path"
  ) |> 
  dplyr::mutate(
    file_name_1 = basename(file_path),                     # If you want the extension
    file_name_2 = tools::file_path_sans_ext(file_name_1),  # If you don't
  )

wibeasley · Answer 3 · 2023-01-10T14:37:55.323

For conventional scenarios, I prefer for readr to loop through the csvs by itself. But there some scenarios where it helps to process files individually before stacking them together.

A few weeks ago, purrr 1.0's map_dfr() function was "superseded in favour of using the appropriate map function along with list_rbind()".

@zephryl's snippet is slightly modified to become

list.files() |>
  rlang::set_names() |>
  purrr::map(readr::read_delim) |> 
  # { possibly process files here before stacking/binding } |>
  purrr::list_rbind(names_to = "file")

The functions were superseded in purrr 1.0.0 because their names suggest they work like _lgl(), _int(), etc which require length 1 outputs, but actually they return results of any size because the results are combined without any size checks. Additionally, they use dplyr::bind_rows() and dplyr::bind_cols() which require dplyr to be installed and have confusing semantics with edge cases. Superseded functions will not go away, but will only receive critical bug fixes.

Instead, we recommend using map(), map2(), etc with list_rbind() and list_cbind(). These use vctrs::vec_rbind() and vctrs::vec_cbind() under the hood, and have names that more clearly reflect their semantics.

Source: https://purrr.tidyverse.org/reference/map_dfr.html

Carlos Luis Rivera · Answer 4 · 2023-01-09T04:30:35.997

Here is another solution using purrr, which removes the file extention from the value in the column filename.

library(tidyverse)

nm <- list.files(pattern = "\\.csv$")

df <- map_dfr(
  .x = nm,
  ~ read.csv(.x) %>%
    mutate(
      filename = stringr::str_replace(
        .x,
        "\\.csv$",
        ""
      )
    )
)

View(df)

EDIT

Actually you can still removes the file extention from the column for the file names when you apply @zephryl's approach by adding a mutate() process as follows:

df <- nm %>%
  set_names() %>%
  map_dfr(read_delim, .id = "file") %>%
  mutate(
    file = stringr::str_replace(
      file,
      "\\.csv$",
      ""
    )
  )

Darren Tsai · Answer 5 · 2023-01-09T04:50:13.180

2

You can use bind_rows() from dplyr and supply the argument .id that creates a new column of identifiers to link each row to its original data frame.

df <- dplyr::bind_rows(
  lapply(setNames(nm, basename(nm)), read_csv2),
  .id = 'src'
)

The use of basename() removes the directory paths prepended to the file names.

edited Jan 09 '23 at 04:50

answered Jan 09 '23 at 04:25

Darren Tsai

32,117
5
21
51

Read all csv files in a directory and add the name of each file in a new column

5 Answers5

EDIT