4

I have this code that reads all CSV files in a directory.

nm <- list.files()

df <- do.call(rbind, lapply(nm, function(x) read_delim(x,';',col_names = T)))

I want to modify it in a way that appends the filename to the data. The result would be a single data frame that has all the CSV files, and inside the data frame, there is a column that specifies from which file the data came. How to do it?

wibeasley
  • 5,000
  • 3
  • 34
  • 62
GitZine
  • 445
  • 3
  • 14
  • 1
    Adding a `dplyr::mutate` after your `read_delim` should do the job assuming you have installed the `dplyr` package. `do.call(rbind, lapply(nm, function(x) read_delim(x,';',col_names = T) %>% dplyr::mutate(filename = x)))` – benson23 Jan 09 '23 at 04:15

5 Answers5

5

Instead of do.call(rbind, lapply(...)), you can use purrr::map_dfr() with the .id argument:

library(readr)
library(purrr)

df <- list.files() |>
  set_names() |>
  map_dfr(read_delim, .id = "file")

df
# A tibble: 9 × 3
  file    col1  col2
  <chr>  <dbl> <dbl>
1 f1.csv     1     4
2 f1.csv     2     5
3 f1.csv     3     6
4 f2.csv     1     4
5 f2.csv     2     5
6 f2.csv     3     6
7 f3.csv     1     4
8 f3.csv     2     5
9 f3.csv     3     6

Example data:

for (f in c("f1.csv", "f2.csv", "f3.csv")) {
  readr::write_delim(data.frame(col1 = 1:3, col2 = 4:6), f, ";")
}
zephryl
  • 14,633
  • 3
  • 11
  • 30
  • 1
    Notice [`purrr::map_dfr()` was superceded](https://stackoverflow.com/a/75062136/1082435) recently by the developers. – wibeasley Jan 09 '23 at 19:35
4

readr::read_csv() can accept a vector of file names. The id parameter is "the name of a column in which to store the file path. This is useful when reading multiple input files and there is data in the file paths, such as the data collection date."

nm |> 
  readr::read_csv(
    id = "file_path"
  )

I see other answers use file name without the directory. If that's desired, consider using functions built for file manipulation, instead of regexes, unless you're sure the file names & paths are always well-behaved.

nm |> 
  readr::read_csv(
    id = "file_path"
  ) |> 
  dplyr::mutate(
    file_name_1 = basename(file_path),                     # If you want the extension
    file_name_2 = tools::file_path_sans_ext(file_name_1),  # If you don't
  ) 
wibeasley
  • 5,000
  • 3
  • 34
  • 62
3

For conventional scenarios, I prefer for readr to loop through the csvs by itself. But there some scenarios where it helps to process files individually before stacking them together.

A few weeks ago, purrr 1.0's map_dfr() function was "superseded in favour of using the appropriate map function along with list_rbind()".

@zephryl's snippet is slightly modified to become

list.files() |>
  rlang::set_names() |>
  purrr::map(readr::read_delim) |> 
  # { possibly process files here before stacking/binding } |>
  purrr::list_rbind(names_to = "file")

The functions were superseded in purrr 1.0.0 because their names suggest they work like _lgl(), _int(), etc which require length 1 outputs, but actually they return results of any size because the results are combined without any size checks. Additionally, they use dplyr::bind_rows() and dplyr::bind_cols() which require dplyr to be installed and have confusing semantics with edge cases. Superseded functions will not go away, but will only receive critical bug fixes.

Instead, we recommend using map(), map2(), etc with list_rbind() and list_cbind(). These use vctrs::vec_rbind() and vctrs::vec_cbind() under the hood, and have names that more clearly reflect their semantics.

Source: https://purrr.tidyverse.org/reference/map_dfr.html

wibeasley
  • 5,000
  • 3
  • 34
  • 62
2

Here is another solution using purrr, which removes the file extention from the value in the column filename.

library(tidyverse)

nm <- list.files(pattern = "\\.csv$")

df <- map_dfr(
  .x = nm,
  ~ read.csv(.x) %>%
    mutate(
      filename = stringr::str_replace(
        .x,
        "\\.csv$",
        ""
      )
    )
)

View(df)

EDIT

Actually you can still removes the file extention from the column for the file names when you apply @zephryl's approach by adding a mutate() process as follows:

df <- nm %>%
  set_names() %>%
  map_dfr(read_delim, .id = "file") %>%
  mutate(
    file = stringr::str_replace(
      file,
      "\\.csv$",
      ""
    )
  )
Carlos Luis Rivera
  • 3,108
  • 18
  • 45
2

You can use bind_rows() from dplyr and supply the argument .id that creates a new column of identifiers to link each row to its original data frame.

df <- dplyr::bind_rows(
  lapply(setNames(nm, basename(nm)), read_csv2),
  .id = 'src'
)

The use of basename() removes the directory paths prepended to the file names.

Darren Tsai
  • 32,117
  • 5
  • 21
  • 51