1

Dear R users and programmers: I have tens of CSV files with similar col and row names but in different folders. I want to merge all of them while using the folder names and CSV file names as identifiers. Is there any way to do that in R? Thanks

hema
  • 725
  • 1
  • 8
  • 20
  • 1
    Use `list.files` or `list.dirs` to create a list of the files you want. Note that the `full.names` argument can give full paths. Then to read the data, you can use [vroom](https://www.tidyverse.org/blog/2019/05/vroom-1-0-0/#reading-multiple-files), or you can use `readr::read_csv` together with `purrr::map_dfr` or `purrr::bind_rows` – Hauke L. Feb 04 '23 at 09:33
  • `readr::read_csv` accepts list of filesnames and returns a single row-binded tibble, for storing filenames in result there's `id` param. Though... **similar** col names, not identical? – margusl Feb 04 '23 at 10:51
  • Can you post a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) (or as reproducible as possible) so we can get a sense of *how* similar the column names are and what the folder names look like? I answered a similar question [here](https://stackoverflow.com/questions/74671514/adding-an-identify-original-csv-column-in-r-markdown/74671763#74671763), but your column names not being identical and the extra level of folders will make things tricky. – jrcalabrese Feb 04 '23 at 17:37
  • Thank you, Col's names are identical, and Charles Gallagher created a reproducible example – hema Feb 06 '23 at 06:56

1 Answers1

1

Create data for a reproducible example

example_data <- mtcars
dirs <- c("dir_1", "dir_2", "dir_3", "dir_4")

df_list <- split(example_data, factor(sort(rank(row.names(example_data))%%4)))
for(i in c(1:4)){
  dir.create(dirs[i])
  write.csv(df_list[[i]],paste0(dirs[i],"/mtcars_",i,".csv"))}

Build data frame from files in multiple directories

# helper function to read a csv and 
# add a column with the label of the data source

read_label <- function(files){
  df <-read.csv(files)
  df[,ncol(df)+1] <- files
  names(df)[ncol(df)]<- "data_source"
  return(df)
}

# List the files 
files <- list.files(path = ".",pattern = "*.csv",recursive = TRUE)

# read files with labels into memory  
do.call(rbind,lapply(files, read_label))



                    X  mpg cyl  disp  hp drat    wt  qsec vs am gear carb        data_source
1            Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4 dir_1/mtcars_1.csv
2        Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4 dir_1/mtcars_1.csv
3           Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1 dir_1/mtcars_1.csv
4       Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1 dir_1/mtcars_1.csv
5    Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2 dir_1/mtcars_1.csv
6              Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1 dir_1/mtcars_1.csv
7           Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4 dir_1/mtcars_1.csv
8            Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2 dir_1/mtcars_1.csv
9             Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2 dir_2/mtcars_2.csv
10            Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4 dir_2/mtcars_2.csv
  • 1
    Thank you so much for your answer and for creating an example data set. However, my problem is adding the files and folder names as identifiers in the merged excel sheet. I need to be able to identify the source of the data after merging. – hema Feb 06 '23 at 06:53
  • Ok, I think that should do it! – Charles Gallagher Feb 09 '23 at 03:36