I'm working on some workflow to join .csv
files in a directory based on conditions met within the character strings of the files themselves.
For example, lets say I have the following files:
/path/Dataset_1/Cohort_1/Grouping_1_Data.csv
/path/Dataset_2/Cohort_1/Grouping_2_Data.csv
/path/Dataset_1/Cohort_2/Grouping_3_Data.csv
/path/Dataset_2/Cohort_2/Grouping_4_Data.csv
/path/Dataset_1/Cohort_3/Grouping_5_Data.csv
/path/Dataset_2/Cohort_3/Grouping_6_Data.csv
/path/Dataset_1/Cohort_4/Grouping_7_Data.csv
/path/Dataset_2/Cohort_4/Grouping_8_Data.csv
And they look something like this: when loaded (in order of appearance):
[[1]]
ID Binary Continuous date1 date2 character
1 101 0 0 20-02-01 21-01-01 CHAR
2 102 0 2 21-03-01 22-01-01 CHAR
3 103 0 5 22-04-01 23-01-01 CHAR
[[2]]
ID Binary Continuous date1 date2 character
1 102 0 2 20-03-01 22-01-01 CHAR
2 103 0 5 21-04-01 23-01-01 CHAR
3 104 0 7 22-05-01 23-01-01 CHAR
[[3]]
ID Binary Continuous date1 date2 character
1 105 0 0 20-02-01 21-01-01 CHAR
2 106 0 2 21-03-01 22-01-01 CHAR
3 107 0 5 22-04-01 23-01-01 CHAR
[[4]]
ID Binary Continuous date1 date2 character
1 106 0 2 20-03-01 22-01-01 CHAR
2 107 0 5 21-04-01 23-01-01 CHAR
3 108 0 7 22-05-01 23-01-01 CHAR
[[5]]
ID Binary Continuous date1 date2 character
1 109 0 0 20-02-01 21-01-01 CHAR
2 110 0 2 21-03-01 22-01-01 CHAR
3 111 0 5 22-04-01 23-01-01 CHAR
[[6]]
ID Binary Continuous date1 date2 character
1 110 0 2 20-03-01 22-01-01 CHAR
2 111 0 5 21-04-01 23-01-01 CHAR
3 112 0 7 22-05-01 23-01-01 CHAR
[[7]]
ID Binary Continuous date1 date2 character
1 113 0 0 20-02-01 21-01-01 CHAR
2 114 0 2 21-03-01 22-01-01 CHAR
3 115 0 5 22-04-01 23-01-01 CHAR
[[8]]
ID Binary Continuous date1 date2 character
1 114 0 2 20-03-01 22-01-01 CHAR
2 115 0 5 21-04-01 23-01-01 CHAR
3 116 0 7 22-05-01 23-01-01 CHAR
And I want to simultaneously join them with something like lapply
or map_df
with read.csv
and a list of all the file paths (csvdfs):
outdf <- csvdfs %>% map_df(~read.csv(.))
But I would also like to put in some conditionals in a call to either lapply
or map_df
to add extra columns to the data with the relevant data as part of merging that contain the dataset,
cohort, and grouping information so that the final product looks something like this:
ID Dataset Cohort Grouping Binary Continuous date1 date2 character
101 D1 C1 G1 0 0 20-02-01 21-01-01 CHAR
102 D1 C1 G1 0 2 21-03-01 22-01-01 CHAR
103 D1 C1 G1 0 5 22-04-01 23-01-01 CHAR
102 D2 C1 G2 0 2 20-03-01 22-01-01 CHAR
103 D2 C1 G2 0 5 21-04-01 23-01-01 CHAR
104 D2 C1 G2 0 7 22-05-01 23-01-01 CHAR
105 D1 C2 G3 0 0 20-02-01 21-01-01 CHAR
106 D1 C2 G3 0 2 21-03-01 22-01-01 CHAR
107 D1 C2 G3 0 5 22-04-01 23-01-01 CHAR
106 D2 C2 G4 0 2 20-03-01 22-01-01 CHAR
107 D2 C2 G4 0 5 21-04-01 23-01-01 CHAR
108 D2 C2 G4 0 7 22-05-01 23-01-01 CHAR
109 D1 C3 G5 0 0 20-02-01 21-01-01 CHAR
110 D1 C3 G5 0 2 21-03-01 22-01-01 CHAR
111 D1 C3 G5 0 5 22-04-01 23-01-01 CHAR
110 D2 C3 G6 0 2 20-03-01 22-01-01 CHAR
111 D2 C3 G6 0 5 21-04-01 23-01-01 CHAR
112 D2 C3 G6 0 7 22-05-01 23-01-01 CHAR
113 D1 C4 G7 0 0 20-02-01 21-01-01 CHAR
114 D1 C4 G7 0 2 21-03-01 22-01-01 CHAR
115 D1 C4 G7 0 5 22-04-01 23-01-01 CHAR
114 D2 C4 G8 0 2 20-03-01 22-01-01 CHAR
115 D2 C4 G8 0 5 21-04-01 23-01-01 CHAR
116 D2 C4 G8 0 7 22-05-01 23-01-01 CHAR
Not unlike the calls in these examples:
- Using lapply and the ifelse function in R
- How to add a factor column to dataframe based on a conditional statement from another column?
- https://www.cnblogs.com/drvongoosewing/p/15859210.html
But with the presence or absences of strings in the filename (e.g. grepl
or str_detect
Group_1, Dataset_1), or a rep()
call to fill an entire column with a value based on the presence or absence of a string in the filename from the incoming csv prior to joining.
Any help or suggestions which would accomplish this outside of using a loop would be greatly appreciated!
Some code to make some csv files in directories and list of characters for said files if needed:
library(tidyverse)
library(stringr)
file_list <- c("/path/Dataset_1/Cohort_1/Grouping_1_Data.csv", "/path/Dataset_2/Cohort_1/Grouping_2_Data.csv", "/path/Dataset_1/Cohort_2/Grouping_3_Data.csv", "/path/Dataset_2/Cohort_2/Grouping_4_Data.csv", "/path/Dataset_1/Cohort_3/Grouping_5_Data.csv", "/path/Dataset_2/Cohort_3/Grouping_6_Data.csv", "/path/Dataset_1/Cohort_4/Grouping_7_Data.csv", "/path/Dataset_2/Cohort_4/Grouping_8_Data.csv")
dir.create(paste0(getwd(),paste(unique(dirname(dirname(dirname(file_list)))))))
lapply(paste0(getwd(),paste(unique(dirname(dirname(file_list))))),dir.create)
lapply(paste0(getwd(),paste(unique(dirname(file_list)))),dir.create)
csvdfs <- list()
csvdfs[[1]] <- structure(list(ID = 101:103, Binary = c(0L, 0L, 0L), Continuous = c(0L, 2L, 5L), date1 = c("20-02-01", "21-03-01", "22-04-01"), date2 = c("21-01-01", "22-01-01", "23-01-01"), character = c("CHAR", "CHAR", "CHAR")), class = "data.frame", row.names = c(NA, -3L))
csvdfs[[2]] <- structure(list(ID = 102:104, Binary = c(0L, 0L, 0L), Continuous = c(2L, 5L, 7L), date1 = c("20-03-01", "21-04-01", "22-05-01"), date2 = c("22-01-01", "23-01-01", "23-01-01"), character = c("CHAR", "CHAR", "CHAR")), class = "data.frame", row.names = c(NA, -3L))
csvdfs[[3]] <- structure(list(ID = 105:107, Binary = c(0L, 0L, 0L), Continuous = c(0L, 2L, 5L), date1 = c("20-02-01", "21-03-01", "22-04-01"), date2 = c("21-01-01", "22-01-01", "23-01-01"), character = c("CHAR", "CHAR", "CHAR")), class = "data.frame", row.names = c(NA, -3L))
csvdfs[[4]] <- structure(list(ID = 106:108, Binary = c(0L, 0L, 0L), Continuous = c(2L, 5L, 7L), date1 = c("20-03-01", "21-04-01", "22-05-01"), date2 = c("22-01-01", "23-01-01", "23-01-01"), character = c("CHAR", "CHAR", "CHAR")), class = "data.frame", row.names = c(NA, -3L))
csvdfs[[5]] <- structure(list(ID = 109:111, Binary = c(0L, 0L, 0L), Continuous = c(0L, 2L, 5L), date1 = c("20-02-01", "21-03-01", "22-04-01"), date2 = c("21-01-01", "22-01-01", "23-01-01"), character = c("CHAR", "CHAR", "CHAR")), class = "data.frame", row.names = c(NA, -3L))
csvdfs[[6]] <- structure(list(ID = 110:112, Binary = c(0L, 0L, 0L), Continuous = c(2L, 5L, 7L), date1 = c("20-03-01", "21-04-01", "22-05-01"), date2 = c("22-01-01", "23-01-01", "23-01-01"), character = c("CHAR", "CHAR", "CHAR")), class = "data.frame", row.names = c(NA, -3L))
csvdfs[[7]] <- structure(list(ID = 113:115, Binary = c(0L, 0L, 0L), Continuous = c(0L, 2L, 5L), date1 = c("20-02-01", "21-03-01", "22-04-01"), date2 = c("21-01-01", "22-01-01", "23-01-01"), character = c("CHAR", "CHAR", "CHAR")), class = "data.frame", row.names = c(NA, -3L))
csvdfs[[8]] <- structure(list(ID = 114:116, Binary = c(0L, 0L, 0L), Continuous = c(2L, 5L, 7L), date1 = c("20-03-01", "21-04-01", "22-05-01"), date2 = c("22-01-01", "23-01-01", "23-01-01"), character = c("CHAR", "CHAR", "CHAR")), class = "data.frame", row.names = c(NA, -3L))
for (outputs in 1:length(file_list)){
write.csv(csvdfs[[outputs]],file=paste0(getwd(),file_list[[outputs]]))
}