1

I'm working on some workflow to join .csv files in a directory based on conditions met within the character strings of the files themselves.

For example, lets say I have the following files:

/path/Dataset_1/Cohort_1/Grouping_1_Data.csv
/path/Dataset_2/Cohort_1/Grouping_2_Data.csv
/path/Dataset_1/Cohort_2/Grouping_3_Data.csv
/path/Dataset_2/Cohort_2/Grouping_4_Data.csv
/path/Dataset_1/Cohort_3/Grouping_5_Data.csv
/path/Dataset_2/Cohort_3/Grouping_6_Data.csv
/path/Dataset_1/Cohort_4/Grouping_7_Data.csv
/path/Dataset_2/Cohort_4/Grouping_8_Data.csv

And they look something like this: when loaded (in order of appearance):

[[1]]
   ID Binary Continuous    date1    date2 character
1 101      0          0 20-02-01 21-01-01      CHAR
2 102      0          2 21-03-01 22-01-01      CHAR
3 103      0          5 22-04-01 23-01-01      CHAR

[[2]]
   ID Binary Continuous    date1    date2 character
1 102      0          2 20-03-01 22-01-01      CHAR
2 103      0          5 21-04-01 23-01-01      CHAR
3 104      0          7 22-05-01 23-01-01      CHAR

[[3]]
   ID Binary Continuous    date1    date2 character
1 105      0          0 20-02-01 21-01-01      CHAR
2 106      0          2 21-03-01 22-01-01      CHAR
3 107      0          5 22-04-01 23-01-01      CHAR

[[4]]
   ID Binary Continuous    date1    date2 character
1 106      0          2 20-03-01 22-01-01      CHAR
2 107      0          5 21-04-01 23-01-01      CHAR
3 108      0          7 22-05-01 23-01-01      CHAR

[[5]]
   ID Binary Continuous    date1    date2 character
1 109      0          0 20-02-01 21-01-01      CHAR
2 110      0          2 21-03-01 22-01-01      CHAR
3 111      0          5 22-04-01 23-01-01      CHAR

[[6]]
   ID Binary Continuous    date1    date2 character
1 110      0          2 20-03-01 22-01-01      CHAR
2 111      0          5 21-04-01 23-01-01      CHAR
3 112      0          7 22-05-01 23-01-01      CHAR

[[7]]
   ID Binary Continuous    date1    date2 character
1 113      0          0 20-02-01 21-01-01      CHAR
2 114      0          2 21-03-01 22-01-01      CHAR
3 115      0          5 22-04-01 23-01-01      CHAR

[[8]]
   ID Binary Continuous    date1    date2 character
1 114      0          2 20-03-01 22-01-01      CHAR
2 115      0          5 21-04-01 23-01-01      CHAR
3 116      0          7 22-05-01 23-01-01      CHAR

And I want to simultaneously join them with something like lapply or map_df with read.csv and a list of all the file paths (csvdfs):

outdf <- csvdfs %>% map_df(~read.csv(.))

But I would also like to put in some conditionals in a call to either lapply or map_df to add extra columns to the data with the relevant data as part of merging that contain the dataset, cohort, and grouping information so that the final product looks something like this:

ID Dataset Cohort Grouping Binary Continuous date1 date2 character
101 D1 C1  G1 0 0 20-02-01 21-01-01 CHAR
102 D1 C1  G1 0 2 21-03-01 22-01-01 CHAR
103 D1 C1  G1 0 5 22-04-01 23-01-01 CHAR
102 D2 C1  G2 0 2 20-03-01 22-01-01 CHAR
103 D2 C1  G2 0 5 21-04-01 23-01-01 CHAR
104 D2 C1  G2 0 7 22-05-01 23-01-01 CHAR
105 D1 C2  G3 0 0 20-02-01 21-01-01 CHAR
106 D1 C2  G3 0 2 21-03-01 22-01-01 CHAR
107 D1 C2  G3 0 5 22-04-01 23-01-01 CHAR
106 D2 C2  G4 0 2 20-03-01 22-01-01 CHAR
107 D2 C2  G4 0 5 21-04-01 23-01-01 CHAR
108 D2 C2  G4 0 7 22-05-01 23-01-01 CHAR
109 D1 C3  G5 0 0 20-02-01 21-01-01 CHAR
110 D1 C3  G5 0 2 21-03-01 22-01-01 CHAR
111 D1 C3  G5 0 5 22-04-01 23-01-01 CHAR
110 D2 C3  G6 0 2 20-03-01 22-01-01 CHAR
111 D2 C3  G6 0 5 21-04-01 23-01-01 CHAR
112 D2 C3  G6 0 7 22-05-01 23-01-01 CHAR
113 D1 C4  G7 0 0 20-02-01 21-01-01 CHAR
114 D1 C4  G7 0 2 21-03-01 22-01-01 CHAR
115 D1 C4  G7 0 5 22-04-01 23-01-01 CHAR
114 D2 C4  G8 0 2 20-03-01 22-01-01 CHAR
115 D2 C4  G8 0 5 21-04-01 23-01-01 CHAR
116 D2 C4  G8 0 7 22-05-01 23-01-01 CHAR

Not unlike the calls in these examples:

But with the presence or absences of strings in the filename (e.g. grepl or str_detect Group_1, Dataset_1), or a rep() call to fill an entire column with a value based on the presence or absence of a string in the filename from the incoming csv prior to joining.

Any help or suggestions which would accomplish this outside of using a loop would be greatly appreciated!

Some code to make some csv files in directories and list of characters for said files if needed:

library(tidyverse)
library(stringr)

file_list <- c("/path/Dataset_1/Cohort_1/Grouping_1_Data.csv", "/path/Dataset_2/Cohort_1/Grouping_2_Data.csv", "/path/Dataset_1/Cohort_2/Grouping_3_Data.csv", "/path/Dataset_2/Cohort_2/Grouping_4_Data.csv", "/path/Dataset_1/Cohort_3/Grouping_5_Data.csv", "/path/Dataset_2/Cohort_3/Grouping_6_Data.csv", "/path/Dataset_1/Cohort_4/Grouping_7_Data.csv", "/path/Dataset_2/Cohort_4/Grouping_8_Data.csv")

dir.create(paste0(getwd(),paste(unique(dirname(dirname(dirname(file_list)))))))
lapply(paste0(getwd(),paste(unique(dirname(dirname(file_list))))),dir.create)
lapply(paste0(getwd(),paste(unique(dirname(file_list)))),dir.create)
csvdfs <- list()
csvdfs[[1]] <- structure(list(ID = 101:103, Binary = c(0L, 0L, 0L), Continuous = c(0L, 2L, 5L), date1 = c("20-02-01", "21-03-01", "22-04-01"), date2 = c("21-01-01", "22-01-01", "23-01-01"), character = c("CHAR", "CHAR", "CHAR")), class = "data.frame", row.names = c(NA, -3L))
csvdfs[[2]] <- structure(list(ID = 102:104, Binary = c(0L, 0L, 0L), Continuous = c(2L, 5L, 7L), date1 = c("20-03-01", "21-04-01", "22-05-01"), date2 = c("22-01-01", "23-01-01", "23-01-01"), character = c("CHAR", "CHAR", "CHAR")), class = "data.frame", row.names = c(NA, -3L))
csvdfs[[3]] <- structure(list(ID = 105:107, Binary = c(0L, 0L, 0L), Continuous = c(0L, 2L, 5L), date1 = c("20-02-01", "21-03-01", "22-04-01"), date2 = c("21-01-01", "22-01-01", "23-01-01"), character = c("CHAR", "CHAR", "CHAR")), class = "data.frame", row.names = c(NA, -3L))
csvdfs[[4]] <- structure(list(ID = 106:108, Binary = c(0L, 0L, 0L), Continuous = c(2L, 5L, 7L), date1 = c("20-03-01", "21-04-01", "22-05-01"), date2 = c("22-01-01", "23-01-01", "23-01-01"), character = c("CHAR", "CHAR", "CHAR")), class = "data.frame", row.names = c(NA, -3L))
csvdfs[[5]] <- structure(list(ID = 109:111, Binary = c(0L, 0L, 0L), Continuous = c(0L, 2L, 5L), date1 = c("20-02-01", "21-03-01", "22-04-01"), date2 = c("21-01-01", "22-01-01", "23-01-01"), character = c("CHAR", "CHAR", "CHAR")), class = "data.frame", row.names = c(NA, -3L))
csvdfs[[6]] <- structure(list(ID = 110:112, Binary = c(0L, 0L, 0L), Continuous = c(2L, 5L, 7L), date1 = c("20-03-01", "21-04-01", "22-05-01"), date2 = c("22-01-01", "23-01-01", "23-01-01"), character = c("CHAR", "CHAR", "CHAR")), class = "data.frame", row.names = c(NA, -3L))
csvdfs[[7]] <- structure(list(ID = 113:115, Binary = c(0L, 0L, 0L), Continuous = c(0L, 2L, 5L), date1 = c("20-02-01", "21-03-01", "22-04-01"), date2 = c("21-01-01", "22-01-01", "23-01-01"), character = c("CHAR", "CHAR", "CHAR")), class = "data.frame", row.names = c(NA, -3L))
csvdfs[[8]] <- structure(list(ID = 114:116, Binary = c(0L, 0L, 0L), Continuous = c(2L, 5L, 7L), date1 = c("20-03-01", "21-04-01", "22-05-01"), date2 = c("22-01-01", "23-01-01", "23-01-01"), character = c("CHAR", "CHAR", "CHAR")), class = "data.frame", row.names = c(NA, -3L))
for (outputs in 1:length(file_list)){
    write.csv(csvdfs[[outputs]],file=paste0(getwd(),file_list[[outputs]]))
    }
TDeramus
  • 77
  • 5

1 Answers1

2

Based on file_list, using strsplit to get the data with sub that needs to be inserted in the data frames, then using lapply and rbind to construct the desired result

csvdfs <- lapply(file_list, read.csv) # untested

insert <- lapply(strsplit(file_list, "/"), \(x) 
  sub("(^.).*_(\\d+).*", "\\1\\2", x[3:5]))

do.call(rbind, lapply(seq_along(csvdfs), \(x) 
  cbind(csvdfs[[x]][1], 
        Dataset = insert[[x]][1], 
        Cohort = insert[[x]][2], 
        Grouping = insert[[x]][3], 
        csvdfs[[x]][2:ncol(csvdfs[[x]])])))

output

    ID Dataset Cohort Grouping Binary Continuous    date1    date2 character
1  101      D1     C1       G1      0          0 20-02-01 21-01-01      CHAR
2  102      D1     C1       G1      0          2 21-03-01 22-01-01      CHAR
3  103      D1     C1       G1      0          5 22-04-01 23-01-01      CHAR
4  102      D2     C1       G2      0          2 20-03-01 22-01-01      CHAR
5  103      D2     C1       G2      0          5 21-04-01 23-01-01      CHAR
6  104      D2     C1       G2      0          7 22-05-01 23-01-01      CHAR
7  105      D1     C2       G3      0          0 20-02-01 21-01-01      CHAR
8  106      D1     C2       G3      0          2 21-03-01 22-01-01      CHAR
9  107      D1     C2       G3      0          5 22-04-01 23-01-01      CHAR
10 106      D2     C2       G4      0          2 20-03-01 22-01-01      CHAR
11 107      D2     C2       G4      0          5 21-04-01 23-01-01      CHAR
12 108      D2     C2       G4      0          7 22-05-01 23-01-01      CHAR
13 109      D1     C3       G5      0          0 20-02-01 21-01-01      CHAR
14 110      D1     C3       G5      0          2 21-03-01 22-01-01      CHAR
15 111      D1     C3       G5      0          5 22-04-01 23-01-01      CHAR
16 110      D2     C3       G6      0          2 20-03-01 22-01-01      CHAR
17 111      D2     C3       G6      0          5 21-04-01 23-01-01      CHAR
18 112      D2     C3       G6      0          7 22-05-01 23-01-01      CHAR
19 113      D1     C4       G7      0          0 20-02-01 21-01-01      CHAR
20 114      D1     C4       G7      0          2 21-03-01 22-01-01      CHAR
21 115      D1     C4       G7      0          5 22-04-01 23-01-01      CHAR
22 114      D2     C4       G8      0          2 20-03-01 22-01-01      CHAR
23 115      D2     C4       G8      0          5 21-04-01 23-01-01      CHAR
24 116      D2     C4       G8      0          7 22-05-01 23-01-01      CHAR

Data

file_list <- c("/path/Dataset_1/Cohort_1/Grouping_1_Data.csv", "/path/Dataset_2/Cohort_1/Grouping_2_Data.csv", 
"/path/Dataset_1/Cohort_2/Grouping_3_Data.csv", "/path/Dataset_2/Cohort_2/Grouping_4_Data.csv", 
"/path/Dataset_1/Cohort_3/Grouping_5_Data.csv", "/path/Dataset_2/Cohort_3/Grouping_6_Data.csv", 
"/path/Dataset_1/Cohort_4/Grouping_7_Data.csv", "/path/Dataset_2/Cohort_4/Grouping_8_Data.csv"
)

csvdfs <- list(structure(list(ID = 101:103, Binary = c(0L, 0L, 0L), Continuous = c(0L, 
2L, 5L), date1 = c("20-02-01", "21-03-01", "22-04-01"), date2 = c("21-01-01", 
"22-01-01", "23-01-01"), character = c("CHAR", "CHAR", "CHAR"
)), class = "data.frame", row.names = c(NA, -3L)), structure(list(
    ID = 102:104, Binary = c(0L, 0L, 0L), Continuous = c(2L, 
    5L, 7L), date1 = c("20-03-01", "21-04-01", "22-05-01"), date2 = c("22-01-01", 
    "23-01-01", "23-01-01"), character = c("CHAR", "CHAR", "CHAR"
    )), class = "data.frame", row.names = c(NA, -3L)), structure(list(
    ID = 105:107, Binary = c(0L, 0L, 0L), Continuous = c(0L, 
    2L, 5L), date1 = c("20-02-01", "21-03-01", "22-04-01"), date2 = c("21-01-01", 
    "22-01-01", "23-01-01"), character = c("CHAR", "CHAR", "CHAR"
    )), class = "data.frame", row.names = c(NA, -3L)), structure(list(
    ID = 106:108, Binary = c(0L, 0L, 0L), Continuous = c(2L, 
    5L, 7L), date1 = c("20-03-01", "21-04-01", "22-05-01"), date2 = c("22-01-01", 
    "23-01-01", "23-01-01"), character = c("CHAR", "CHAR", "CHAR"
    )), class = "data.frame", row.names = c(NA, -3L)), structure(list(
    ID = 109:111, Binary = c(0L, 0L, 0L), Continuous = c(0L, 
    2L, 5L), date1 = c("20-02-01", "21-03-01", "22-04-01"), date2 = c("21-01-01", 
    "22-01-01", "23-01-01"), character = c("CHAR", "CHAR", "CHAR"
    )), class = "data.frame", row.names = c(NA, -3L)), structure(list(
    ID = 110:112, Binary = c(0L, 0L, 0L), Continuous = c(2L, 
    5L, 7L), date1 = c("20-03-01", "21-04-01", "22-05-01"), date2 = c("22-01-01", 
    "23-01-01", "23-01-01"), character = c("CHAR", "CHAR", "CHAR"
    )), class = "data.frame", row.names = c(NA, -3L)), structure(list(
    ID = 113:115, Binary = c(0L, 0L, 0L), Continuous = c(0L, 
    2L, 5L), date1 = c("20-02-01", "21-03-01", "22-04-01"), date2 = c("21-01-01", 
    "22-01-01", "23-01-01"), character = c("CHAR", "CHAR", "CHAR"
    )), class = "data.frame", row.names = c(NA, -3L)), structure(list(
    ID = 114:116, Binary = c(0L, 0L, 0L), Continuous = c(2L, 
    5L, 7L), date1 = c("20-03-01", "21-04-01", "22-05-01"), date2 = c("22-01-01", 
    "23-01-01", "23-01-01"), character = c("CHAR", "CHAR", "CHAR"
    )), class = "data.frame", row.names = c(NA, -3L)))
Andre Wildberg
  • 12,344
  • 3
  • 12
  • 29
  • 1
    Thank you so much! Quick FYI, changing the first call to: csvdfs <- lapply(paste0(getwd(),file_list), read.csv) Makes it work. – TDeramus Aug 24 '23 at 12:31