
Below is my working code, where:

  1. I read the lists of CSV files from 9 different folders.

  2. Make each folder a list.

  3. Rename the elements of each list (drop '.csv' and so on).

  4. Join within each folder to make one big dataframe.

  5. Merge all 9 dataframes into one dataframe.

  6. Get the follow-up RIDs and the loss-to-follow-up RIDs, and their rates.

  • From as1 to as3 there are AS{n}_AREA columns, and from as4 to as9 there are AS{n}_DATA_CLASS columns.

I have mostly used ideas given here, but when I put them together the code looks redundant, and more could be done to make it tidy and readable for others.

Any ideas please? Thank you!

library(dplyr); library(plyr)
library(magrittr); library(stringr) 
library(ExclusionTable)
library(lubridate)
library(tidyverse); library(tidyr)
library(janitor)
library(survival)
library(ggsurvfit); library(gtsummary)
library(zoo)
library(tidycmprsk)

# AA cohort (2 of 3)
## as

i=1
num_fu = c(1,2,3,4,5,6,7,8,9)
as <- data.frame()
df <- data.frame()
dfs <- data.frame()
data_dir <- 'C:/Users/thepr/Documents/data/as'

assign(paste0("flnames", i), list.files(path = paste0(data_dir, i), pattern = "\\.csv", full.names = TRUE))
assign(paste0("as", i, "_list"), lapply(get(paste0("flnames", i)),
                                        function(x){base::as.data.frame(read.csv(x))}))
nm <- gsub(".csv", "", basename(eval(parse(text = paste0("flnames", i))))) %>% str_sub(., 1,6)
assign(paste0("as", i, "_list"), setNames(get(paste0("as", i, "_list")), nm))
df <- Reduce(full_join, get(paste0("as", i, "_list")))
assign(paste0("as",i), df[!duplicated(base::as.list(df))])
dfs <- df


for (i in 2:length(num_fu)){
  assign(paste0("flnames", i), list.files(path = paste0(data_dir, i), pattern = "\\.csv", full.names = TRUE))
  assign(paste0("as", i, "_list"), lapply(get(paste0("flnames", i)),
                                          function(x){base::as.data.frame(read.csv(x))}))
  nm <- gsub(".csv", "", basename(eval(parse(text = paste0("flnames", i))))) %>% str_sub(., 1,6)
  assign(paste0("as", i, "_list"), setNames(get(paste0("as", i, "_list")), nm))
  df <- Reduce(full_join, get(paste0("as", i, "_list")))
  assign(paste0("as",i), df[!duplicated(base::as.list(df))])

  dfs <- merge(dfs, df, by = "RID", all.x = TRUE)
  dfs <- dfs[!duplicated(base::as.list(dfs))]

  # flag which rows of this follow-up wave carry an RID that is present at baseline (as1)
  RID_common <- get(paste0("as", i))$RID %in% as1$RID

  if(paste0("AS", i, "_AREA") %in% colnames(get(paste0("as", i)))){
    assign(paste0("fu_", i-1), get(paste0("as", i))[RID_common, c("RID", paste0("AS", i, "_AREA"))])
    assign(paste0("fu_loss_", i-1), get(paste0("as", i))[!RID_common, c("RID", paste0("AS", i, "_AREA"))])
    # FU rate
    assign(paste0("fu_rate_", i-1), nrow(get(paste0("as", i)))/nrow(as1))
  } else if(paste0("AS", i, "_DATA_CLASS") %in% colnames(get(paste0("as", i)))){
    assign(paste0("fu_", i-1), get(paste0("as", i))[RID_common, c("RID", paste0("AS", i, "_DATA_CLASS"))])
    assign(paste0("fu_loss_", i-1), get(paste0("as", i))[!RID_common, c("RID", paste0("AS", i, "_DATA_CLASS"))])
    # FU rate
    assign(paste0("fu_rate_", i-1), nrow(get(paste0("as", i)))/nrow(as1))
  }
}
HJ WHY
    `library(dplyr); library(plyr)` is terrible, the `plyr` versions of `mutate` and `summarize` will take precedence over the more advanced `dplyr` versions and mess things up. Are you even using any `plyr` functions?? `library(magrittr)` is almost never needed unless you're using the extra fancy pipes (which you are not), `library(tidyverse)` loads `dplyr`, `stringr`, and `tidyr` so either don't load those separately or skip the tidyverse, and I don't see how you're using most of the other packages you include, are they needed? – Gregor Thomas Jun 06 '23 at 01:41
  • 4
    I sincerely hope you didn't get the advice to repeatedly use `assign()`, `get()` and `eval(parse()` from questions on this site. – joran Jun 06 '23 at 01:43
  • 3
    Moving beyond the sloppy package loading, all the `assign()` and `get()` and pasting variable names is generally frowned on as bad practice. You're using lists, but you're missing the point of using lists. I've written about this a lot at the [How to make a list of data frames?](https://stackoverflow.com/a/24376207/903061) FAQ, I'd suggest reading my answer there. I don't know how useful we can be here without a reproducible example, but reading that should hopefully set you in a better direction. – Gregor Thomas Jun 06 '23 at 01:45
  • 1
    Very general sketch: use `list.files` to generate a character vector of the file paths to the folders `as1`, `as2`, etc. Iterate over that vector and for each value use `list.files` to generate the file paths to the csv's in each folder. Use `lapply` or `purrr::map` to read the files as a list of data frames. Set the names of the elements of that list by extracting the name of each file from the file path (eg `setNames()`). – joran Jun 06 '23 at 01:56
  • @joran unfortunately, I did, and I am just beginning to understand the R way of thinking. – HJ WHY Jun 06 '23 at 02:13
  • @Gregor Thomas There are pipe uses later in the code. I will tidy up the packages ... thank you. And I will read it, thanks. – HJ WHY Jun 06 '23 at 02:14
  • 1
    I mean, I see `%>%` - which is re-exported by `dplyr` and by `tidyr` (and possibly some of the other packages you're loading), so you don't need `magrittr` for that. You only need to load `magrittr` if you're using the more specialty pipes like `%<>%`, `%T>%` and `%$%`. – Gregor Thomas Jun 06 '23 at 02:18
  • @Gregor Thomas ```as %<>% mutate(row = row_number()) %>% pivot_longer(starts_with("AS") & ends_with("_WEIGHT"), names_to = "new_name", values_to = "new_value") %>% mutate(new_value = if_else(new_value == '99999', lead(new_value), new_value), .by = row) %>% pivot_wider(names_from = new_name, values_from = new_value)``` I find %<>% more comfortable but at the cost of adding 'magrittr'. I will consider removing this as well. – HJ WHY Jun 06 '23 at 02:23
  • @joran Thank you, given the folder structure above, could you please elaborate on your general sketch? I will look into purrr::map and lapply more... – HJ WHY Jun 06 '23 at 03:47
  • 1
    @HJWHY, the gist of joran's comment is covered in much detail in GregorThomas's list-of-frames answer. – r2evans Jun 06 '23 at 12:15

2 Answers


Use `Sys.glob` and `abbreviate`. This gives a flat list rather than a list of lists. No packages are used.

data_dir <- "C:/Users/thepr/Documents/data"
pat <- file.path(data_dir, "as[1-9]", "*.csv")
files <- Sys.glob(pat)

L <- Map(read.csv, files)
names(L) <- abbreviate(basename(names(L)), 6)

or maybe use these names instead of the last line:

names(L) <- paste(basename(dirname(files)), basename(files), sep = ".") |>
  abbreviate(6)
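
If a nested list (one element per as1 ... as9 folder) is preferred instead of a flat one, the result can be split by its source folder. A minimal sketch building on the code above (`L_nested` is just an illustrative name):

L_nested <- split(L, basename(dirname(files)))   # one sub-list of data frames per folder ("as1", "as2", ...)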
G. Grothendieck
  • Thank you for your input! Here, the data structure is of cohort type, so it is more intuitive to keep it as a nested list: one list for each cohort follow-up. I will look into the Sys.glob() function. – HJ WHY Jun 10 '23 at 01:39
  • And using the as[1-9] pattern is great for parsimony. – HJ WHY Jun 10 '23 at 01:42

Thanks to @Gregor Thomas, the library calls could be tidied as follows:

library(tidyverse) #Includes: dplyr, stringr, tidyr, purrr
library(magrittr) #For the %<>% pipe
library(lubridate)
library(ExclusionTable)
library(zoo) #as.Date function
library(janitor) #For data cleaning
library(survival) #For Regression analysis
library(ggsurvfit) #For Regression analysis
library(gtsummary) #For Regression analysis 
library(tidycmprsk) #For Regression analysis
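
For reference, the masking problem behind dropping plyr (pointed out by Gregor Thomas in the comments) is easy to reproduce. This is only an illustration, assuming both packages are installed:

library(dplyr)
library(plyr)   # attached after dplyr, so plyr::summarise now masks dplyr::summarise

mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))
# returns a single overall row because plyr::summarise ignores the grouping;
# avoid by not attaching plyr, attaching it *before* dplyr, or calling dplyr::summarise() explicitly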

Thanks to @joran and @moodymudskipper, instead of using for loops and eval(parse(text = ...)) or get(paste0(...)), I have used vectors and lapply.

The key here is the use of nested lists: I have a list of folders, where each folder contains 8+ CSV files.

data_dir <- "C:/Users/thepr/Documents/data/as"
num_fu <- 1:9
dirs <- paste0(data_dir, num_fu)

as_list <- lapply(dirs, function(x) {
  files <- list.files(x, pattern = "\\.csv$", full.names = TRUE)
  names(files) <- str_sub(basename(files), 1, 6)
  Reduce(full_join, lapply(files, read.csv))
})
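
For convenience, the outer list can also be named after its source folders, so each cohort wave can be pulled by name (the naming below is just one choice, not required):

names(as_list) <- paste0("as", num_fu)   # as_list$as1, as_list$as2, ...
head(as_list[["as1"]])                   # merged dataframe for the first wave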

As for the follow-up and loss-to-follow-up rates, I will post a concise answer soon; a rough sketch of the idea is below.
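
This is only a sketch and has not been tested against the real data: it assumes the outer list has been named as1 ... as9 as above, that every merged dataframe has an RID column, and that wave n carries either an AS{n}_AREA or an AS{n}_DATA_CLASS column, as in the question.

# Rough sketch (untested): assumes names(as_list) <- paste0("as", num_fu) was run above
baseline <- as_list[["as1"]]

fu_summary <- lapply(2:9, function(i) {
  d   <- as_list[[paste0("as", i)]]
  col <- intersect(paste0("AS", i, c("_AREA", "_DATA_CLASS")), colnames(d))
  list(
    followed = d[d$RID %in% baseline$RID, c("RID", col)],  # wave-i rows with a baseline RID
    lost     = setdiff(baseline$RID, d$RID),               # baseline RIDs missing at wave i
    fu_rate  = nrow(d) / nrow(baseline)                    # same ratio as the original loop
  )
})
names(fu_summary) <- paste0("fu_", 1:8)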

HJ WHY