
I have ~50 data files (subjects) that I process individually before I combine them in a data.frame for modelling. I'm unsure how to best use {targets} for this.

I tried using dynamic branching, but I'm unsure how to keep track of subject IDs with this approach. In my current approach I have all data in a named list where the first-level names are subject IDs, but with targets the branch names are arbitrary.

I know this is not really a specific question, but I'm hoping to be pointed towards an appropriate solution instead of getting a "correct" answer to the wrong question.

JohannesNE
  • 1
    I think dynamic branching is probably the way to go: if any individual file changes, its branch is updated, and when new files come in, only those are processed. You also delay combining the files for as long as possible, because then the only expensive computations are the combining step plus the one file being processed – Bruno Oct 20 '21 at 20:51
  • 1
    Also, you don't need to keep track of anything; targets is responsible for checking whether new files were added to the path, or whether old files were changed or removed – Bruno Oct 20 '21 at 20:52

1 Answer

This is the pattern that I normally use in `_targets.R`:

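  library(targets)
  library(tarchetypes) # provides tar_files()
  tar_option_set(packages = c("dplyr", "readxl", "janitor"))

  list(
    # One branch per file; each file is tracked for changes
    tar_files(
      file_paths,
      list.files("file_paths_folder", full.names = TRUE)
    ),
    # Process each file in its own branch
    tar_target(
      processed_files,
      file_paths %>%
        readxl::read_excel() %>% # can be anything: read csv, parquet, etc.
        janitor::clean_names() %>% # start processing
        mutate(across(c(a, b, c), ~ as.Date(.x, format = "%Y-%m-%d"))), # can be really complex operations
      pattern = map(file_paths)
    )
  )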
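
One way to keep subject IDs attached to the branches is to write the ID into the data inside the per-file step, so it survives whatever names targets gives the branches. The following is only a sketch: `process_subject()` is a hypothetical helper, and it assumes CSV input and that the file name without its extension is the subject ID.

```r
# Hypothetical helper: read one subject's file and record where it came from.
process_subject <- function(path) {
  dat <- read.csv(path)
  # Assumption: the file name without extension is the subject ID.
  dat$subject_id <- tools::file_path_sans_ext(basename(path))
  dat
}

# Stand-alone demonstration with two temporary files (made-up data).
demo_dir <- file.path(tempdir(), "subjects_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(x = 1:2), file.path(demo_dir, "s01.csv"), row.names = FALSE)
write.csv(data.frame(x = 3:4), file.path(demo_dir, "s02.csv"), row.names = FALSE)

paths <- file.path(demo_dir, c("s01.csv", "s02.csv"))

# Combine all subjects; every row still knows which file it came from.
combined <- do.call(rbind, lapply(paths, process_subject))
```

In the pipeline above, the command of `processed_files` would then be something like `process_subject(file_paths)`. As far as I know, data frame branches are row-bound when a downstream target uses `processed_files` as a whole, so the `subject_id` column would carry through to the combined data.frame.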
Bruno
  • You will see a bunch of really complex IDs; they are mostly unreadable – Bruno Oct 20 '21 at 20:58
  • Thank you. If I later want the `processed_files` item corresponding to a specific data file, how would I do this? Also, if I create e.g. 3 features for each file as separate branched targets, and want to combine them into a data.frame with one row per file, how do I ensure that the branches line up correctly? – JohannesNE Oct 21 '21 at 05:41
  • 1
    If you are processing in parallel, you will need to sort the result afterwards; you can pass along some column with a date or something similar to sort on – Bruno Oct 21 '21 at 13:11
  • Thanks. I have both lists and data.frames, so I think I'll try to use attributes to keep track of the origins. – JohannesNE Oct 21 '21 at 13:29
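
The attribute idea from the last comment could look roughly like the sketch below. `tag_subject()` is a hypothetical helper, the file paths and `mean_hr` values are made up, and the subject ID is again assumed to be the file name without its extension. Note that many R operations (subsetting, for example) drop custom attributes, so it is probably safest to read the ID back out soon after the branch is created.

```r
# Hypothetical helper: attach the source file's subject ID as an attribute.
tag_subject <- function(x, path) {
  attr(x, "subject_id") <- tools::file_path_sans_ext(basename(path))
  x
}

# Made-up per-subject results, deliberately out of order.
results <- list(
  tag_subject(list(mean_hr = 61), "data/s02.csv"),
  tag_subject(list(mean_hr = 72), "data/s01.csv")
)

# Rebuild a named list from the attributes, then look up one subject.
names(results) <- vapply(results, attr, character(1), which = "subject_id")
s01_mean_hr <- results[["s01"]]$mean_hr

# One row per subject, ordered by ID rather than by branch position.
features <- data.frame(
  subject_id = names(results),
  mean_hr = vapply(results, function(r) r$mean_hr, numeric(1))
)
features <- features[order(features$subject_id), ]
```

Because every feature is matched on `subject_id` rather than on position, the order in which the branches finish should not matter.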