
I am looking for the most elegant way to loop through and read in multiple files organized by date, and to select the most recent value whenever anything changed, based on multiple keys.

Sadly, the reason I need to read in all the files, and not just the most recent one, is that a record can disappear from later files and I still want to capture its last known value.

Here is an example of what the files look like (I'm posting them comma separated even though they are fixed width):

file_20200101.txt
key_1,key_2,value,date_as_numb
123,abc,100,20200101
456,def,200,20200101
789,xyz,100,20200101
100,foo,15,20200101

file_20200102.txt
key_1,key_2,value,date_as_numb
123,abc,50,20200102
456,def,500,20200102
789,xyz,300,20200102

and an example of the desired output:

desired_df
key_1,key_2,value,date_as_numb
123,abc,50,20200102
456,def,500,20200102
789,xyz,300,20200102
100,foo,15,20200101

In addition, here is some code that I know works to read in multiple files and then get my ideal output, but I need it to happen inside of the loop: the data frame would be way too big if I imported and bound all of the files at once.

library(data.table)   # fread()
library(dplyr)
library(purrr)
library(stringr)

files <- list.files(path, pattern = "\\.txt$", full.names = TRUE)

df <- files %>%
  map(function(f) {
    
    print(f)
    
    df <- fread(f)
    # tag each row with the file it came from
    df <- df %>% mutate(file_name = basename(f))
    
    return(df)
    
  }) %>% bind_rows()

df <- df %>% 
  mutate(file_date = as.numeric(str_remove_all(file_name, "[^0-9]"))) %>% 
  group_by(key_1, key_2) %>% 
  filter(file_date == max(file_date))
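
To show what I mean by "inside of the loop", here is a rough sketch of the kind of thing I'm picturing (not tested on my real data; `running_latest` is just a placeholder name): keep one "latest value per key" table and update it file by file, so the full history never has to sit in memory at once.

# rough sketch only: update a running "latest per key" table one file at a time
library(data.table)   # fread()
library(dplyr)
library(stringr)

files <- sort(list.files(path, pattern = "\\.txt$", full.names = TRUE))

running_latest <- NULL   # placeholder name for the accumulated result

for (f in files) {

  new_dat <- fread(f) %>%
    mutate(file_date = as.numeric(str_remove_all(basename(f), "[^0-9]")))

  # stack the new file onto the running table, then keep only the newest row per key
  running_latest <- bind_rows(running_latest, new_dat) %>%
    group_by(key_1, key_2) %>%
    filter(file_date == max(file_date)) %>%
    ungroup()
}

That way the object in memory is only ever as big as the number of distinct key pairs, not the number of rows across all 1000+ files.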

Thanks in advance!

alexb523

1 Answer


I don't know what "way too big" means for you. data.table can handle really big data, so if bind_rows on the whole list is not OK, maybe use data.table instead.

(In my own experience, dplyr::group_by can be really slow with many groups (say 10^5 groups) in large-ish data (say around 10^6 rows). I don't have much experience with data.table, but every thread on the topic mentions its superiority for large data.)

I've used this answer for merging a list of data tables

library(data.table)
dt1 <- fread(text = "key_1,key_2,value,date_as_numb
123,abc,100,20200101
456,def,200,20200101
789,xyz,100,20200101
100,foo,15,20200101")

dt2 <- fread(text = "key_1,key_2,value,date_as_numb
123,abc,50,20200102
456,def,500,20200102
789,xyz,300,20200102")

ls_files <- list(dt1, dt2) 

# you would have created this list by calling fread on each file, like 
# ls_files <- lapply(files, fread)

# Now here's a two-liner with data.table.

# outer-join all tables into one; rows that appear in any file are kept
alldata <- Reduce(function(...) merge(..., all = TRUE), ls_files)

# for each key pair, keep the row with the most recent date
alldata[alldata[, .I[which.max(date_as_numb)], by = .(key_1, key_2)]$V1]
#>    key_1 key_2 value date_as_numb
#> 1:   100   foo    15     20200101
#> 2:   123   abc    50     20200102
#> 3:   456   def   500     20200102
#> 4:   789   xyz   300     20200102

Created on 2021-01-28 by the reprex package (v0.3.0)

tjebo
  • I see your point about big data, but R is kind of a memory hog. I've had a couple of experiences where it terminates even though there wasn't really "that much data". Reading it in like my example, I'm at about 2.7 mil rows with 5 files and have about 1020 files to go. I'll for sure try this solution though! – alexb523 Jan 28 '21 at 23:55
  • one of the reasons for memory inefficiency is the growing objects problem ([R inferno circle 2](https://www.burns-stat.com/pages/Tutor/R_inferno.pdf)) - so if you use loops, avoid that!! – tjebo Jan 29 '21 at 08:16
  • that seems really interesting, but sadly it's something I'm going to have to put on a long list of things to read about R :( I use `map` a lot, which does put the data into lists, and I have started to use lists more in the last year or so, but I haven't quite figured out how to process info like in my question with lists. – alexb523 Jan 29 '21 at 15:28
  • One thing I thought about last night is to just loop through the files and write them out, then read the file back in on my next iteration, something like the sketch below. So it would be like `if 1st file > read in txt > write out csv > next read in csv AND read in new txt > get most up to date instance > write out csv` – alexb523 Jan 29 '21 at 15:33
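
For what it's worth, a minimal sketch of that read-txt / combine / write-csv idea, reusing the data.table one-liner from the answer above; the checkpoint file name `latest.csv` and the digit-based date extraction are placeholder choices, not something from the thread:

library(data.table)

files <- sort(list.files(path, pattern = "\\.txt$", full.names = TRUE))
checkpoint <- "latest.csv"   # placeholder name for the running result on disk

for (f in files) {

  new_dat <- fread(f)
  new_dat[, file_date := as.numeric(gsub("[^0-9]", "", basename(f)))]

  # from the second file on, read the previous result back in and stack it with the new file
  combined <- if (file.exists(checkpoint)) rbind(fread(checkpoint), new_dat) else new_dat

  # keep the most recent row per key pair (same idiom as in the answer), then write it back out
  latest <- combined[combined[, .I[which.max(file_date)], by = .(key_1, key_2)]$V1]
  fwrite(latest, checkpoint)
}

Writing the checkpoint out on every iteration mainly buys crash-safety; if memory is the only concern, keeping the running table in memory between iterations (as in the sketch under the question) avoids the repeated read/write.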