0

A simplified version of my data looks like this:

primary_list <- list(secondary_list = list(df1 = data.frame(df_name = "hello", time = c(1,2,3,6,7,9,12,20), v1 = sample(0:1, 8, replace = T)),
                                           df2 = data.frame(df_name = "world", time = c(1,5,6,8,9,12,15,18,19,20), v1 = sample(0:1, 10, replace = T))),
                                           
                     secondary_list2 = list(df1 = data.frame(df_name = "hello", time = c(1,5,6,8,9,12,15,18,19,20), v1 = sample(0:1, 10, replace = T)),
                                            df2 = data.frame(df_name = "world", time = c(4,6,7,8,12), v1 = sample(0:1, 5, replace = T))))

My objective is to full_join all the dataframes in the secondary_lists (i.e., the ones with a common element name and df_name identifier) to a separate dataframe by 'time'. The aim is similar to this question but using a separate dataframe as the argument to full_join.

as a manual example, I'm trying to achieve something like this:

time_df <- data.frame(time = 1:25)

expected_output <- list(
  secondary_list = list(time_df, primary_list$secondary_list$df1, primary_list$secondary_list2$df1) %>% reduce(full_join, by = "time"),
  secondary_list2 = list(time_df, primary_list$secondary_list$df2, primary_list$secondary_list2$df2) %>% reduce(full_join, by = "time")
)

### Ideally I'm aiming for something cleaned up like this:                

ideal_output <- lapply(expected_output, function(x){
  x <- x %>% select(time,v1.x,v1.y)
  colnames(x) <- c("time", "v1", "v2")
  x
})

### I can probably handle that after

I'm trying to make a solution that works with a much larger list structure and computes the full_join for all the secondary_list elements

I've tried flattening the list like this but run into some complications indexing list elements sharing the same name. Also tried rbindlist to squash all the secondary lists into one dataframe but no joy there yet.

Solution does not have to be dplyr, just needs to work for n secondary_lists containing n dataframes of different lengths. For bonous points it would be great if the output does not repeat df_name.x, df_name.y as only one identifier column is necessary followed by the variables (see ideal_output) but I can clean that up afterwards easily enough.

Thanks in advance!

UPDATE I got three great answers to the question.

  • Ronak's is works great in a tidyverse pipeline workflow

  • ekoam's is a one liner and I'm all for succinct code

  • neon_ninja wins on the microbenchmark

Since I have a huge amount of data, speed is important so I'll accept neon_ninja's answer. Thanks all!

mbm = microbenchmark(
#####
ronak = primary_list %>%
  purrr::transpose() %>%
  map(~{
    .x$time_df <- time_df
    .x %>%
      reduce(full_join, by = 'time') %>% 
      select(time, df_name.x, starts_with("v")) ### Edited select fo n dfs
  }),

#####
ekoam = lapply(primary_list, compose(~select(., time, v1 = v1.x, v2 = v1.y), reduce), full_join, .init = time_df, by = "time"),
#####
neon_ninja = lapply(names(primary_list[[1]]), function(df_name) {
  time_df["df_name"] = df_name ### I added this to get a column of names
  time_df = data.frame(time = 1:25)
  for (sec_list_name in names(primary_list)) {
    time_df[sec_list_name] = NA
    df = primary_list[[sec_list_name]][[df_name]]
    for (i in 1:length(df$time)) {
      time_df[time_df$time == df$time[i], sec_list_name] = df$v1[i]
    }
  }
  time_df
})

)

#Unit: milliseconds
#       expr       min        lq      mean    median        uq      max neval
#      ronak 11.572526 12.507490 15.132911 13.491643 16.531374 28.11236   100
#      ekoam 10.710731 11.464450 13.665370 12.450013 14.204986 26.83166   100
# neon_ninja  1.116033  1.258137  1.784428  1.345762  1.643547 13.89803   100
QAsena
  • 603
  • 4
  • 9

3 Answers3

2

You can use purrr::transpose to get all the similar named dataframes in one list and then include time_df as a new dataframe in each list and do a full_join.

library(dplyr)
library(purrr)

primary_list %>%
  purrr::transpose() %>%
  map(~{
    .x$time_df <- time_df
    .x %>%
        reduce(full_join, by = 'time') %>% 
        select(time, v1 = v1.x, v2 = v1.y)
    })
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Hey thanks, I think this is pretty close to what I'm after but I do need the `full_join` to be to a separate dataframe and to keep the variables (v1.x, v1.y...) as columns. I'll edit the question a bit :). – QAsena Oct 19 '20 at 21:52
  • @QAsena So you don't want `df_name` in the output? Can you check my updated answer if that is what you want? – Ronak Shah Oct 19 '20 at 23:44
  • yeah @Ronak Shah that's closer! But you see in the manual example, the `full_join` uses a separate `time_df` as the first argument for all the joins? I'm just trying to see where I can insert that as a unchanging argument into the `map`. @neon_ninja answer seems to work. – QAsena Oct 20 '20 at 00:23
  • @QAsena Ok...I completely missed the `time_df` part. I have included that in the answer above. – Ronak Shah Oct 20 '20 at 00:50
  • Awesome! Some new syntax for me to learn in there. Thanks a lot! – QAsena Oct 20 '20 at 03:34
1

How about this?

c(list(time_df), unlist(primary_list, False, False)) %>% reduce(full_join, by = "time")

Update

lapply(primary_list, compose(~select(., time, v1 = v1.x, v2 = v1.y), reduce), full_join, .init = time_df, by = "time")

Output

$secondary_list
   time v1 v2
1     1  1  1
2     2  0 NA
3     3  1 NA
4     4 NA NA
5     5 NA  1
6     6  1  0
7     7  1 NA
8     8 NA  1
9     9  0  0
10   10 NA NA
11   11 NA NA
12   12  0  0
13   13 NA NA
14   14 NA NA
15   15 NA  0
16   16 NA NA
17   17 NA NA
18   18 NA  0
19   19 NA  0
20   20  1  1
21   21 NA NA
22   22 NA NA
23   23 NA NA
24   24 NA NA
25   25 NA NA

$secondary_list2
   time v1 v2
1     1  1 NA
2     2 NA NA
3     3 NA NA
4     4 NA  1
5     5  0 NA
6     6  0  1
7     7 NA  1
8     8  1  1
9     9  1 NA
10   10 NA NA
11   11 NA NA
12   12  1  1
13   13 NA NA
14   14 NA NA
15   15  1 NA
16   16 NA NA
17   17 NA NA
18   18  1 NA
19   19  0 NA
20   20  0 NA
21   21 NA NA
22   22 NA NA
23   23 NA NA
24   24 NA NA
25   25 NA NA
ekoam
  • 8,744
  • 1
  • 9
  • 22
  • Thanks! This is pretty close but returns 1 list where I'm expecting 2 (one list of all the df1s and another for the df2s) I'll edit the question to include a more complete expected output :) – QAsena Oct 19 '20 at 22:00
  • Genius! Works great, appreciate it :) – QAsena Oct 20 '20 at 03:33
1

Here's how I would solve this:

lapply(names(primary_list[[1]]), function(df_name) {
  time_df = data.frame(time = 1:25)
  for (sec_list_name in names(primary_list)) {
    time_df[sec_list_name] = NA
    df = primary_list[[sec_list_name]][[df_name]]
    for (i in 1:length(df$time)) {
      time_df[time_df$time == df$time[i], sec_list_name] = df$v1[i]
    }
  }
  time_df
})

If you need to find out your time range, here's a one liner to do that:

range(unlist(lapply(unlist(primary_list, F, F), function(df) { df$time })))

Similarly if you need to find out df names:

unique(unlist(lapply(primary_list, names)))

neon_ninja
  • 61
  • 2