
Given a list of dataframes, like this

[[1]]
ID    X1   X2   X3   X4   X5
123   1    2    1    0    4
123   2    4    1    2    2
123   0    0    1    0    2
567   2    3    4    0    2
899   2    3    5    3    5

[[2]]
ID    X1   X2   X3   X4
123   7    1    3    2
452   1    3    6    3
899   2    1    4    2

[[3]]
ID    X1   X2   X3   X4   Y1
123   4    2    5    0    1
567   0    1    2    0    2
567   1    2    1    0    3
101   2    3    7    2    1

I want to aggregate all the data frames by "ID", summing every other column. For this illustration the result should look like this

new_df
ID    X1   X2   X3   X4   X5   Y1
123   14   9    11   4    8    1
567   3    6    7    0    2    5    
899   4    4    9    5    5    0
452   1    3    6    3    0    0
101   2    3    7    2    0    1

This is only an illustration: the original list has 51 dfs and each df has hundreds of columns, so the function cannot name the columns to aggregate; it should apply to all of them.

Thanks in advance

  • Possible duplicate https://stackoverflow.com/questions/52449113/merge-2-data-frames-by-row-and-column-overlap – markus Oct 16 '18 at 21:49
  • If the error happens with `bind_rows` ... perhaps `data.table::rbindlist` *might* be able to do it (I don't hold much confidence) ... perhaps your best bet is to normalize them (same columns names, meaning you'll need to "fill" some empty columns), dump them to a single file, then try again. How disparate are the columns? That is, in the example you provided, there is a column missing from 1 to 2, then a new column in 3 ... are the number of differences relatively small (1s or 2s) or do they vary significantly? – r2evans Oct 16 '18 at 22:56
  • data.table::rbindlist didn't work. The columns are very disparate: there are about 22-40 columns in each df and all of them are different from each other. What repeats are the IDs. – André Segadas Figueiredo Oct 17 '18 at 16:33
  • Now it worked! I aggregated each one of the DFs in the list individually, since the columns in the original list differ from each other, and then merged all the dfs with Reduce (see the sketch below). – André Segadas Figueiredo Oct 17 '18 at 18:17
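
A minimal sketch of that aggregate-then-merge approach, assuming (as the comments above state) that apart from ID the column names do not overlap between data frames; df_list is the list of data frames, as named in the answer below:

library(dplyr)

# aggregate each data frame by ID on its own columns first
aggregated <- lapply(df_list, function(df) {
  df %>%
    group_by(ID) %>%
    summarize_all("sum", na.rm = TRUE)
})

# then full-join everything on ID; the NAs that appear where an ID is
# missing from a given data frame are treated as zeros
result <- Reduce(function(x, y) merge(x, y, by = "ID", all = TRUE), aggregated)
result[is.na(result)] <- 0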

2 Answers


We can use dplyr::summarize_all

# some sample data (different from yours, just to illustrate)
df_list = list(
  data.frame(ID = c(123, 123, 234), x1 = c(1, 2, 3), x2 = 2:4),
  data.frame(ID = c(123, 123, 234), x1 = c(1, 2, 3), x2 = 2:4, x3 = 4:6),
  data.frame(ID = c(123, 123), x1 = c(1, 2), x3 = 2:3)
)

library(dplyr)
bind_rows(df_list) %>%
  group_by(ID) %>%
  summarize_all("sum", na.rm = TRUE)
# # A tibble: 2 x 4
#      ID    x1    x2    x3
#   <dbl> <dbl> <int> <int>
# 1   123     9    10    14
# 2   234     6     8     6

Edit: Since you're bumping up against memory constraints, the best you can do in dplyr is to aggregate each data frame individually, thus reducing its size, before combining and aggregating again. I would suggest a simple for loop to avoid copying all the data at once:

for (i in seq_along(df_list)) {
  df_list[[i]] = df_list[[i]] %>%
    group_by(ID) %>%
    summarize_all("sum", na.rm = TRUE)
}
# Then use the code from above
result = bind_rows(df_list) %>%
  group_by(ID) %>%
  summarize_all("sum", na.rm = TRUE)

The extent to which this will help depends completely on how much ID repetition there is within each data frame. If this still doesn't work then the next thing to try would be using data.table, which allows modifications in place:

library(data.table)

for (i in seq_along(df_list)) {
  setDT(df_list[[i]]) # convert data frames to data.tables
  # pre-aggregate to reduce size
  df_list[[i]] = df_list[[i]][ , lapply(.SD, sum, na.rm = TRUE), by = .(ID)]  
}

# combine and aggregate
big_dt = rbindlist(df_list, fill = TRUE)
big_dt = big_dt[ , lapply(.SD, sum, na.rm = TRUE), by = .(ID)]

The data.table solution should be quite memory efficient. If you still run into memory problems, make sure your workspace is as empty as possible and you don't have other applications eating up memory either. If that doesn't work, turn to the R-FAQ Cannot allocate vector of size n for additional tips (like using a computer with more memory or using packages that allow for out-of-memory calculations).
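
A quick housekeeping sketch (illustrative only; assumes you run it at the top level and that the list of data frames is the only object you still need):

# list objects in the global environment by size, largest first
sort(sapply(ls(), function(x) object.size(get(x))), decreasing = TRUE)

# keep only what is still needed and return freed memory where possible
rm(list = setdiff(ls(), "df_list"))
gc()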

Gregor Thomas
> microbenchmark(
+     test1 <- df_list %>% 
+     ldply %>% 
+     group_by(ID) %>% 
+     summarize_all('sum', na.rm=TRUE), 
+     
+     test2 <- df_list %>% 
+         ldply %>% 
+         ddply(.(ID), summarize, 
+               x1 = sum(x1, na.rm=TRUE), 
+               x2 = sum(x2, na.rm=TRUE), 
+               x3 = sum(x3, na.rm=TRUE)), 
+     
+     test3 <- df_list %>% 
+         bind_rows %>% 
+         group_by(ID) %>% 
+         summarize_all('sum', na.rm=TRUE), 
+     
+     test4 <- df_list %>% 
+         bind_rows %>% 
+         ddply(.(ID), summarize, 
+               x1 = sum(x1, na.rm=TRUE), 
+               x2 = sum(x2, na.rm=TRUE), 
+               x3 = sum(x3, na.rm=TRUE)), 
+     
+     test5 <- rbindlist(df_list, fill = TRUE) %>% 
+         .[ , lapply(.SD, sum, na.rm = TRUE), by = .(ID)]
+     )
Unit: milliseconds
  expr      min       lq     mean   median       uq       max neval cld
 test1 2.798383 2.977228 3.147271 3.130156 3.279235  3.840657   100   c
 test2 2.797870 3.034704 3.408260 3.220733 3.340305 22.279300   100   c
 test3 2.146642 2.309835 2.633888 2.429919 2.563603 20.407723   100  b
 test4 2.254924 2.431202 2.598176 2.533325 2.688306  3.581499   100  b
 test5 1.209571 1.331708 1.430511 1.423055 1.508243  2.172301   100 a

(test1 through test5 correspond, in order, to the five expressions in the call above.)

Here I compare the efficiency of the approaches from @Gregor's answer.
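
The transcript above presumes a setup roughly like the following (not shown in the original): plyr supplies ldply, ddply and .(), dplyr supplies bind_rows, group_by and summarize_all, data.table supplies rbindlist, and df_list is the sample list from @Gregor's answer.

library(microbenchmark)
library(plyr)
library(dplyr)       # loaded after plyr, so dplyr's verbs mask plyr's where names clash
library(data.table)

# same sample data as in the answer above
df_list <- list(
  data.frame(ID = c(123, 123, 234), x1 = c(1, 2, 3), x2 = 2:4),
  data.frame(ID = c(123, 123, 234), x1 = c(1, 2, 3), x2 = 2:4, x3 = 4:6),
  data.frame(ID = c(123, 123), x1 = c(1, 2), x3 = 2:3)
)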