0

I've done some research about this (here and here), but I haven't found what I actually want to achieve. The closest I've found to what I'm looking for is here, but the code doesn't seem to work or do what I desire. Besides, I found out that rbindlist has been deprecated in favour of bind_rows, but I haven't been able to use bind_rows to achieve what I want.

I have a list of 30 dataframes each with the same number of rows and columns, as well as the same column datatypes (though each column could be either continuous and categorical). I want to merge them into a single dataframe of the same number of rows and columns, but with each cell as a mean/median/majority voting of the corresponding 30 cells from the list of dataframes, for continuous, integer, and categorical columns, respectively. Here's an example with three dataframes:

df 1:

A       B       C
2.3     5       3
12      3       1
0.4     13      2


df_2:

A       B       C
4.3     23      1
1       7       2
0.4     10      2

df_3:

A       B       C
1.3      3      3
2.2      4      2
12.4     10     1

And the resulting dataframe would be something like:

df_result:

A       B        C
2.63     5       3
5.06     4       2
4.4     10       2

Any directions to more appropriate ways of combining each of the datatypes would also be highly appreciated.

ayePete
  • 411
  • 1
  • 7
  • 23

2 Answers2

2

Put a ROW ID on your tables

df_1 <- read_table("A       B       C
2.3     5       3
12      3       1
0.4     13      2") %>% 
  rowid_to_column("ROW") 


df_2 <- read_table("A       B       C
4.3     23      1
1       7       2
0.4     10      2") %>% 
  rowid_to_column("ROW") 

df_3 <- read_table("A       B       C
1.3      3      3
2.2      4      2
12.4     10     1") %>% 
  rowid_to_column("ROW")

Bind them together in an ensemble

ensamb <- bind_rows(df_1, df_2, df_3)

group_by row and then summarize each one by its own method

ensamb %>% 
  group_by(ROW) %>% 
  summarise(A = mean(A), B = median(B), 
            C = C[which.max(C)])


# A tibble: 3 x 4
    ROW     A     B     C
  <int> <dbl> <dbl> <dbl>
1     1  2.63     5     3
2     2  5.07     4     2
3     3  4.4     10     2
ayePete
  • 411
  • 1
  • 7
  • 23
David T
  • 1,993
  • 10
  • 18
  • Many thanks, David! Do you think there is a way I could use an `if` statement to apply either mean, median, or mode to each column by type? Listing columns one by one would be quite tedious, considering I have over 150 columns... – ayePete May 23 '20 at 23:27
  • 1
    Yes. Name them in a way that makes it straightforward to distinguish between the three column-types. Then use `summarise_at` to choose which ones to `mean`, `median` and majority. I can't get to it right now. Why don't you _accept_ one of the two answers you have, and then research `summarise_at`? If you get stuck, ask it as a separate question. – David T May 24 '20 at 00:50
  • 1
    I know you're kind of new; not long ago I was too. A little StackOverlfow etiquette: if somebody's answered a question for you, #ACCEPT# their answer. You've received several good answers; choose one to accept. You do that by clicking the gray checkmark (upper left of the post) and turning it green. You can also upvote a post with the arrows above the check. ALSO, you can upvote comments. – David T May 24 '20 at 17:54
  • All right. Thanks for the answers, David. I do know how to do the things you've said, but I believe you accept the response that adequately answers your question. I already upvoted your answer, and I'll upvote your comments too. I've just had a few things keeping me quite busy, hence the delay in upvoting. Cheers. – ayePete May 25 '20 at 11:19
  • Just a quick note for those who may be interested in this solution: `group_by(ROW) %>% summarize_if(is.numeric, mean)` was what eventually worked for me (as an example). For majority voting, `which.max()` doesn't give the desired result (only the max value of a vector, not the value with the max frequency), but @Ronak's linked `Mode` function does. – ayePete May 25 '20 at 17:38
  • *first index of the max value – ayePete May 25 '20 at 17:45
1

You can put all the dataframes in a list :

list_df <- mget(ls(pattern = 'df_\\d+'))

Then calculate the stats for each column separately.

data.frame(A = Reduce(`+`, lapply(list_df, `[[`, 1))/length(list_df),
           B = apply(do.call(rbind, lapply(list_df, `[[`, 2)), 2, median),
           C = apply(do.call(rbind, lapply(list_df, `[[`, 3)), 2, Mode), 
           row.names = NULL)

#         A  B C
#1 2.633333  5 3
#2 5.066667  4 2
#3 4.400000 10 2

where Mode function is taken from here :

Mode <- function(x) {
   ux <- unique(x)
   ux[which.max(tabulate(match(x, ux)))]
}
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Much appreciated. Is there any particular reason why you didn't use the mean function in the first line (for column A)? – ayePete May 23 '20 at 23:23
  • 1
    `mean` function gives you one mean for entire data but here you need elementwise mean. You can try it and see what output you get. – Ronak Shah May 23 '20 at 23:34