compare two dataframes and get maximum values

Question

I am trying to compare two data frames (df1, df2) of same structure (same dimensions, column names, row names, etc) and keep the maximum values between the two data frames. I actually have hundreds of columns and rows, but here is some pretend data:

df1:
Date       Fruit  Num  Color 
2013-11-24 Banana 2 Yellow
2013-11-24 Orange  8 Orange
2013-11-24 Apple   7 Green
2013-11-24 Celery 10 Green

df2:
Date       Fruit  Num  Color 
2013-11-24 Banana 22 Yellow
2013-11-24 Orange  8 Orange
2013-11-24 Apple   7 Green
2013-11-24 Celery 1 Green

There are many examples on SO doing similar things but in python not R: Comparing two dataframes and getting the differences, Compare two dataframes to get comparison value in in another dataframe etc.

I tried a dplyr approach but I don't know how to do this correctly for all the columns (hundreds).

library(dplyr)
test <- rbind(df1, df2)
test2 <- test %>%
  group_by(Date) %>%
summarise(max = max(.))

Given my pretend data above, the desired output should be:

new.df:
Date       Fruit  Num  Color 
2013-11-24 Banana 22 Yellow
2013-11-24 Orange  8 Orange
2013-11-24 Apple   7 Green
2013-11-24 Celery 10 Green

Thanks for the help.

You could join the tables together by date, fruit, and color and then `mutate` a new column with the maximum of the two Num values. — cardinal40, May 20 '19 at 21:46
That sounds like a nice approach too - I'll see if I can figure out how.. — KNN, May 21 '19 at 14:33

Jilber Urbina · Answer 1 · 2019-05-21T15:45:33.387

1

Try this:

test %>%
  group_by_if(.,is.factor) %>%
  summarise_if(is.numeric, max)

# A tibble: 4 x 4
# Groups:   Date, Fruit [?]
  Date       Fruit  Color    Num
  <fct>      <fct>  <fct>  <dbl>
1 2013-11-24 Apple  Green      7
2 2013-11-24 Banana Yellow    22
3 2013-11-24 Celery Green     10
4 2013-11-24 Orange Orange     8

edited May 21 '19 at 15:45

answered May 20 '19 at 21:46

Jilber Urbina

58,147
10
114
138

@Jiber Urbina This there a way to use this approach without having to specify each column by name? Not possible in my case since I have hundreds of columns.. – KNN May 21 '19 at 14:29
1

@KNN, Yes, there is! see my edit. Just use `group_by_if()` – Jilber Urbina May 21 '19 at 15:46

M-- · Accepted Answer · 2019-05-20T22:46:38.067

One possibility is grouping by all the non-numeric columns and then getting the max for numeric ones:

library(tidyverse)

rbind(df1, df2) %>%
    group_by_at(vars(one_of(names(select_if(df2,negate(is.numeric)))))) %>%
    summarise_if(is.numeric, max)

#> # A tibble: 4 x 4
#> # Groups:   Date, Fruit [4]
#>   Date       Fruit  Color    Num
#>   <fct>      <fct>  <fct>  <dbl>
#> 1 2013-11-24 Apple  Green      7
#> 2 2013-11-24 Banana Yellow    22
#> 3 2013-11-24 Celery Green     10
#> 4 2013-11-24 Orange Orange     8

^{Created on 2019-05-20 by the reprex package (v0.2.1)}

You can also try joining two dataframes and then keeping the maximum values:

df1 %>% right_join(df2, by=c("Date","Fruit","Color")) %>% 
        mutate(Num = pmax(Num.x, Num.y)) %>% select(-Num.x, -Num.y)

Great! I did a group_by after the rbind followed by your answer and this seems to work perfectly. Thank you and thank you to all the replies. — KNN, May 21 '19 at 14:32

score -1 · Answer 3 · answered May 20 '19 at 21:56

-1

or try

set.seed(1234) 
df1 = data.frame(x1=runif(10),x2= 10*runif(10))
df2 = data.frame(x1=runif(10),x2= 20*(runif(10)-0.5))

(df1$x2 >= df2$x2)* df1$x2 + (df1$x2 < df2$x2)* df2$x2

answered May 20 '19 at 21:56

greengrass62

968
7
19

compare two dataframes and get maximum values

3 Answers3