2

I have a dataframe as below. I want to combine rows based on duplicates in column person. However, is it possible for specified columns (Beer, Cola, Wodka in this case) that a certain value (1 in this case) overrules other values (0 in this case).

Current dataframe:

person <- c("John", "John", "Alex", "Nicole", "Nicole")
Sex <- c("M","M","W", "W", "W")
Beer <- c(1,1,1,1,0)
Cola <- c(0,1,0,0,0)
Wodka <- c(0,1,0,0,1)
df <- data.frame(person,Sex,Beer,Cola,Wodka)

Outcome should be:

person <- c("John", "Alex", "Nicole")
Sex <- c("M", "W", "W")
Beer <- c(1,1,1)
Cola <- c(1,0,0)
Wodka <- c(1,0,1)
df <- data.frame(person,Sex,Beer,Cola,Wodka)

Thanks.

joffie
  • 205
  • 2
  • 12

3 Answers3

1

Using dplyr, you can summarise() to get one row per person, and take the maximum of the specified columns:

library(tidyverse)

person <- c("John", "John", "Alex", "Nicole", "Nicole")
Sex <- c("M", "M", "W", "W", "W")
Beer <- c(1, 1, 1, 1, 0)
Cola <- c(0, 1, 0, 0, 0)
Wodka <- c(0, 1, 0, 0, 1)

df <- data.frame(person, Sex, Beer, Cola, Wodka)

df %>% 
  group_by(person, Sex) %>% 
  summarise(across(c(Beer, Cola, Wodka), max))
#> `summarise()` regrouping output by 'person' (override with `.groups` argument)
#> # A tibble: 3 x 5
#> # Groups:   person [3]
#>   person Sex    Beer  Cola Wodka
#>   <chr>  <chr> <dbl> <dbl> <dbl>
#> 1 Alex   W         1     0     0
#> 2 John   M         1     1     1
#> 3 Nicole W         1     0     1
Mikko Marttila
  • 10,972
  • 18
  • 31
0

Suggest to use dplyr library from tidyverse. Then this should do what you want to achieve:

df %>%
    group_by(person) %>% 
    summarize(Beer = max(Beer), Cola = max(Cola), Wodka = max(Wodka), Sex = max(Sex))

person <- c("John", "Alex", "Nicole")
Sex <- c("M", "W", "W")
Beer <- c(1,1,1)
Cola <- c(1,0,0)
Wodka <- c(1,0,1)
df <- data.frame(person,Sex,Beer,Cola,Wodka)
Wolfgang Arnold
  • 1,252
  • 8
  • 17
0

A simple base R solution can be:

#Split according to persons
#Every element of the list personSplit is a dataframe containing all available
#informations regarding one person
personSplit <- split(df,df$person)

#Out of these informations, choose the one value overruling each other.
#In my case, overruling only applies to numeric values, where you can simply take the max.
#For non-numerics, I simply use the first value.
valuesToTake <- lapply(personSplit, function(personalInfoDf) {
  vals <- lapply(personalInfoDf, function(column) {
    if(is.numeric(column)) {
      max(column, na.rm=T)
    } else {
      column[1]
    }
  })
  data.frame(vals)
})

result <- do.call("rbind",valuesToTake)
rownames(result) <- NULL
result
Jonas
  • 1,760
  • 1
  • 3
  • 12