Create a new data frame with condition in group with R

Question

I need some help using r data frames. Here is the data frame:

group   col1    col2    name
1       dog     40      canidae
1       dog     40      canidae
1       dog     40      canidae
1       dog     40      canidae
1       dog     40  
1       dog     40      canidae
1       dog     40      canidae
2       frog    85      dendrobatidae
2       frog    89      leptodactylidae
2       frog    89      leptodactylidae
2       frog    82      leptodactylidae
2       frog    89 
2       frog    81 
2       frog    89      dendrobatidae
3       horse   87      equidae1
3       donkey  76      equidae2
3       zebra   67      equidae3
4       bird    54      psittacidae
4       bird    56  
4       bird    34  
5       bear    67    
5       bear    54

What I would like to get is to add a column "consensus_name" an get :

group col1   col2 name              consensus_name
1     dog    40   canidae           canidae
1     dog    40   canidae           canidae
1     dog    40                     canidae
1     dog    40   canidae           canidae
1     dog    40   canidae           canidae
2     frog   85   dendrobatidae     leptodactylidae
2     frog   89   leptodactylidae   leptodactylidae
2     frog   89   leptodactylidae   leptodactylidae
2     frog   82   leptodactylidae   leptodactylidae
2     frog   89                     leptodactylidae
2     frog   81                     leptodactylidae
2     frog   89   dendrobatidae     leptodactylidae
3     horse  87   equidae1          equidae3
3     donkey 76   equidae2          equidae3
3     zebra  67   equidae3          equidae3
4     bird   54   psittacidae       psittacidae
4     bird   56                     psittacidae
4     bird   34                     psittacidae
5     bear   67                     NA
5     bear   54                     NA

In order to get this new column for each group, I get the name which is the most representative of the group.

For the group1 there are 4 rows with the name 'canidae' and one with nothing, so for each one I write 'canidae' in the column consensus_name
For the group2 there are 2 rows with the name 'dendrobatidae', 2 with nothing and 3 rows with the name 'leptodactylidae' so for each one I write 'leptodactylidae' in the column consensus_name.
For the group3 there are 3 rows with different names, so because there is no consensus, I get the name which as the lowest col2 number, so I write 'equidae3' in the column consensus_name.
For the group 4 only one row have an information, so it is the consensus_name of the group4, so I write psittacidae in the column consensus_name.
For the group5 there is none informations, then just write NA in the consensus_name column.

Does anyone have any idea to do it with R ? Thank for your help :)

Here is the df:

structure(list(group = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L), col1 = structure(c(2L, 
2L, 2L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 3L, 6L, 
1L, 1L, 1L), .Label = c("bird", "dog", "donkey", "frog", "horse", 
"zebra"), class = "factor"), col2 = c(40L, 40L, 40L, 40L, 40L, 
40L, 40L, 85L, 89L, 89L, 82L, 89L, 81L, 89L, 87L, 76L, 67L, 54L, 
56L, 34L), name = structure(c(2L, 2L, 2L, 2L, 1L, 2L, 2L, 3L, 
7L, 7L, 7L, 1L, 1L, 3L, 4L, 5L, 6L, 8L, 1L, 1L), .Label = c("", 
"canidae", "dendrobatidae", "equidae1", "equidae2", "equidae3", 
"leptodactylidae", "psittacidae"), class = "factor")), class = "data.frame", row.names = c(NA, 
-20L))

the real one has around 50 000 rows.

Welcome to Stack Overflow! Help others help you; rather than post your dataframe as you have, can you edit your question to instead present the dataframe using the output of `dput()`? (See [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for more details -- it makes it easier on answerers). Also, how large is your dataframe? (Asking to see if performance is a serious issue). — duckmayr, Feb 24 '19 at 12:44
Thanks for all these tips, I adde theses informations in the post. — chippycentra, Feb 24 '19 at 12:51

arg0naut91 · Answer 1 · 2019-02-24T15:07:22.950

Could go for:

library(dplyr)

df %>%
  add_count(group, name) %>% 
  group_by(group) %>%
  mutate(
    consensus_name = case_when(
      n_distinct(n[!name == ""]) > 1 ~ name[which.max(n[!name == ""])],
      TRUE ~ name[which.min(col2[!name == ""])]
    ),
    n = NULL
  )

Output:

# A tibble: 20 x 5
# Groups:   group [4]
   group col1    col2 name            consensus_name 
   <int> <fct>  <int> <fct>           <fct>          
 1     1 dog       40 canidae         canidae        
 2     1 dog       40 canidae         canidae        
 3     1 dog       40 canidae         canidae        
 4     1 dog       40 canidae         canidae        
 5     1 dog       40 ""              canidae        
 6     1 dog       40 canidae         canidae        
 7     1 dog       40 canidae         canidae        
 8     2 frog      85 dendrobatidae   leptodactylidae
 9     2 frog      89 leptodactylidae leptodactylidae
10     2 frog      89 leptodactylidae leptodactylidae
11     2 frog      82 leptodactylidae leptodactylidae
12     2 frog      89 ""              leptodactylidae
13     2 frog      81 ""              leptodactylidae
14     2 frog      89 dendrobatidae   leptodactylidae
15     3 horse     87 equidae1        equidae3       
16     3 donkey    76 equidae2        equidae3       
17     3 zebra     67 equidae3        equidae3       
18     4 bird      54 psittacidae     psittacidae    
19     4 bird      56 ""              psittacidae    
20     4 bird      34 ""              psittacidae

Hi thank for your help, I just added one group to show that sometime I can get none information about the name in one groupe, it is the cas of the groupe 5, in such a case I would like to write NA in the row consensus_name. — chippycentra, Feb 24 '19 at 17:42

moodymudskipper · Accepted Answer · 2019-02-24T14:27:39.680

We can use the definition of Mode from this answer (or any packaged version of the mode, but R doesn't have a native one unfortunately) and use dplyr to mutate at the relevant places.

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

library(dplyr)
df %>%
  group_by(group) %>%
  mutate(consensus_name=Mode(setdiff(name[order(col2)],"")))
# # A tibble: 20 x 5
# # Groups:   group [4]
#    group col1    col2 name            consensus_name 
#    <int> <fct>  <int> <fct>           <chr>          
#  1     1 dog       40 canidae         canidae        
#  2     1 dog       40 canidae         canidae        
#  3     1 dog       40 canidae         canidae        
#  4     1 dog       40 canidae         canidae        
#  5     1 dog       40 ""              canidae        
#  6     1 dog       40 canidae         canidae        
#  7     1 dog       40 canidae         canidae        
#  8     2 frog      85 dendrobatidae   leptodactylidae
#  9     2 frog      89 leptodactylidae leptodactylidae
# 10     2 frog      89 leptodactylidae leptodactylidae
# 11     2 frog      82 leptodactylidae leptodactylidae
# 12     2 frog      89 ""              leptodactylidae
# 13     2 frog      81 ""              leptodactylidae
# 14     2 frog      89 dendrobatidae   leptodactylidae
# 15     3 horse     87 equidae1        equidae3       
# 16     3 donkey    76 equidae2        equidae3       
# 17     3 zebra     67 equidae3        equidae3       
# 18     4 bird      54 psittacidae     psittacidae    
# 19     4 bird      56 ""              psittacidae    
# 20     4 bird      34 ""              psittacidae

Hi thank for your help, I just added one group to show that sometime I can get none information about the name in one groupe, it is the cas of the groupe 5, in such a case I would like to write NA in the row consensus_name. — chippycentra, Feb 24 '19 at 17:43

score 0 · Answer 3 · answered Feb 24 '19 at 13:37

I'd second the use of the Mode function. Also, as an altnerative to dplyr you can do these types of operations quickly using the data.table package:

library('data.table')
Mode <- function(x) {
  ux <- setdiff(unique(x), '')
  ux[which.max(tabulate(match(x, ux)))]
}

dt <- as.data.table(df)
dt[, consensus_name:= Mode(name), by = .(group)]

Output:

dt
   group   col1 col2            name  consensus_name
 1:     1    dog   40         canidae         canidae
 2:     1    dog   40         canidae         canidae
 3:     1    dog   40         canidae         canidae
 4:     1    dog   40         canidae         canidae
 5:     1    dog   40                         canidae
 6:     1    dog   40         canidae         canidae
 7:     1    dog   40         canidae         canidae
 8:     2   frog   85   dendrobatidae leptodactylidae
 9:     2   frog   89 leptodactylidae leptodactylidae
10:     2   frog   89 leptodactylidae leptodactylidae
11:     2   frog   82 leptodactylidae leptodactylidae
12:     2   frog   89                 leptodactylidae
13:     2   frog   81                 leptodactylidae
14:     2   frog   89   dendrobatidae leptodactylidae
15:     3  horse   87        equidae1        equidae1
16:     3 donkey   76        equidae2        equidae1
17:     3  zebra   67        equidae3        equidae1
18:     4   bird   54     psittacidae     psittacidae
19:     4   bird   56                     psittacidae
20:     4   bird   34                     psittacidae

I'd rather have `name[name != '']` in the data.table call and not modify the `Mode` function, because as it is it's a mode that is really good only for this issue. — moodymudskipper, Feb 24 '19 at 14:13
You neglect the condition that using the lowest `col2` number in group 3. `consensus_name` in group 3 should be `equidae3`, not `equidae1`. — Darren Tsai, Feb 24 '19 at 14:51

Darren Tsai · Answer 4 · 2019-02-24T17:58:37.850

This is another dplyr solution. It uses given functions in the package to simplify some commands, e.g. min(), max(), which().

df %>% filter(name != "") %>%
       group_by(group) %>%
       add_count(name) %>%
       top_n(1, n) %>%
       top_n(-1, col2) %>%
       distinct(consensus_name = name) %>%
       right_join(df)  # edit: not left_join()

top_n(1, n) : It's equivalent to filter(n == max(n))
top_n(-1, col2) : It's equivalent to filter(col2 == min(col2))

Note: The two top_n() commands cannot be merged to

filter(n == max(n) & col2 == min(col2))

because the two logical statements are successive, not simultaneous.

Output

   group consensus_name  col1    col2 name           
   <int> <fct>           <fct>  <int> <fct>          
 1     1 canidae         dog       40 canidae        
 2     1 canidae         dog       40 canidae        
 3     1 canidae         dog       40 canidae        
 4     1 canidae         dog       40 canidae        
 5     1 canidae         dog       40 ""             
 6     1 canidae         dog       40 canidae        
 7     1 canidae         dog       40 canidae        
 8     2 leptodactylidae frog      85 dendrobatidae  
 9     2 leptodactylidae frog      89 leptodactylidae
10     2 leptodactylidae frog      89 leptodactylidae
11     2 leptodactylidae frog      82 leptodactylidae
12     2 leptodactylidae frog      89 ""             
13     2 leptodactylidae frog      81 ""             
14     2 leptodactylidae frog      89 dendrobatidae  
15     3 equidae3        horse     87 equidae1       
16     3 equidae3        donkey    76 equidae2       
17     3 equidae3        zebra     67 equidae3       
18     4 psittacidae     bird      54 psittacidae    
19     4 psittacidae     bird      56 ""             
20     4 psittacidae     bird      34 ""

@chippycentra I make a mistake in the last line of my code. I edit it. Just change `left_join()` to `right_join()` and the new case will be solved. — Darren Tsai, Feb 24 '19 at 17:55

Create a new data frame with condition in group with R

4 Answers4