0

I have a data set with a column names and a binary (0/1) column variable called column1. I would like to add three columns:

  • percentage, that summarizes for each person all 1's and divide it by total for that person, thus every 1's + 0's. For John, this value would be 0.667 (2/3), for Rick this would be 0.5 (1/2) etc.
  • total1 and total2 as summations of respectively the 1's or 0's from the column1, again for each person in names.

Example script + desired output. How to best approach this?

names <- c("John", "John", "John", "John", "Rick", "Rick", "Katie", "Katie", "Katie", "Harry", "Harry" )
column1 <- c(1,0,1,NA,0,1,1,1,0,1,1)
df1 <- data.frame(names,column1)

#Desired ouput
#names  column1  percentage   total1  total0
#John   1        0.667        2       1
#John   0        0.667        2       1
#John   1        0.667        2       1
#John   NA       0.667        2       1
#Rick   0        0.500        1       1
#Rick   1        0.500        1       1
#Katie  1        0.667        2       1  
#Katie  1        0.667        2       1
#Katie  0        0.667        2       1
#Harry  1        1.000        2       0
#Harry  1        1.000        2       0

Silhouettes
  • 145
  • 1
  • 10

2 Answers2

2

Using dplyr, here is one way :

library(dplyr)

df1 %>%
  group_by(names) %>%
  mutate(percentage = mean(column1, na.rm = TRUE), 
         total1 = sum(column1, na.rm = TRUE), 
         total0 = sum(column1 == 0, na.rm = TRUE))

#  names column1 percentage total1 total0
#   <chr>   <dbl>      <dbl>  <dbl>  <int>
# 1 John        1      0.667      2      1
# 2 John        0      0.667      2      1
# 3 John        1      0.667      2      1
# 4 John       NA      0.667      2      1
# 5 Rick        0      0.5        1      1
# 6 Rick        1      0.5        1      1
# 7 Katie       1      0.667      2      1
# 8 Katie       1      0.667      2      1
# 9 Katie       0      0.667      2      1
#10 Harry       1      1          2      0
#11 Harry       1      1          2      0
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
1

Here is a solution with base R:

df1 <- data.frame(names=c("John", "John", "John", "John", "Rick", "Rick", "Katie", "Katie", "Katie", "Harry", "Harry" ),
                  column1=c(1,0,1,NA,0,1,1,1,0,1,1))

df1$total1 <- ave(df1$column1, df1$names, FUN=function(x) sum(x, na.rm=TRUE))
df1$total0 <- ave(df1$column1==0, df1$names, FUN=function(x) sum(x, na.rm=TRUE))
df1$percentage <- with(df1, total1/(total1+total0))
df1

and here is a solution with data.table:

library("data.table") 
df1 <- data.table(names=c("John", "John", "John", "John", "Rick", "Rick", "Katie", "Katie", "Katie", "Harry", "Harry" ),
                  column1=c(1,0,1,NA,0,1,1,1,0,1,1))

df1[, ':='(total1=sum(column1, na.rm=TRUE), total0=sum(column1==0, na.rm=TRUE)), names][,percentage:=total1/(total1+total0)][]
jogo
  • 12,469
  • 11
  • 37
  • 42