0

I want to calculate sum(df$columnD==1)/(number of rows in each subset of columnE) but so far I cant even extract sum(df$columnD==1) within each subset of columnE in the following data frame:

set.seed(10)
A <- seq(from=1, to=100, by=1)
B <- runif(100, -5, 0.2)               # actually I have 900,000 rows
C <- runif(100, 0, 1)
D <- rbinom(100, 1, 0.3)
df <- NULL
df$columnA <- A
df$columnB <- B
df$columnC <- C
df$columnD <- D
df <- as.data.frame(df)
df$columnE <- cut(df$columnB, quantile(df$columnB,(0:10)/10), labels=FALSE,
                  include.lowest=TRUE) # https://www.portfolioprobe.com/2012/12/24/miles-of-iles/
index <- order(df$columnE, decreasing = F)
df <- df[index,]

I have tried the following and none works:

sum(df$columnD==1)[df$columnE==1]             # No
df$columnE[df$columnE==1][sum(df$columnD==1)] # Trying to extract only from subset 1
(sum(df$columnD==1)/sum(df$columnE==1))       # Nein

How do I get around this?

pha
  • 316
  • 2
  • 9
  • 2
    If `D` is binary, then `sum(df$columnD == 1)` is the same as `sum(columnD)`, and if you're then dividing by number of rows, it's `mean($columnD)`. So it sounds like you want the mean of `D` by `E`. With base R you can do `df$result = ave(df$columnD, df$columnE)`. With `dplyr` `df %>% group_by(columnE) %>% mutate(result = mean(columnD))`. – Gregor Thomas Sep 28 '20 at 18:52
  • (Above comment assumes you want to add it to the data frame, if you want a summary then `aggregate(columnD ~ columnE, df, mean)` or with `dplyr` replace the `mutate` with `summarize`.) – Gregor Thomas Sep 28 '20 at 18:54

1 Answers1

0

You can use a combination of group_by and summarise. I actually use sum(columnD) because there are only 0s and 1s in this column:

library(dplyr)
df %>% 
  group_by(columnE) %>% 
  summarise(metric = sum(columnD) / n())
# A tibble: 10 x 2
   columnE metric
     <int>  <dbl>
 1       1    0.4
 2       2    0.5
 3       3    0.4
 4       4    0.3
 5       5    0.5
 6       6    0.6
 7       7    0.4
 8       8    0.5
 9       9    0.6
10      10    0.2

Or here with mean (thanks @Gregor Thomas):

df %>% 
  group_by(columnE) %>% 
  summarise(metric = mean(columnD))
starja
  • 9,887
  • 1
  • 13
  • 28