Conditional filtering and summarizing in R

Question

I recently moved from Stata and Excel to R, so I would appreciate help writing efficient code. I tried my best to research the answer before posting on SO.

Here's what my data looks like:

mydata<-data.frame(sassign$buyer,sassign$purch,sassign$total_)
str(mydata)
'data.frame':   50000 obs. of  3 variables:
 $ sassign.buyer : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 2 1 ...
 $ sassign.purch : num  10 3 2 1 1 1 1 11 11 1 ...
 $ sassign.total_: num  357 138 172 272 149 113 15 238 418 123 ...
head(mydata)
  sassign.buyer sassign.purch sassign.total_
1            no            10            357
2            no             3            138
3            no             2            172
4            no             1            272
5            no             1            149
6           yes             1            113

My objective is to find the average number of buyers among cases with more than 1 purchase.

So, here's what I did:

Method 1: Long method

library(psych)
check<-as.numeric(mydata$sassign.buyer)-1
myd<-cbind(mydata,check)
abcd<-psych::describe(myd[myd$sassign.purch>1,])
abcd$mean[4]

The output I got is 0.1031536697, which is correct.

@Sathish: Here's what check looks like:

head(check)
0 0 0 0 0 1

This served my purpose.

Pros of this method: It's easy and suits a beginner. Cons: too many. I need an extra variable (check), and the method feels clunky.

Side question: I noticed that, by default, functions don't show higher precision even though options(digits=10) is set. For instance, here's what I got from running:

psych::describe(myd[myd$sassign.purch>1,])


               vars     n   mean     sd median trimmed    mad min max range skew
sassign.buyer*    1 34880   1.10   0.30      1    1.00   0.00   1   2     1 2.61
sassign.purch     2 34880   5.14   3.48      4    4.73   2.97   2  12    10 0.65
sassign.total_    3 34880 227.40 101.12    228  226.13 112.68  30 479   449 0.09
check             4 34880   0.10   0.30      0    0.00   0.00   0   1     1 2.61
               kurtosis   se
sassign.buyer*     4.81 0.00
sassign.purch     -1.05 0.02
sassign.total_    -0.72 0.54
check              4.81 0.00

It was only when I ran

abcd$mean[4]

that I got 0.1031536697.

Method 2: Using dplyr

I tried pipes and nested function calls, but I finally gave up.

Method 2 | Try 1:

psych::describe(dplyr::filter(mydata,mydata$sassign.purch>1)[,dplyr::mutate(as.numeric(mydata$sassign.buyer)-1)])

Output:

Error in UseMethod("mutate_") : 
  no applicable method for 'mutate_' applied to an object of class "c('double', 'numeric')"

Method 2 | Try2: Using pipes:

mydata %>% mutate(newcol = as.numeric(sassign.buyer)-1) %>% dplyr::filter(sassign.purch>1) %>% summarise(meanpurch = mean(newcol))

This did work, and I got meanpurch= 0.1031537. However, I am still not sure about Try 1.

Any thoughts on why Try 1 isn't working?
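
A possible explanation for Try 1, offered as a sketch rather than a definitive diagnosis: dplyr::mutate() expects a data frame as its first argument, but in Try 1 it is handed the numeric vector as.numeric(mydata$sassign.buyer)-1, which matches the class named in the error message. The call also sits inside [ , ] as a column index, where it could not add a column anyway. The version below adds the column first and filters afterwards. The toy data and the names with_check, subset_gt1, and mean_check are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the asker's data (same column names, made-up values)
mydata <- data.frame(
  sassign.buyer = factor(c("no", "no", "yes", "no", "yes")),
  sassign.purch = c(10, 3, 1, 1, 11)
)

# mutate() takes the data frame first, then the new column definition
with_check <- dplyr::mutate(mydata, check = as.numeric(sassign.buyer) - 1)

# filter() refers to columns by bare name, without the mydata$ prefix
subset_gt1 <- dplyr::filter(with_check, sassign.purch > 1)

# Mean of the 0/1 indicator among rows with more than one purchase
mean_check <- mean(subset_gt1$check)
print(mean_check, digits = 10)
```

If the psych package is installed, psych::describe(subset_gt1) could then be applied to the filtered result, as in Method 1.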

Please try to [make this post reproducible](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). — shayaa, Jul 31 '16 at 00:48
Shayaa--I have edited the code..I hope this is reproducible now. Please let me know... — watchtower, Jul 31 '16 at 01:29
Sathish, Thanks for your reply. I have posted the output of head(check). Please let me know if you have questions. — watchtower, Jul 31 '16 at 01:37
Sathish, Thanks again for your quick response. I believe I am getting error in the last part of your command. — watchtower, Jul 31 '16 at 01:43

Sathish · Accepted Answer · 2016-07-31T01:35:58.743

Data:

> dt
# sassign.buyer sassign.purch sassign.total_
# 1            no            10            357
# 2            no             3            138
# 3            no             2            172
# 4            no             1            272
# 5            no             1            149
# 6           yes             1            113

Number of Buyers with purchases greater than 1

library(dplyr)

dt %>% 
  group_by(sassign.buyer) %>% 
  filter(sassign.purch > 1) 

# 
# Source: local data frame [3 x 3]
# Groups: sassign.buyer [1]
# 
# sassign.buyer sassign.purch sassign.total_
# (chr)         (int)          (int)
# 1            no            10            357
# 2            no             3            138
# 3            no             2            172

Average number of buyers with purchases greater than 1

dt %>% 
  group_by(sassign.buyer) %>% 
  filter(sassign.purch > 1) %>% 
  summarise(avg_no_buyers_gt_1 = length(sassign.buyer)/ nrow(dt))

# Source: local data frame [1 x 2]
# 
#       sassign.buyer avg_no_buyers_gt_1
#         (chr)              (dbl)
# 1            no             0.5

If no grouping of buyers is required,

dt %>%
  filter(sassign.purch > 1) %>% 
  summarise(avg_no_buyers_gt_1 = length(sassign.buyer)/ nrow(dt))

#   avg_no_buyers_gt_1
# 1                0.5
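
Note that length(sassign.buyer)/nrow(dt) gives the share of all rows that have more than one purchase. If the goal is instead the share of buyers among those rows, a minimal sketch along the lines of the mean() suggestion in the comments might look like this. The toy data and the column names n_cases and prop_buyers are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical toy data with the same column names as in the question
dt <- data.frame(
  sassign.buyer = factor(c("no", "no", "no", "no", "no", "yes", "yes", "no")),
  sassign.purch = c(10, 3, 2, 1, 1, 1, 11, 11)
)

result <- dt %>%
  filter(sassign.purch > 1) %>%                  # keep rows with more than one purchase
  summarise(
    n_cases     = n(),                           # number of rows that remain
    prop_buyers = mean(sassign.buyer == "yes")   # share of those rows that are buyers
  )

print(result)
```

The comparison sassign.buyer == "yes" yields a logical vector, and mean() treats TRUE as 1 and FALSE as 0, so no helper column is needed.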

edited Jul 31 '16 at 01:35

answered Jul 31 '16 at 01:27

Sathish


Hello Sathish, Thanks for your help, but when I ran the last part of the command mydata %>% filter(mydata$sassign.purch > 1) %>% summarise(avg_no_buyers_gt_1 = length(sassign.buyer)/nrow(mydata)), I get an error: "Error in UseMethod("summarise_") : no applicable method for 'summarise_' applied to an object of class "c('mts', 'ts')" Could you please help me? – watchtower Jul 31 '16 at 01:42
Yes, when I tried without grouping...I get an error. I am using mydata data frame. I'd appreciate your help. I have added a comment to your screenshot. – watchtower Jul 31 '16 at 01:46
Try this: `mydata %>% filter(sassign.purch > 1) %>% summarise(avg_no_buyers_gt_1 = length(sassign.buyer)/nrow(mydata))` – Sathish Jul 31 '16 at 01:48
Thanks Sathish! This worked...:). I will mark Simon's method as an answer just because it's short and sweet. I hope you won't mind it. I personally like your method, but I am thinking about someone who is visiting SO might be better off with Simon's method because it is fast. Thanks again for your help. – watchtower Jul 31 '16 at 01:51
Oh yes! You are correct. You will get the credit. My apologies. My brain is fried at the moment. – watchtower Jul 31 '16 at 01:55
Another option is to use the last suggestion but replace `length(sassign.buyer)/ nrow(dt)` with `mean(sassign.buyer == "yes")`, to get proportion of cases with more than one purchase who are also buyers. – Simon Jackson Jul 31 '16 at 03:08
@SimonJackson I had the solution this way, because, at that time. I was guessing the OP's requirement, whether grouping by buyer or not grouping by buyer. Your suggestion is valid when there is no grouping by buyer. – Sathish Jul 31 '16 at 03:41

score 2 · Answer 2 · answered Jul 31 '16 at 01:27

Finding the proportion of cases that suit a condition is easy to do with mean(). Here's a blog post explaining it: https://drsimonj.svbtle.com/proportionsfrequencies-with-mean-and-booleans, and here's a simple example:

buyer <- c("yes", "yes", "no", "no")
mean(buyer == "yes")
#> [1] 0.5

So in your case, you can do mean(d$sassign.buyer[d$sassign.purch > 1] == "yes"). Here's a worked example:

d <- data.frame(
  sassign.buyer = factor(c("yes", "yes", "no", "no")),
  sassign.purch = c(1, 10, 0, 200)
)
mean(d$sassign.buyer[d$sassign.purch > 1] == "yes")
#> [1] 0.5

This gets all cases where d$sassign.purch is greater than 1, and then computes the proportion (using mean()) of these cases in which d$sassign.buyer is equal to "yes".
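
To illustrate why this works, here is a small sketch showing that the mean of a logical vector equals the count of TRUE values divided by its length, and that it agrees with the as.numeric(...) - 1 indicator from Method 1 in the question. The data and the names keep, is_buyer, prop_mean, prop_manual, and prop_check are assumptions made for illustration.

```r
# Hypothetical toy data, not the asker's real data set
d <- data.frame(
  sassign.buyer = factor(c("yes", "yes", "no", "no", "yes")),
  sassign.purch = c(1, 10, 0, 200, 3)
)

keep     <- d$sassign.purch > 1               # rows with more than one purchase
is_buyer <- d$sassign.buyer[keep] == "yes"    # logical: is each kept row a buyer?

prop_mean   <- mean(is_buyer)                      # TRUE counts as 1, FALSE as 0
prop_manual <- sum(is_buyer) / length(is_buyer)    # same calculation spelled out

# Method 1 style indicator; it relies on the factor levels being ordered "no", "yes"
prop_check <- mean(as.numeric(d$sassign.buyer[keep]) - 1)

print(c(prop_mean, prop_manual, prop_check))
```

Comparing against "yes" directly does not depend on the order of the factor levels, which is one reason it may be the safer choice.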

answered Jul 31 '16 at 01:27

Simon Jackson


Simon, thank you for your help. I have marked Sathish's response as the answer because I was looking for dplyr method. I like your method because it's extremely fast and efficient. I hope you don't mind it. I thought of informing you. I am also going through your blog. So far, it's been fantastic! – watchtower Jul 31 '16 at 01:56
Thanks, @watchtower. Of course, Sathish's response is good. I'll add one comment to it that suggests using mean(). And thanks re blogR feedback! – Simon Jackson Jul 31 '16 at 03:04
