Filter by number of duplicates, in R

Question

I am utilising a dataset that includes one column that comprises of phone numbers, and one column that includes yes or no.

  phone.number yes.no
1   7539339393      Y
2    111111111      N
3   1234567890      Y
4   7539339393      N
5   1222222222      Y
6   3333333333      N
7   4444444444      Y
8   5555555555      N
9   7539339393      Y

dput below:

structure(list(phone.number = structure(c(7L, 1L, 3L, 7L, 2L, 
    4L, 5L, 6L, 7L), .Label = c("111111111", "1222222222", "1234567890", 
    "3333333333", "4444444444", "5555555555", "7539339393"), class = "factor"), 
        yes.no = structure(c(2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("N", 
        "Y"), class = "factor")), .Names = c("phone.number", "yes.no"
    ), row.names = c(NA, -9L), class = "data.frame")

There are two things that I'd really appreciate some help with.

I am trying to create a subset that includes all phone numbers that have been repeated 3 or more times within the dataset. So, in the example provided, the number 7539339393 is repeated 3 times, the subset that I'd like to make would just include these observations and the corresponding yes and no values.
I am also trying to subset for all phone numbers that have returned yes and no. So, within the dataset provided, 7539339393 has corresponded to both yes and no, I'd like to create a subset with all of the phone numbers that return both and all of the corresponding yes and no values

Hi, welcome to SO. Please provide a reproducible example. This includes copy-pasting some data (nobody is going to type in the data from your image) and what you've already tried. — Heroka, Sep 14 '15 at 13:22
Hi Heroka, I'll do that now, thanks for letting me know (still new here) — Lyriss, Sep 14 '15 at 13:24
Unless I'm missing something, your link goes to an image. Not useful. *Provide your data*. Either upload as, say, csv, and provide a link, or (better) import into R and provide the output of `dput(mydata)`, directly in your quesiton. — jlhoward, Sep 14 '15 at 13:31
I've editted the question with an update, is this more helpful? — Lyriss, Sep 14 '15 at 13:35
Well done with the edit and inclusion of the data! What have you already tried yourself to fix your problem? — Heroka, Sep 14 '15 at 13:38
For 1) try `df[df$phone.number %in% names(which(table(df$phone.number) >= 3)),]`. For 2) `df[as.logical(ave(as.character(df$yes.no), df$phone.number, FUN= function(x) all(c("Y", "N") %in% x))),]` — Pierre L, Sep 14 '15 at 13:43
To be honest, I'm a bit stumped. I've been googling left right and centre, but I can't seem to make sense of it. I guess my issue is that I know how to analyse something within a row, but between rows seems confusing. — Lyriss, Sep 14 '15 at 13:44

Pierre L · Accepted Answer · 2015-09-14T13:51:53.170

In base R,

For 1:

df[df$phone.number %in% names(which(table(df$phone.number) >= 3)),]
#  phone.number yes.no
#1   7539339393      Y
#4   7539339393      N
#9   7539339393      Y

We subset based on phone numbers that appear 3 or more times. table takes a tally of all elements in the vector.

For 2:

df[as.logical(ave(as.character(df$yes.no), df$phone.number, FUN= function(x) all(c("Y", "N") %in% x))),]
#  phone.number yes.no
#1   7539339393      Y
#4   7539339393      N
#9   7539339393      Y

There are nice grouping functions in dplyr and data.table, this is one base R approach that may either highlight the strength of base R or the advantage of other packages depending on how you look at it. It is a bit verbose as a one liner. Working from the inside out, all is used to check if yes and no appear. The function ave helps to complete the operation by unique phone number.

score 1 · Answer 2 · answered Sep 14 '15 at 13:44

Possible dplyr approaches..

For 1.:

df %>% group_by(phone.number) %>% filter(n() >= 3)
#Source: local data frame [3 x 2]
#Groups: phone.number [1]
#
#  phone.number yes.no
#        (fctr) (fctr)
#1   7539339393      Y
#2   7539339393      N
#3   7539339393      Y

(n() returns the number of rows per group in dplyr)

For 2.:

df %>% group_by(phone.number) %>% filter(all(c("Y", "N") %in% yes.no))
#Source: local data frame [3 x 2]
#Groups: phone.number [1]
#
#  phone.number yes.no
#        (fctr) (fctr)
#1   7539339393      Y
#2   7539339393      N
#3   7539339393      Y

For 1. + 2. combined:

df %>% group_by(phone.number) %>% filter(n() >= 3 & all(c("Y", "N") %in% yes.no))
#Source: local data frame [3 x 2]
#Groups: phone.number [1]
#
#  phone.number yes.no
#        (fctr) (fctr)
#1   7539339393      Y
#2   7539339393      N
#3   7539339393      Y

jlhoward · Answer 3 · 2015-09-14T14:00:38.777

0

Here's a data.table solution, calling your dataset df:

library(data.table)
setDT(df)[,.SD[.N>2],by=phone.number]
#    phone.number yes.no
# 1:   7539339393      Y
# 2:   7539339393      N
# 3:   7539339393      Y

For the second part:

df[,.SD[length(unique(yes.no))>1], by=phone.number]

Note that setDT(df) converts df to a data.table, so you don't have to call it again in the second part.

This approach is likely to be faster than any other method, although unless your data set is extremely large you might not notice the difference.

edited Sep 14 '15 at 14:00

answered Sep 14 '15 at 13:55

jlhoward

58,004
7
97
140

The second part might not work as expected if there could be `NA` entries in the column (not clear from the example but just in case). – talat Sep 14 '15 at 14:14

Filter by number of duplicates, in R

3 Answers3