4

I have a dataset that consists of two columns: one consists of a list of names and the other identifies a group to which the corresponding name belongs (example data and code for reproducing example data below).

group   name
1   aaa
1   aaa
1   aaa
2   NA
2   NA
3   NA
3   bbb
4   ccc
4   ccc
5   ddd
5   eee
6   fff
6   ffg

> x <- c(1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6)
> y <- c("aaa", "aaa", "aaa", NA, NA, NA, "bbb", "ccc", "ccc", "ddd", "eee", "fff", "ffg")
> mydataframe <- data.frame( "group" = x, "name" = y)

The dataset consists of more than 500,000 rows (as well as other columns which are irrelevant to this question). Groups may include may have one member (one row) or many members (multiple rows). When a group has multiple members, the names of the members may be: a) the same (e.g., group 1 where all members are named "aaa"); b) blank (e.g., group 2 where all members are NA); c) a mix of names and blanks (e.g., group 3 that includes one blank name and one name "bbb"); or d) different names (e.g., group 5 that has "ddd" and "eee"). When there are different names they may be very different (e.g., group 5) or similar (group 6)

I need to filter the dataset to identify rows that fall into category d - those rows that have the same group number but different (non-NA) names (in the example below, groups 5 and 6).

The filter would produce a new dataset that looks looks like this:

group   name
5   ddd
5   eee
6   fff
6   ffg

Bonus points if someone can help me not only produce the result above, but also distinguish between groups that have members with totally different names (group 5) and groups that have members with similar, but different names (group 6).

Thank you in advance!

talat
  • 68,970
  • 21
  • 126
  • 157
David S
  • 107
  • 5

2 Answers2

5

Here's a possible data.table v >= 1.9.5 solution

library(data.table)
setDT(mydataframe)[, .SD[uniqueN(na.omit(name)) == .N], by = group]
#    group name
# 1:     5  ddd
# 2:     5  eee
# 3:     6  fff
# 4:     6  ffg

Basically we are comparing the number of unique groups (while NAs removed) to the actual group size (.N) and subsetting it out of .SD (which stands for Sub Data) which is our actual data set subseted by condition.


Alternatively with dplyr where we using n_distinct and n() instead (for some reason this solution seem not to work without converting to data.table object first using setDT(mydataframe), so if you haven't done so yet, do it before running this code)

library(dplyr)
mydataframe %>%
  group_by(group) %>%
  filter(n_distinct(na.omit(name)) == n())

# Source: local data table [4 x 2]
# Groups: group
# 
#   group name
# 1     5  ddd
# 2     5  eee
# 3     6  fff
# 4     6  ffg

For the bonus points you could also try agrep (you can set the max.distance as you wish- 1 is the default). 1- means similar, NA- means no similarity found.

mydataframe %>%
  group_by(group) %>%
  filter(n_distinct(na.omit(name)) == n()) %>%
  mutate(similarity = agrep(name[1L], name[-1L]))

# Source: local data table [4 x 3]
# 
#   group name similarity
# 1     5  ddd         NA
# 2     5  eee         NA
# 3     6  fff          1
# 4     6  ffg          1
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • Thanks. I'm getting errors with both approaches. The data.table approach returns: Error in eval(expr, envir, enclos) : could not find function "uniqueN". The dplyr approach returns: "Error: Input to n_distinct() must be a single variable name from the data set" Any thoughts? I'd love to be able to use the dplyr method so I can pick up your agrep solution as well. – David S May 03 '15 at 16:40
  • 2
    I've mentioned that for the `data.table` solution you will need to download it from GitHub (v 1.9.5). Re the `dplyr` solution I've also mentioned that you will need to convert your data to `data.table` first. Run `setDT(mydataframe)` and then run the `dplyr` solution. – David Arenburg May 03 '15 at 17:46
  • 1
    @DavidS Fyi, `agrep` and `agrepl` are base functions, so you can use that part of DavidA's answer anywhere. For example, `res <- udt[multrows]; res[,sim:=agrepl(name[1L],name[2L]),by=group]`. I guess this step only makes sense if you have exactly two names per group. I didn't mention it in my answer because I figured this answer had nailed the right approach to that issue already. – Frank May 03 '15 at 20:25
  • 1
    @Frank it's not limited to the number of members it just compares everything against the first observation (which I figured suits the OP). And I also not sure what you mean by "*nailed the right approach*". There are many possible approaches to address any problem, and the fact that yours didn't require `data.table` v 1.9.5 doesn't make it "*right*". – David Arenburg May 03 '15 at 20:34
  • @DavidArenburg Oh, I see -- that's why it's `-1L`. Let me explain: I think that `agrep` is the right approach for the definition of similarity the OP had in mind. (So the comment is not a comparison with my answer at all, except that my answer lacks that "bonus" ... but, for the record, I think your answer is fine and didn't add my own thinking it was any sort of improvement on yours.) – Frank May 03 '15 at 20:46
  • Thanks to DavidArenburg and Frank. @DavidArenburg: apologies for not reading your preamble and jumping right to the code. My error. You've both helped me a lot and I've learned something new about both dplyr and data.table. – David S May 04 '15 at 12:40
3

Here's another data.table approach:

require(data.table)
udt <- data.table(na.omit(unique(mydataframe)))

multrows <- udt[,.I[.N>1],by=group]$V1
udt[multrows]

How it works: .N is the number of rows in the group, equal to the number of names; and .I is the row number in udt. So, the new variable .I[.N>1] is either (i) empty if there is only one name (.N==1) or (ii) the row numbers for the group if there are multiple names.

The new variable is named V1 by default, so $V1 extracts it. And we can subset udt by selecting these relevant row numbers.


I copied this method from an answer by @eddi. The two steps can also be combined: udt[udt[,.I[.N>1],by=group]$V1].

Community
  • 1
  • 1
Frank
  • 66,179
  • 8
  • 96
  • 180