unique values of one column for unique values of another column

Question

Consider following sample data frame.

> ww
  col1 col2
1    1    A
2    2    A
3    3    A
4    4    B
5    5    B
6    6    B
7    7    C
8    8    C
9    9    C
> dput(ww)
structure(list(col1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9), col2 = structure(c(1L, 
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor")), .Names = c("col1", 
"col2"), row.names = c(NA, -9L), class = "data.frame")

I want to know if each category of col2 has different values in col1 or not. In the end, I want to get an answer (TRUE or FALSE). TRUE (if all categories of col2 have completely different sets of values in col1), and FALSE (if there exists atleast 2 categories in col2 which have atleast 1 value in col1 common.

For above example, answer is TRUE since categories A, B and C don't have any value of col1 for them common. Values of col1 are 1,2,3 for A. Values of col1 are 4,5,6 for B. Values of col1 are 7,8,9 for C.

I can try splitting data the data frame by col2 and then saving values of col1 for each member then check for common values using intersect but that is kind of lengthy and inefficient process for a large data frame. Can somebody provide me with an efficient solution? Any data table solution would also do.

You could used `duplicated` to see if there are duplicates. I'm assuming two rows of `1 A` followed by `1 A` is the kind of condition you're looking for to create a false return. `duplicated` will return what values are duplicated (not including the original). Using it in conjunction with `any` could give you the functionality you are looking for if I am understanding. — giraffehere, Dec 09 '15 at 14:52
Can the categories of `col2` have duplicate values in `col1`, ie., could category A have two 1s? If not, the @giraffehere suggestion is spot on. In fact, you'd want to use `anyDuplicated` — Matthew Plourde, Dec 09 '15 at 14:56

Jaap · Accepted Answer · 2015-12-09T16:00:27.757

4

You can do this with:

library(data.table)
setDT(ww)[, un.col1 := uniqueN(col1)==.N, by = col2]

which gives:

> ww
   col1 col2 un.col1
1:    1    A   FALSE
2:    1    A   FALSE
3:    3    A   FALSE
4:    4    B    TRUE
5:    5    B    TRUE
6:    6    B    TRUE
7:    1    C    TRUE
8:    8    C    TRUE
9:    9    C    TRUE

Or as @giraffehere mentioned in the comments, a combination of duplicated and any with anyDuplicated:

setDT(ww)[, un.col1 := !anyDuplicated(col1), by = col2]

If you are looking for uniqueness between categories, as @MatthewPlourde suggested, then you can use:

setDT(ww)[, un.col2 := uniqueN(col2) > 1, by = col1]

which gives:

> ww
   col1 col2 un.col2
1:    1    A    TRUE
2:    1    A    TRUE
3:    3    A   FALSE
4:    4    B   FALSE
5:    5    B   FALSE
6:    6    B   FALSE
7:    1    C    TRUE
8:    8    C   FALSE
9:    9    C   FALSE

Used data:

ww <- structure(list(col1 = c(1, 1, 3, 4, 5, 6, 1, 8, 9),
                     col2 = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor")),
                .Names = c("col1", "col2"), row.names = c(NA, -9L), class = "data.frame")

edited Dec 09 '15 at 16:00

answered Dec 09 '15 at 14:50

Jaap

81,064
34
182
193

1

I don't believe this is what OP is looking for. As I understand it, he/she is looking for uniqueness between categories, not within. – Matthew Plourde Dec 09 '15 at 15:04
Mathew is correct. I am looking for uniqueness between categories. – user3664020 Dec 09 '15 at 15:14
@Jaap, I believe OP wants one boolean indicating whether all the categories have their own unique set of values. (not my DV) – Matthew Plourde Dec 09 '15 at 15:21
1

`setDT(ww)[, .(un = uniqueN(col2)), by=col1][, all(un == 1L)]` I think should work.. – Arun Dec 09 '15 at 15:27
@akrun I'm not duping your posts because it's you. – Jaap Dec 10 '15 at 13:25
@Jaap It's okay, but this seems to be a dupe. If not, you can reopen it. – akrun Dec 10 '15 at 13:25
@akrun Why did you delete your comment? Afraid someone could read it? – Jaap Dec 10 '15 at 13:37
@Jaap Not really. I just deleted it because it was not a comment to your answer, but about the dupe. – akrun Dec 10 '15 at 13:40

davechilders · Answer 2 · 2015-12-09T15:19:23.377

2

My understanding of the OP is that there should be a single TRUE/FALSE value for the entire data frame. Here is a dplyr solution:

library(dplyr)

ww %>%
  group_by(col1) %>%
  summarise(ndis = n_distinct(col2)) %>%
  summarise(all(ndis == 1))

edited Dec 09 '15 at 15:19

answered Dec 09 '15 at 15:10

davechilders

8,693
2
18
18

I think this is still missing the point. My reading of OP's post is that s/he wants to check for uniqueness between categories, not within. – Matthew Plourde Dec 09 '15 at 15:14
@MatthewPlourde - my solution does check for uniqueness *between* categories. – davechilders Dec 09 '15 at 15:17
I think OP just needs to clarify a little bit more about what he/she wants. – giraffehere Dec 09 '15 at 15:27

Matthew Plourde · Answer 3 · 2015-12-09T15:23:19.447

1

If there can only be unique values within a category, then all you need is

anyDuplicated(ww$col1) == 0

If there can be duplicates within a category, remove them with

ww2 <- ww[!duplicated(interaction(ww$col1, ww$col2)), ]

Then use anyDuplicated on col1

edited Dec 09 '15 at 15:23

answered Dec 09 '15 at 14:59

Matthew Plourde

43,932
7
96
113

unique values of one column for unique values of another column

3 Answers3