2

Consider following sample data frame.

> ww
  col1 col2
1    1    A
2    2    A
3    3    A
4    4    B
5    5    B
6    6    B
7    7    C
8    8    C
9    9    C
> dput(ww)
structure(list(col1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9), col2 = structure(c(1L, 
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor")), .Names = c("col1", 
"col2"), row.names = c(NA, -9L), class = "data.frame")

I want to know if each category of col2 has different values in col1 or not. In the end, I want to get an answer (TRUE or FALSE). TRUE (if all categories of col2 have completely different sets of values in col1), and FALSE (if there exists atleast 2 categories in col2 which have atleast 1 value in col1 common.

For above example, answer is TRUE since categories A, B and C don't have any value of col1 for them common. Values of col1 are 1,2,3 for A. Values of col1 are 4,5,6 for B. Values of col1 are 7,8,9 for C.

I can try splitting data the data frame by col2 and then saving values of col1 for each member then check for common values using intersect but that is kind of lengthy and inefficient process for a large data frame. Can somebody provide me with an efficient solution? Any data table solution would also do.

user3664020
  • 2,980
  • 6
  • 24
  • 45
  • You could used `duplicated` to see if there are duplicates. I'm assuming two rows of `1 A` followed by `1 A` is the kind of condition you're looking for to create a false return. `duplicated` will return what values are duplicated (not including the original). Using it in conjunction with `any` could give you the functionality you are looking for if I am understanding. – giraffehere Dec 09 '15 at 14:52
  • 1
    Can the categories of `col2` have duplicate values in `col1`, ie., could category A have two 1s? If not, the @giraffehere suggestion is spot on. In fact, you'd want to use `anyDuplicated` – Matthew Plourde Dec 09 '15 at 14:56

3 Answers3

4

You can do this with:

library(data.table)
setDT(ww)[, un.col1 := uniqueN(col1)==.N, by = col2]

which gives:

> ww
   col1 col2 un.col1
1:    1    A   FALSE
2:    1    A   FALSE
3:    3    A   FALSE
4:    4    B    TRUE
5:    5    B    TRUE
6:    6    B    TRUE
7:    1    C    TRUE
8:    8    C    TRUE
9:    9    C    TRUE

Or as @giraffehere mentioned in the comments, a combination of duplicated and any with anyDuplicated:

setDT(ww)[, un.col1 := !anyDuplicated(col1), by = col2]

If you are looking for uniqueness between categories, as @MatthewPlourde suggested, then you can use:

setDT(ww)[, un.col2 := uniqueN(col2) > 1, by = col1]

which gives:

> ww
   col1 col2 un.col2
1:    1    A    TRUE
2:    1    A    TRUE
3:    3    A   FALSE
4:    4    B   FALSE
5:    5    B   FALSE
6:    6    B   FALSE
7:    1    C    TRUE
8:    8    C   FALSE
9:    9    C   FALSE

Used data:

ww <- structure(list(col1 = c(1, 1, 3, 4, 5, 6, 1, 8, 9),
                     col2 = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor")),
                .Names = c("col1", "col2"), row.names = c(NA, -9L), class = "data.frame")
Jaap
  • 81,064
  • 34
  • 182
  • 193
2

My understanding of the OP is that there should be a single TRUE/FALSE value for the entire data frame. Here is a dplyr solution:

library(dplyr)

ww %>%
  group_by(col1) %>%
  summarise(ndis = n_distinct(col2)) %>%
  summarise(all(ndis == 1))
davechilders
  • 8,693
  • 2
  • 18
  • 18
1

If there can only be unique values within a category, then all you need is

anyDuplicated(ww$col1) == 0

If there can be duplicates within a category, remove them with

ww2 <- ww[!duplicated(interaction(ww$col1, ww$col2)), ]

Then use anyDuplicated on col1

Matthew Plourde
  • 43,932
  • 7
  • 96
  • 113