2

I have a simple dataframe with group IDs and elements of each group, like this:

x <- data.frame("ID" = c(1,1,1,2,2,2,3,3,3), "Values" = c(3,5,7,2,4,5,2,4,6))

Each ID may have a different number of elements. Now I want to find all IDs whose elements do not overlap with those of other IDs. In this example, ID 1 and ID 3 would be selected because their elements are distinct (3, 5, 7 vs. 2, 4, 6). I also want to copy these IDs and their elements into a new data frame with the same structure as the original.

How would I do that in R? My skills with R are quite limited.

Thank you very much!

Bests,

Len Lab
  • Welcome to SO. To get help on this site, it's best to include your data in a way that's easy for someone to copy and paste into R, for example with `dput(data)`. You should also include an example of what your desired outcome looks like, as it can be tricky to tell just from reading. Also, include the code you have tried so far. Check out [this link](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for more info. – astrofunkswag Apr 04 '20 at 19:13

3 Answers

1

This seems like a good question for igraph cliques with one edge to another clique, but I can't seem to wrap my head around how to use it.

Anyway, here is an option using data.table: a join identifies the IDs that share Values with the current ID, and an anti-join then removes those IDs:

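library(data.table)
DT <- as.data.table(x)
for (i in DT[, unique(ID)]) {
    # join on Values to find the other IDs that share a value with ID i
    dupeID <- DT[DT[ID==i], on=.(Values), .(ID=unique(x.ID[x.ID!=i.ID]))]
    # anti-join: drop every row belonging to those overlapping IDs
    DT <- DT[!dupeID , on=.(ID)]
}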

output:

   ID Values
1:  1      3
2:  1      5
3:  1      7
4:  3      2
5:  3      4
6:  3      6
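
For readers without data.table, the same greedy idea can be sketched in base R. This is a minimal sketch, not the code from this answer: the function name `keep_disjoint_ids` is hypothetical, and it assumes the columns are called `ID` and `Values` as in the question.

```r
# Base R sketch of the greedy filter; `keep_disjoint_ids` is a hypothetical name.
keep_disjoint_ids <- function(df) {
  for (i in unique(df$ID)) {
    # Values held by ID i (empty if i was already dropped in an earlier pass)
    vals <- df$Values[df$ID == i]
    # other IDs that share at least one of those Values
    clash <- unique(df$ID[df$Values %in% vals & df$ID != i])
    # drop every row that belongs to an overlapping ID
    df <- df[!df$ID %in% clash, , drop = FALSE]
  }
  df
}

x <- data.frame(ID = c(1,1,1,2,2,2,3,3,3), Values = c(3,5,7,2,4,5,2,4,6))
res <- keep_disjoint_ids(x)
unique(res$ID)  # 1 3
```

Because IDs are visited in order of appearance, an earlier ID wins over any later ID it overlaps with, so a different row order may give a different (still non-overlapping) selection.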
chinsoon12
  • This works. I tried with a large dataset and it gave me IDs with distinct values. Thanks a lot! – Len Lab Apr 07 '20 at 17:28
0

You can try the following code, where y is a list of data frames (each one collects the IDs whose Values do not overlap with those of the first ID in it):

xs <- split(x, x$ID)
y <- list()
ids <- seq_along(xs)
repeat {
  if (length(ids) == 0) break
  # start a new batch with the first remaining group
  p <- ids[[1]]
  y[[length(y) + 1]] <- xs[[p]]
  qs <- p
  for (q in ids[-1]) {
    # add group q when it shares no Values with group p
    if (length(intersect(xs[[p]]$Values, xs[[q]]$Values)) == 0) {
      y[[length(y)]] <- rbind(y[[length(y)]], xs[[q]])
      qs <- c(qs, q)
    }
  }
  # drop the groups that have already been placed in a batch
  ids <- setdiff(ids, qs)
}

Example

x <- data.frame("ID" = c(1,1,1,2,2,2,3,3,3,4,4), 
                "Values" = c(3,5,7,2,4,5,2,4,6,1,3))

> x
   ID Values
1   1      3
2   1      5
3   1      7
4   2      2
5   2      4
6   2      5
7   3      2
8   3      4
9   3      6
10  4      1
11  4      3

then you will get

> y
[[1]]
  ID Values
1  1      3
2  1      5
3  1      7
7  3      2
8  3      4
9  3      6

[[2]]
   ID Values
4   2      2
5   2      4
6   2      5
10  4      1
11  4      3
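
The loop above compares each candidate group only with the first group of its batch, so two groups added to the same batch may still overlap with each other. A possible variant, assuming the goal is batches whose groups are all mutually non-overlapping, tracks every value already placed in the batch. The function name `group_disjoint` and the variable `seen` are hypothetical.

```r
# Sketch: build batches in which no two groups share a value.
# `group_disjoint` is a hypothetical helper, not part of the code above.
group_disjoint <- function(df) {
  xs <- split(df, df$ID)
  ids <- seq_along(xs)
  y <- list()
  while (length(ids) > 0) {
    picked <- ids[1]                 # first remaining group starts the batch
    seen <- xs[[picked]]$Values      # all Values already in this batch
    for (q in ids[-1]) {
      # accept group q only if none of its Values is already in the batch
      if (!any(xs[[q]]$Values %in% seen)) {
        picked <- c(picked, q)
        seen <- c(seen, xs[[q]]$Values)
      }
    }
    y[[length(y) + 1]] <- do.call(rbind, unname(xs[picked]))
    ids <- setdiff(ids, picked)
  }
  y
}

x <- data.frame(ID = c(1,1,1,2,2,2,3,3,3,4,4),
                Values = c(3,5,7,2,4,5,2,4,6,1,3))
y2 <- group_disjoint(x)
lapply(y2, function(d) unique(d$ID))  # IDs 1 and 3, then IDs 2 and 4
```

On this example the batches match the output shown above; the two approaches differ only when a later group is compatible with the first group of a batch but clashes with another group already in it.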
ThomasIsCoding
  • Thanks a lot! Similar to what I wrote for James Curran's answer, the result shows me which IDs are really distinct compared to ID 1, but these IDs may have overlapping values among themselves. I hope my comment makes sense. – Len Lab Apr 04 '20 at 21:55
  • @LenLab could you show the expected output regarding your comment? – ThomasIsCoding Apr 04 '20 at 22:00
  • Hi, I put a text online where I show the output. Please check it here: https://anotepad.com/notes/xqra9kdk – Len Lab Apr 05 '20 at 09:59
0
x <- data.frame("ID" = c(1,1,1,2,2,2,3,3,3), "Values" = c(3,5,7,2,4,5,2,4,6))
gps = split(x, x$ID)
nGroups = length(gps)

results = data.frame(ID = NULL, Values = NULL) 

# compare every pair of groups (i, j) with i < j
for(i in seq_len(max(nGroups - 1, 0))){
  j = i + 1
  while(j <= nGroups){
    # keep both groups when they have no Values in common
    if(length(intersect(gps[[i]]$Values, gps[[j]]$Values)) == 0){
      print(c(i,j))
      results = rbind(results, gps[[i]], gps[[j]]) 
    }
    j = j + 1
  }
}
results
> results
  ID Values
1  1      3
2  1      5
3  1      7
7  3      2
8  3      4
9  3      6
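
The loop above appends both groups every time it finds a non-overlapping pair, so a group that is disjoint from several others would be added to `results` more than once. One way to avoid that, sketched here under the assumption that there are at least two IDs and that the columns are named `ID` and `Values`, is to collect the IDs of all disjoint pairs first and subset the original data once. The names `vals`, `pairs`, `disjoint` and `ids` are illustrative only.

```r
x <- data.frame(ID = c(1,1,1,2,2,2,3,3,3), Values = c(3,5,7,2,4,5,2,4,6))

# one vector of Values per ID (list names are the IDs as character strings)
vals <- split(x$Values, x$ID)

# every unordered pair of IDs, one pair per column
pairs <- combn(names(vals), 2)

# TRUE for the pairs whose Values do not intersect
disjoint <- apply(pairs, 2, function(p) {
  length(intersect(vals[[p[1]]], vals[[p[2]]])) == 0
})

# IDs that take part in at least one disjoint pair, each listed once
ids <- unique(as.vector(pairs[, disjoint, drop = FALSE]))

# subset once, so no group is duplicated in the result
results <- x[x$ID %in% ids, ]
results
```

This keeps the pairwise definition used in the loop, so two selected IDs may still overlap with each other as long as each is disjoint from some other ID.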
James Curran
  • As I understand it, the code goes through each ID and compares its values with those of all other IDs. But I am not sure this is what I need. For example, the result shows me which IDs are really distinct compared to ID 1, but these IDs may have overlapping values among themselves. I hope my comment makes sense. – Len Lab Apr 04 '20 at 21:45
  • I think it does what you ask; however, as it stands it does not check whether the result set already contains the values stored for the two IDs in question. – James Curran Apr 04 '20 at 21:50