Group cases by shared values in R

Question

I have a dataset like this:

    case x y
      1  4 5
      2  4 5
      3  8 9
      4  7 9
      5  6 3
      6  6 3

I would like to create a grouping variable that takes the same value whenever both x and y are the same. I do not care what the value is, as long as it identifies the group. In my dataset, two cases with the same x and y probably belong to the same organization, and I want to see which organizations there are.

So my preferred dataset would look like this:

    case x y org
      1  4 5  1
      2  4 5  1
      3  8 9  2
      4  7 9  3
      5  6 3  4 
      6  6 3  4

How would I have to program this in R?

score 2 · Answer 1 · answered Oct 03 '17 at 00:26

Since you said you do not care what the value is, you can simply do the following:

    # Paste x and y into one key, then convert the factor codes to numbers
    dt$new <- as.numeric(as.factor(paste(dt$x, dt$y)))
    dt
    #>   case x y new
    #> 1    1 4 5   1
    #> 2    2 4 5   1
    #> 3    3 8 9   4
    #> 4    4 7 9   3
    #> 5    5 6 3   2
    #> 6    6 6 3   2
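
If the numbers should instead follow the order in which each (x, y) pair first appears, as in the question's desired output, one base R option is `match()` against the unique keys. This is a minimal sketch and not part of the original answer; the helper variable `key` and the column name `org` are chosen here for illustration.

```r
# Example data from the question
dt <- data.frame(case = 1:6,
                 x = c(4, 4, 8, 7, 6, 6),
                 y = c(5, 5, 9, 9, 3, 3))

# Build one key per row from x and y
key <- paste(dt$x, dt$y)

# match() gives the position of each key among the unique keys,
# so groups are numbered in order of first appearance
dt$org <- match(key, unique(key))
dt$org
#> [1] 1 1 2 3 4 4
```

Because `unique()` keeps values in the order they occur, no sorting of the data is needed for this numbering.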

BENY


www · Accepted Answer · 2017-10-04T13:41:41.093

A solution using group_indices from dplyr.

    library(dplyr)

    # Number each distinct (x, y) combination
    dt2 <- dt %>%
      mutate(org = group_indices(., x, y))

    dt2
    #>   case x y org
    #> 1    1 4 5   1
    #> 2    2 4 5   1
    #> 3    3 8 9   4
    #> 4    4 7 9   3
    #> 5    5 6 3   2
    #> 6    6 6 3   2
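
As far as I know, passing grouping columns to `group_indices()` in this way is deprecated from dplyr 1.0.0 onward, and `cur_group_id()` inside a grouped `mutate()` is the suggested replacement. The following is a sketch under that assumption, not the answer's original code; the data frame is re-created so the block runs on its own.

```r
library(dplyr)

# Example data from the question
dt <- data.frame(case = 1:6,
                 x = c(4, 4, 8, 7, 6, 6),
                 y = c(5, 5, 9, 9, 3, 3))

dt2 <- dt %>%
  group_by(x, y) %>%                 # one group per (x, y) combination
  mutate(org = cur_group_id()) %>%   # integer id of the current group
  ungroup()

dt2$org
#> [1] 1 1 4 3 2 2
```

The ids follow the sorted order of the grouping keys, which is why they match the output above rather than the order of appearance.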

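If the group numbers need to follow the order of the rows, we can apply rleid from the data.table package after creating the org column, as follows. Note that rleid numbers runs of consecutive rows, so identical (x, y) pairs that are not adjacent will receive different numbers.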

    library(dplyr)
    library(data.table)

    # Number the groups, then renumber them by consecutive runs
    dt2 <- dt %>%
      mutate(org = group_indices(., x, y)) %>%
      mutate(org = rleid(org))
    dt2
    #>   case x y org
    #> 1    1 4 5   1
    #> 2    2 4 5   1
    #> 3    3 8 9   2
    #> 4    4 7 9   3
    #> 5    5 6 3   4
    #> 6    6 6 3   4
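
One caveat, raised in the comments on this answer: run-length ids only change when the value differs from the previous row, so a key that reappears later gets a new number. The base R sketch below imitates that behaviour with `rle()` (used here as a stand-in for `data.table::rleid`, which is not called) and contrasts it with `match()`. The vector `key` is the A,A,B,A,C,D example from the comments; the variable names are illustrative.

```r
key <- c("A", "A", "B", "A", "C", "D")

# Run-length ids: a new id every time the value changes from the previous element
runs <- rle(key)
run_id <- rep(seq_along(runs$lengths), runs$lengths)
run_id
#> [1] 1 1 2 3 4 5

# Ids by first appearance: every "A" gets the same id
first_id <- match(key, unique(key))
first_id
#> [1] 1 1 2 1 3 4
```

If identical pairs may be scattered through the data, the `match()` form, or sorting the rows before taking run-length ids, avoids splitting one group into several.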

Update

Here is how to sort the rows by a column using arrange from dplyr.

    library(dplyr)

    # Sort the rows by x in ascending order
    dt %>%
      arrange(x)
    #>   case x y
    #> 1    1 4 5
    #> 2    2 4 5
    #> 3    5 6 3
    #> 4    6 6 3
    #> 5    4 7 9
    #> 6    3 8 9

We can also sort by more than one column, such as arrange(x, y), or use desc to reverse the order, like arrange(desc(x)).
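
A short sketch of both variants on the example data. The object names `by_xy` and `by_x_desc` are made up for illustration, and the data frame is re-created so the block runs on its own.

```r
library(dplyr)

# Example data from the question
dt <- data.frame(case = 1:6,
                 x = c(4, 4, 8, 7, 6, 6),
                 y = c(5, 5, 9, 9, 3, 3))

# Sort by x, breaking ties by y
by_xy <- dt %>% arrange(x, y)

# Sort by x in descending order
by_x_desc <- dt %>% arrange(desc(x))

by_xy$x
#> [1] 4 4 6 6 7 8
by_x_desc$x
#> [1] 8 7 6 6 4 4
```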

DATA

    dt <- read.table(text = "case x y
                             1    4 5
                             2    4 5
                             3    8 9
                             4    7 9
                             5    6 3
                             6    6 3",
                     header = TRUE)

edited Oct 04 '17 at 13:41

answered Oct 03 '17 at 00:18


If you are using `rleid`, it is better to `arrange` the dt before mutating the new column ~ :) – BENY Oct 03 '17 at 00:28

@Wen I feel like `arrange` may not be what the OP wants, since the OP's example output does not consider the order of both `x` and `y`. But it is still valuable information for the OP to consider and think about. – www Oct 03 '17 at 00:30
Dude, try this example and you will get what I mean: A,A,B,A,C,D. `rleid` will return 1,1,2,3,4,5 – BENY Oct 03 '17 at 00:34

@Wen Thanks for pointing that out. I was not thinking about that. I have updated my code to use `group_indices` from `dplyr`, but now I am thinking about how exactly to generate the output the OP wants. – www Oct 03 '17 at 00:39
`group_indices` great experience on `dplyr` :) upvoted – BENY Oct 03 '17 at 00:39

@Wen Thanks for the upvote. I think perhaps we can use `rleid` after `group_indices`. That will generate the exact output the OP wants and avoid the hypothetical situation you pointed out. – www Oct 03 '17 at 00:43
Thank you guys! I did indeed encounter the problem Wen pointed out. How can I overcome this? How did you update your code? I have now managed to categorize my data, so thank you so much for that. But it would be even prettier to get it arranged. – Boaz Kaarsemaker Oct 04 '17 at 03:29
@BoazKaarsemaker Not sure if this is what you want, but please see my updates. – www Oct 04 '17 at 13:42
