Easiest way to create unique, transitive, ids based on existing ids

Question

Suppose I have a dataset that looks like the following:

obs id1 id2
1   a   1
2   b   2
3   c   2
4   d   3
5   e   4
6   b   5
7   f   6

I want to create a unique transitive id variable that for this dataset. Both id1 and id2 are used to identify individuals. So if individual X has the same id1 as individual Y or the same id2 as individual Y, then X=Y.

So, in this example, the intended output would look like this:

obs id1 id2 uniqid
1   a   1   1
2   b   2   2
3   c   2   2
4   d   3   3
5   e   4   4
6   b   5   2
7   f   6   5

Here, observation 6 has id1 "b", which was already assigned uniqid 2 (by observation 2), and so, observation 6 identifies the same individual as observation 2.

Now, comparing observation 3 and 6, we see that these observations share neither id1 nor id2, but still identifies the same individual, since they both identify the same individual as observation 2.

I am currently working in Stata and I was wondering what is the best way to go about doing this. I would prefer a Stata based solution, but I would also be interested in seeing R or Python solutions.

Are these all the same? `a 1`, `b 1`, `b 2`, `c 2`. The first has nothing in common to the last one, but by transitivity the should share the same id. — R. Schifini, Apr 20 '19 at 19:12
Try `library(igraph); g <- graph_from_data_frame(dat[, -1]); cg <- clusters(g)$membership; dat$unique_id <- cg[dat$id1]; dat` Where `dat` is the name of your data. — markus, Apr 20 '19 at 19:51
Related / dupe: [Make a group_indices based on several columns](https://stackoverflow.com/questions/45079559/make-a-group-indices-based-on-several-columns) — markus, Apr 20 '19 at 19:58

score 2 · Answer 1 · 2019-04-21T13:04:03.143

The community-contributed Stata command group_id does what you want.

However, it leaves the non-matched identifiers unaltered:

clear
input obs str1 id1 id2
1   a   1
2   b   2
3   c   2
4   d   3
5   e   4
6   b   5
7   f   6
end

clonevar id = id2 
group_id id, match(id1)
sort obs 

list, separator(0)

     +----------------------+
     | obs   id1   id2   id |
     |----------------------|
  1. |   1     a     1    1 |
  2. |   2     b     2    2 |
  3. |   3     c     2    2 |
  4. |   4     d     3    3 |
  5. |   5     e     4    4 |
  6. |   6     b     5    2 |
  7. |   7     f     6    6 |
     +----------------------+

Here's another example to see that this is indeed the case:

clear
input obs str1 id1 id2
1   a   1
2   b   2
3   c   7
4   d   3
5   e   4
6   b   5
7   f   1
end

clonevar id = id2 
group_id id, match(id1)
sort obs

list, separator(0)

     +----------------------+
     | obs   id1   id2   id |
     |----------------------|
  1. |   1     a     1    1 |
  2. |   2     b     2    2 |
  3. |   3     c     7    7 |
  4. |   4     d     3    3 |
  5. |   5     e     4    4 |
  6. |   6     b     5    2 |
  7. |   7     f     1    1 |
     +----------------------+

If you want to have continuous identifiers after the match (that is 5 instead of 6 in the first example and 7 in the second), you will have to adjust those manually.

score 1 · Accepted Answer · answered Apr 21 '19 at 15:27

The community-contributed group_twoway (available in SSC) is a direct solution for your issue. You should make sure that id1 and id2 having the same type (numeric or string) and there are no overlap values in their range. Below code works for your example.

ssc install group_twoway
```
gen id2_str = string(id2)
group_twoway id1 id2_str, gen(id)
```
One more step added to Pearly's solution (with group_id) would make it serving for your issue without manual adjustment.
```
clonevar x = id1 
group_id x, match(id2)

egen id = group(x)
```

If you might feel these packages are too "black box", below code is the ... coding language for your logic desire.

egen x0 = group(id1)
egen x1 = min(x0), by(id2)

local i=1
while r(N) <_N {
egen x`=`i'+1' = min(x`i'), by(x`=`i'-1')
count if x`i++' == x`i'
}

egen id = group(x`i')
drop x*

By the way, using `egen` after `group_id` _is_ manual adjustment. — , Apr 21 '19 at 15:54
Thanks for your encouragement. Both `group-twoway` and `group_id` are serving well for the issue. — Romalpa Akzo, Apr 21 '19 at 23:02

Easiest way to create unique, transitive, ids based on existing ids

2 Answers2