0

Suppose I have a dataset that looks like the following:

obs id1 id2
1   a   1
2   b   2
3   c   2
4   d   3
5   e   4
6   b   5
7   f   6

I want to create a unique transitive id variable that for this dataset. Both id1 and id2 are used to identify individuals. So if individual X has the same id1 as individual Y or the same id2 as individual Y, then X=Y.

So, in this example, the intended output would look like this:

obs id1 id2 uniqid
1   a   1   1
2   b   2   2
3   c   2   2
4   d   3   3
5   e   4   4
6   b   5   2
7   f   6   5

Here, observation 6 has id1 "b", which was already assigned uniqid 2 (by observation 2), and so, observation 6 identifies the same individual as observation 2.

Now, comparing observation 3 and 6, we see that these observations share neither id1 nor id2, but still identifies the same individual, since they both identify the same individual as observation 2.

I am currently working in Stata and I was wondering what is the best way to go about doing this. I would prefer a Stata based solution, but I would also be interested in seeing R or Python solutions.

Atom Vayalinkal
  • 2,642
  • 7
  • 29
  • 37
  • Are these all the same? `a 1`, `b 1`, `b 2`, `c 2`. The first has nothing in common to the last one, but by transitivity the should share the same id. – R. Schifini Apr 20 '19 at 19:12
  • Yes, these are all the same. – Atom Vayalinkal Apr 20 '19 at 19:16
  • Try `library(igraph); g <- graph_from_data_frame(dat[, -1]); cg <- clusters(g)$membership; dat$unique_id <- cg[dat$id1]; dat` Where `dat` is the name of your data. – markus Apr 20 '19 at 19:51
  • Related / dupe: [Make a group_indices based on several columns](https://stackoverflow.com/questions/45079559/make-a-group-indices-based-on-several-columns) – markus Apr 20 '19 at 19:58

2 Answers2

2

The community-contributed Stata command group_id does what you want.

However, it leaves the non-matched identifiers unaltered:

clear
input obs str1 id1 id2
1   a   1
2   b   2
3   c   2
4   d   3
5   e   4
6   b   5
7   f   6
end

clonevar id = id2 
group_id id, match(id1)
sort obs 

list, separator(0)

     +----------------------+
     | obs   id1   id2   id |
     |----------------------|
  1. |   1     a     1    1 |
  2. |   2     b     2    2 |
  3. |   3     c     2    2 |
  4. |   4     d     3    3 |
  5. |   5     e     4    4 |
  6. |   6     b     5    2 |
  7. |   7     f     6    6 |
     +----------------------+

Here's another example to see that this is indeed the case:

clear
input obs str1 id1 id2
1   a   1
2   b   2
3   c   7
4   d   3
5   e   4
6   b   5
7   f   1
end

clonevar id = id2 
group_id id, match(id1)
sort obs

list, separator(0)

     +----------------------+
     | obs   id1   id2   id |
     |----------------------|
  1. |   1     a     1    1 |
  2. |   2     b     2    2 |
  3. |   3     c     7    7 |
  4. |   4     d     3    3 |
  5. |   5     e     4    4 |
  6. |   6     b     5    2 |
  7. |   7     f     1    1 |
     +----------------------+

If you want to have continuous identifiers after the match (that is 5 instead of 6 in the first example and 7 in the second), you will have to adjust those manually.

1
  1. The community-contributed group_twoway (available in SSC) is a direct solution for your issue. You should make sure that id1 and id2 having the same type (numeric or string) and there are no overlap values in their range. Below code works for your example.

    ssc install group_twoway

    gen id2_str = string(id2)
    group_twoway id1 id2_str, gen(id)
    
  2. One more step added to Pearly's solution (with group_id) would make it serving for your issue without manual adjustment.

    clonevar x = id1 
    group_id x, match(id2)
    
    egen id = group(x)
    
  3. If you might feel these packages are too "black box", below code is the ... coding language for your logic desire.

    egen x0 = group(id1)
    egen x1 = min(x0), by(id2)
    
    local i=1
    while r(N) <_N {
    egen x`=`i'+1' = min(x`i'), by(x`=`i'-1')
    count if x`i++' == x`i'
    }
    
    egen id = group(x`i')
    drop x*
    
Romalpa Akzo
  • 599
  • 1
  • 4
  • 12