Creating a unique id through indirect relation python

Question

My question is that creating a uniqe code for the data which has direct relation, and through that direct relation other indirect relation is created for the first and last one. All in all if there such a case I want to assign a unique for the one which has direct and indirect realtion. Below you may find a basic dataset.

Product References  Unique Key
A   1   A1
A   2   A1
A   3   A1
A   a   A1
A   b   A1
A   c   A1
B   1   A1
B   4   A1
B   5   A1
B   x   A1
B   y   A1
B   z   A1
C   6   A1
C   7   A1
C   8   A1
C   x   A1
C   f   A1
C   h   A1
D   12  A2
D   13  A2
D   14  A2
D   15  A2
D   16  A2
D   17  A2
D   18  A2
F   13  A2
F   AB  A2
F   FF  A2
F   NB  A2
F   45  A2
F   63  A2
F   100 A2
F   98  A2
F   FGA A2
F   CA  A2

As you may see A has A relation with B but not with C. But B has a relation with C. In this case actually A and C is also related. And the en d of the dat Iwant to assing a key to those products.

And you may consider data consists of millions of row and each time a new relation is found a new iteration should run.

Explain problem, knowledge request, coding and library recommandations. How to create algortihm.

please provide a [minimal reproducible example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — mozway, Jun 09 '23 at 20:06
I tried to provide a simple data set thanks for the quick reply — gartuns, Jun 09 '23 at 20:27

mozway · Accepted Answer · 2023-06-09T21:04:54.437

With networkx, you can use nx.connected_components and map:

import networkx as nx

# create graph
G = nx.from_pandas_edgelist(df, source='Product', target='References')

# find connected components, map unique key per group
keys = {n: f'A{i}' for i, s in enumerate(nx.connected_components(G), start=1)
        for n in s}

# map unique key based on product group
df['Unique Key'] = df['Product'].map(keys)

To improve memory efficiency, you can also pre-filter the nodes:

# only keep nodes that connect to 2 products or more
G = nx.from_pandas_edgelist(df[df['References'].duplicated(keep=False)], source='Product', target='References')

products = set(df['Product'])
G.add_nodes_from(products) # ensure all products are in the graph
keys = {n: f'A{i}' for i, s in enumerate(nx.connected_components(G), start=1)
        for n in s&products}

df['Unique Key'] = df['Product'].map(keys)

Output:

   Product References Unique Key
0        A          1         A1
1        A          2         A1
2        A          3         A1
3        A          a         A1
4        A          b         A1
5        A          c         A1
6        B          1         A1
7        B          4         A1
8        B          5         A1
9        B          x         A1
10       B          y         A1
11       B          z         A1
12       C          6         A1
13       C          7         A1
14       C          8         A1
15       C          x         A1
16       C          f         A1
17       C          h         A1
18       D         12         A2
19       D         13         A2
20       D         14         A2
21       D         15         A2
22       D         16         A2
23       D         17         A2
24       D         18         A2
25       F         13         A2
26       F         AB         A2
27       F         FF         A2
28       F         NB         A2
29       F         45         A2
30       F         63         A2
31       F        100         A2
32       F         98         A2
33       F        FGA         A2
34       F         CA         A2

The graph:

Simplified graph in the optimized version:

Thank you. import networkx as nx ModuleNotFoundError: No module named 'networkx' I recieve this error even networkx is installed — gartuns, Jun 09 '23 at 22:05
Ensure it's installed (`pip install networkx` in the command line) **for the current python version** (check it with `import sys ; print(sys.version)`) — mozway, Jun 09 '23 at 22:08
In my dictionary comprehension, I loop over `nx.connected_components(G)`, then combine the results in a single dictionary. Instead you could handle the subgraphs separately using a classical loop. — mozway, Jun 18 '23 at 08:00

Creating a unique id through indirect relation python

1 Answers1