1

My question is that creating a uniqe code for the data which has direct relation, and through that direct relation other indirect relation is created for the first and last one. All in all if there such a case I want to assign a unique for the one which has direct and indirect realtion. Below you may find a basic dataset.

Product References  Unique Key
A   1   A1
A   2   A1
A   3   A1
A   a   A1
A   b   A1
A   c   A1
B   1   A1
B   4   A1
B   5   A1
B   x   A1
B   y   A1
B   z   A1
C   6   A1
C   7   A1
C   8   A1
C   x   A1
C   f   A1
C   h   A1
D   12  A2
D   13  A2
D   14  A2
D   15  A2
D   16  A2
D   17  A2
D   18  A2
F   13  A2
F   AB  A2
F   FF  A2
F   NB  A2
F   45  A2
F   63  A2
F   100 A2
F   98  A2
F   FGA A2
F   CA  A2

As you may see A has A relation with B but not with C. But B has a relation with C. In this case actually A and C is also related. And the en d of the dat Iwant to assing a key to those products.

And you may consider data consists of millions of row and each time a new relation is found a new iteration should run.

Explain problem, knowledge request, coding and library recommandations. How to create algortihm.

gartuns
  • 47
  • 6
  • 1
    please provide a [minimal reproducible example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – mozway Jun 09 '23 at 20:06
  • I tried to provide a simple data set thanks for the quick reply – gartuns Jun 09 '23 at 20:27

1 Answers1

1

With networkx, you can use nx.connected_components and map:

import networkx as nx

# create graph
G = nx.from_pandas_edgelist(df, source='Product', target='References')

# find connected components, map unique key per group
keys = {n: f'A{i}' for i, s in enumerate(nx.connected_components(G), start=1)
        for n in s}

# map unique key based on product group
df['Unique Key'] = df['Product'].map(keys)

To improve memory efficiency, you can also pre-filter the nodes:

# only keep nodes that connect to 2 products or more
G = nx.from_pandas_edgelist(df[df['References'].duplicated(keep=False)], source='Product', target='References')

products = set(df['Product'])
G.add_nodes_from(products) # ensure all products are in the graph
keys = {n: f'A{i}' for i, s in enumerate(nx.connected_components(G), start=1)
        for n in s&products}

df['Unique Key'] = df['Product'].map(keys)

Output:

   Product References Unique Key
0        A          1         A1
1        A          2         A1
2        A          3         A1
3        A          a         A1
4        A          b         A1
5        A          c         A1
6        B          1         A1
7        B          4         A1
8        B          5         A1
9        B          x         A1
10       B          y         A1
11       B          z         A1
12       C          6         A1
13       C          7         A1
14       C          8         A1
15       C          x         A1
16       C          f         A1
17       C          h         A1
18       D         12         A2
19       D         13         A2
20       D         14         A2
21       D         15         A2
22       D         16         A2
23       D         17         A2
24       D         18         A2
25       F         13         A2
26       F         AB         A2
27       F         FF         A2
28       F         NB         A2
29       F         45         A2
30       F         63         A2
31       F        100         A2
32       F         98         A2
33       F        FGA         A2
34       F         CA         A2

The graph:

enter image description here

Simplified graph in the optimized version:

enter image description here

mozway
  • 194,879
  • 13
  • 39
  • 75
  • Thank you. import networkx as nx ModuleNotFoundError: No module named 'networkx' I recieve this error even networkx is installed – gartuns Jun 09 '23 at 22:05
  • 1
    Ensure it's installed (`pip install networkx` in the command line) **for the current python version** (check it with `import sys ; print(sys.version)`) – mozway Jun 09 '23 at 22:08
  • @Mozwat do you how can graph only one unique id – gartuns Jun 17 '23 at 21:59
  • In my dictionary comprehension, I loop over `nx.connected_components(G)`, then combine the results in a single dictionary. Instead you could handle the subgraphs separately using a classical loop. – mozway Jun 18 '23 at 08:00