import pandas as pd

data={'x':['A','A','B','B','C','E','F'],
      'y':['B','C','A','C','D','F','G']}
df=pd.DataFrame(data)

print(df)

I have a big dataframe like this one (simplified with ABC):

     x    y
0    A    B
1    A    C
2    B    A
3    B    C
4    C    D
5    E    F
6    F    G

There are "loops" such as row 0 (A <-> B) and row 2 (B <-> A), which represent the same relation for me.

I want to find all values that are related to each other through the x and y columns and give each group of related values a new unique id.

So for this example table this means:

A = B = C = D: give this group a unique id, e.g. 90
E = F = G: give this group a unique id, e.g. 91

The result table I need should be:

    id  value
0   90    A
1   90    B
2   90    C 
3   90    D
4   91    E
5   91    F
6   91    G

How can I achieve this with pandas? Help will be very much appreciated!

Scott Boston
steff9488

1 Answer


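This looks like a graph problem, which the networkx library can handle. Treat each row as an edge between x and y; the groups you want are then the connected components of that graph (see https://en.wikipedia.org/wiki/Connected_component_(graph_theory)).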

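import pandas as pd
import networkx as nx

data = {'x': ['A', 'A', 'B', 'B', 'C', 'E', 'F'],
        'y': ['B', 'C', 'A', 'C', 'D', 'F', 'G']}
df = pd.DataFrame(data)

# Each row becomes an undirected edge, so A-B and B-A are the same relation
G = nx.from_pandas_edgelist(df, 'x', 'y')

# One Series per connected component, indexed by the component number.
# Collecting the pieces and concatenating once avoids starting from an
# empty pd.Series(), which triggers warnings in recent pandas versions.
parts = [pd.Series(sorted(n), index=[i] * len(n))
         for i, n in enumerate(nx.connected_components(G))]
S = pd.concat(parts)

S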

Output:

0    A
0    B
0    C
0    D
1    E
1    F
1    G
dtype: object
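
To get the exact id/value layout from the question, one possible sketch numbers the components and adds an offset. The starting id 90 comes from the question's example; the `start_id` name and the `id` / `value` column names are assumptions based on the desired output, not part of the answer above.

```python
import pandas as pd
import networkx as nx

data = {'x': ['A', 'A', 'B', 'B', 'C', 'E', 'F'],
        'y': ['B', 'C', 'A', 'C', 'D', 'F', 'G']}
df = pd.DataFrame(data)

G = nx.from_pandas_edgelist(df, 'x', 'y')

start_id = 90  # assumption: first id, taken from the question's example

# One (id, value) row per node; components are numbered in the order
# networkx yields them, and nodes are sorted within each component
rows = [(start_id + i, node)
        for i, component in enumerate(nx.connected_components(G))
        for node in sorted(component)]
result = pd.DataFrame(rows, columns=['id', 'value'])

print(result)
```

With real data, any id scheme works as long as each component gets a distinct value; the offset is only there to match the example.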
Scott Boston
  • This looks brilliant! Gonna test it tomorrow with the real data. I'll let you know if it worked out. Thank you very much! – steff9488 Jul 24 '18 at 20:00
  • Indeed brilliant, nice catch using graphs – rafaelc Jul 24 '18 at 20:03
  • It worked fine. I just added some lines to get it into a DataFrame in the format I want, with the index in a separate column: `result = pd.DataFrame(S)` `result = result.rename(columns={0:'Other_ID'})` `result['New_ID'] = S.index` ~Cheers – steff9488 Jul 25 '18 at 09:16
  • One more thing, the theory behind this is a breadth-first search: [connected component graph theory wikipedia](https://en.wikipedia.org/wiki/Connected_component_(graph_theory)) – steff9488 Jul 25 '18 at 09:22
  • @steff9488 Awesome. Good find on that wiki page. I'll add it to the text of this answer. Happy coding! – Scott Boston Jul 25 '18 at 12:42
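
The breadth-first search mentioned in the comments can be sketched without networkx. This is only an illustration of the idea, not how networkx is necessarily implemented internally; `bfs_components` is a hypothetical helper name, and the edge list is the example data from the question.

```python
from collections import defaultdict, deque

edges = [('A', 'B'), ('A', 'C'), ('B', 'A'), ('B', 'C'),
         ('C', 'D'), ('E', 'F'), ('F', 'G')]

# Undirected adjacency: A-B and B-A collapse into the same relation
adjacency = defaultdict(set)
for x, y in edges:
    adjacency[x].add(y)
    adjacency[y].add(x)


def bfs_components(adjacency):
    """Return one sorted list of nodes per connected component."""
    seen = set()
    components = []
    for start in list(adjacency):
        if start in seen:
            continue  # already reached from an earlier start node
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            # Enqueue every neighbour that has not been visited yet
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return components


components = bfs_components(adjacency)
print(components)  # [['A', 'B', 'C', 'D'], ['E', 'F', 'G']]
```

Each start node that has not been seen yet opens a new component, and the queue then pulls in everything reachable from it, which is why A, B, C and D end up together even though D only appears in a single row.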