2

I have a table of ids, and previous ids (see image 1), I want to count the number of unique ids in total linked in one chain, e.g. if we take the latest id as the 'parent' then the result for the example data below would be something like Image 2, where 'a' is linked to 5 total ids (a, b, c, d & e) and 'w' is linked to 4 ids (w, x, y & z). In practicality, I am dealing with randomly generated ids, not sequenced letters.

Image1 - Raw Data

Tamge2 - Output

Python Code to produce example dataframes:

import pandas as pd

raw_data = pd.DataFrame([['a','b'], ['b','c'], ['c', 'd'],['d','e'],['e','-'],
                         ['w','x'], ['x', 'y'], ['y','z'], ['z','-']], columns=['id', 'previous_id'])

output = pd.DataFrame([['a',5],['w',4]], columns = ['parent_id','linked_ids'])
  • 3
    Please provide a small set of sample data in the form of text that we can copy and paste. Include the corresponding desired result. Check out the guide on [how to make good reproducible pandas examples](https://stackoverflow.com/a/20159305/3620003). – timgeb Jun 01 '20 at 09:17

1 Answers1

0

Use convert_matrix.from_pandas_edgelist with connected_components first, then create dictionary for mapping, get first mapped values per groups by Series.map filtered by Series.duplicated and last add new column by Series.map with Counter for mapp dictionary:

df1 = raw_data[raw_data['previous_id'].ne('-')]

import networkx as nx
from collections import Counter

g = nx.from_pandas_edgelist(df1,'id','previous_id')

connected_components = nx.connected_components(g)

d = {y:i for i, x in enumerate(connected_components) for y in x}
print (d)
{'c': 0, 'e': 0, 'b': 0, 'd': 0, 'a': 0, 'y': 1, 'x': 1, 'w': 1, 'z': 1}

c = Counter(d.values())
mapp = {k: c[v] for k, v in d.items()}
print (mapp)

{'c': 5, 'e': 5, 'b': 5, 'd': 5, 'a': 5, 'y': 4, 'x': 4, 'w': 4, 'z': 4}

df = (raw_data.loc[~raw_data['id'].map(d).duplicated(), ['id']]
            .rename(columns={'id':'parent_id'})
            .assign(linked_ids = lambda x: x['parent_id'].map(mapp)))

print (df)
  parent_id  linked_ids
0         a           5
5         w           4
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252