0

For example, I have a pandas dataframe like this :

Original Dataframe

Ignoring the "Name" column, I want a dataframe that looks like this, labelling the Hashes of the same group with their "ID"

Processed Image

Here, we traverse each row, we encounter "8a43", and assign ID 1 to it, and wherever we find the same hash value, we assign ID as 1. Then we move on to the next row, and encounter 79e2 and b183. We then traverse all the rows and wherever we find these values, we store their IDs as 2. Now the issue will arise when we reach "abc7". It will be assigned ID=5 as it was previously encountered in "abc5". But I also want that in rows after the current one, wherever I find "26ea", assign the ID=5 to those as well.

I hope all this makes sense. If not, feel free to reach out to me via comments or message. I will clear it out quickly.

Chintan Mehta
  • 129
  • 1
  • 9

2 Answers2

1

Use networkx solution for dictionary for common values, select first value in Hash_Value by str and use Series.map:

#if necessary convert to lists
#df['Hash_Value'] = df['Hash_Value'].str.strip('[]').str.split(', ')

import networkx as nx

G=nx.Graph()
for l in df['Hash_Value']:
    nx.add_path(G, l)

new = list(nx.connected_components(G))

print (new)
[{'8a43'}, {'79e2', 'b183'}, {'f82a'}, {'5ea9', '1cee', '26ea', 'eaa7'}]

mapped =  {node: cid for cid, component in enumerate(new) for node in component}

df['ID'] = df['Hash_Value'].str[0].map(mapped) + 1

print (df)
           Hash_Value   Name  ID
0              [8a43]   abcl   1
1        [79e2, b183]   abc2   2
2              [f82a]   abc3   3
3              [b183]   abc4   2
4  [eaa7, 5ea9, 1cee]   abc5   4
5              [5ea9]   abc6   4
6        [1cee, 26ea]   abc7   4
7              [79e2]   abc8   2
8              [8a43]   abc9   1
9              [26ea]  abc10   4
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

Solution using dict

import numpy as np
import pandas as pd

hashvalues = list(df['Hash_Value'])

dic, i = {}, 1
id_list = []
for hashlist in hashvalues:
    # convert to list
    if isinstance(hashlist, str):
        hashlist = hashlist.replace('[','').replace(']', '')
        hashlist = hashlist.split(',')

        # check if the hash is unknown
        if hashlist[0] not in dic:
            # Assign a new id
            dic[hashlist[0]] = i
            k = i
            i += 1
        else:
            # if known use existing id
            k = dic[hashlist[0]]
            
        for h in hashlist[1:]:
            # set id of the rest of the list hashes
            # equal to the first hashes's id
            dic[h] = k
            
        id_list.append(k)
    else:
        id_list.append(np.nan)
    
     print(df)

               Hash   Name  ID
0            [8a43]   abc1   1
1       [79e2,b183]   abc2   2
2            [f82a]   abc3   3
3            [b183]   abc4   2
4  [eaa7,5ea9,1cee]   abc5   4
5            [5ea9]   abc6   4
6       [1cee,26ea]   abc7   4
7            [79e2]   abc8   2
8            [8a43]   abc9   1
9            [26ea]  abc10   4
Julkar9
  • 1,508
  • 1
  • 12
  • 24
  • How to handle the null values in it? Im getting an error " 'NoneType' object has no attribute 'replace' " on line 7 Such that the order of the dataframe is not messed up – Chintan Mehta Dec 20 '21 at 12:36
  • @ChintanMehta What will be the corresponding ID if hash value is null ? – Julkar9 Dec 20 '21 at 12:39
  • 1
    I resolved it, Thanks! I added an if-else statement, that when the Hash value is null, append null to id_list as well – Chintan Mehta Dec 20 '21 at 12:50