0

I am preparing some data for k-means clustering. At the moment I have the id in 160 bit hash format (this is the format for bitcoin addresses).

d = {'Hash' : pd.Series(['1HYKGGzRHDskth2ecKZ2HYvxSvQ1p87m6', '3DndG5HuyP8Ep8p3V1i394AUxG4gtgsvoj', '1HYKGGzRHDskth2ecKZ2HYvxSvQ1p87m6']), 
     'X1' : pd.Series([111, 222, 333]),
     'X2' : pd.Series([111, 222, 333]),
     'X3' : pd.Series([111, 222, 333])
    }

df1 = (pd.DataFrame(d))
print(df1)

                                 Hash   X1   X2   X3
0   1HYKGGzRHDskth2ecKZ2HYvxSvQ1p87m6  111  111  111
1  3DndG5HuyP8Ep8p3V1i394AUxG4gtgsvoj  222  222  222
2   1HYKGGzRHDskth2ecKZ2HYvxSvQ1p87m6  333  333  333

In order to parse this data into the sklearn.cluster.KMeans¶ algorithm I need to covert the data to np.float or np.array (i think).

Therefore I want to convert the hashes to an integer value, maintaining the relationship across all rows.

This is my attempt:

#REPLACE HASH WITH INT
look_up = {}
count = 0
for index, row in df1.iterrows():
    count +=1
    if row['Hash'] not in look_up:
        look_up[row['Hash']] = count
    else:
        continue
print(look_up)

{'3DndG5HuyP8Ep8p3V1i394AUxG4gtgsvoj': 2, '1HYKGGzRHDskth2ecKZ2HYvxSvQ1p87m6': 1}

At this point I run through each of the dictionary and try to replace the hash value with the new integer value.

for index, row in df1.iterrows():
    for address, id_int in look_up.iteritems():
        if address == row['Hash']:            
            df1.set_value(index, row['Hash'], id_int)
print(df1)

Output:

Hash   X1   X2   X3  \
0   1HYKGGzRHDskth2ecKZ2HYvxSvQ1p87m6  111  111  111   
1  3DndG5HuyP8Ep8p3V1i394AUxG4gtgsvoj  222  222  222   
2   1HYKGGzRHDskth2ecKZ2HYvxSvQ1p87m6  333  333  333   

   1HYKGGzRHDskth2ecKZ2HYvxSvQ1p87m6  3DndG5HuyP8Ep8p3V1i394AUxG4gtgsvoj  
0                                1.0                                 NaN  
1                                NaN                                 2.0  
2                                1.0                                 NaN  

The output does not replace the hashed address with the integer value. How can I get the following output:

Expected output:

d = {'ID' : pd.Series([1, 2, 1]), 
     'X1' : pd.Series([111, 222, 333]),
     'X2' : pd.Series([111, 222, 333]),
     'X3' : pd.Series([111, 222, 333])
    }

df3 = (pd.DataFrame(d))
print(df3)

   ID   X1   X2   X3
0   1  111  111  111
1   2  222  222  222
2   1  333  333  333

As the hash is the same in row 0 and 2 the same integer id should replace the hash.

Is there a more efficient way of generating these unique ids? At the moment this code take a long time to run.

archienorman
  • 1,434
  • 3
  • 20
  • 36

3 Answers3

1

There are quite a few ways. One way would be to use Categorical codes, and another would be to rank them:

In [16]: df1["via_categ"] = pd.Categorical(df1.Hash).codes + 1

In [17]: df1["via_rank"] = df1["Hash"].rank(method="dense").astype(int)
In [18]: df1
Out[18]: 
                                 Hash   X1   X2   X3  via_categ  via_rank
0   1HYKGGzRHDskth2ecKZ2HYvxSvQ1p87m6  111  111  111          1         1
1  3DndG5HuyP8Ep8p3V1i394AUxG4gtgsvoj  222  222  222          2         2
2   1HYKGGzRHDskth2ecKZ2HYvxSvQ1p87m6  333  333  333          1         1

(You could have dropped the Hash column and created a new ID column equally easily.)

DSM
  • 342,061
  • 65
  • 592
  • 494
0
s = list(set(df1.Hash))
hash2 = dict(zip(s, range(1, len(s) + 1)))
df1.Hash = df1.Hash.map(hash2)
print(df1)

Output:

   Hash   X1   X2   X3
0     2  111  111  111
1     1  222  222  222
2     2  333  333  333
Israel Unterman
  • 13,158
  • 4
  • 28
  • 35
0

You can use sklearn.preprocessing.LabelEncoder:

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(df1['Hash'])
df1['Hash'] = le.transform(df1['Hash'])

Resulting Output:

   Hash   X1   X2   X3
0     0  111  111  111
1     1  222  222  222
2     0  333  333  333

Also, note that this gives you an easy way to revert back to the original hash by using inverse_transform:

df1['Hash'] = le.inverse_transform(df1['Hash'])
root
  • 32,715
  • 6
  • 74
  • 87
  • is it possible to pre-process the data for k-means where one of the columns is a list? the row would look like this: `0 111 [5, 6] 222 333` – archienorman Aug 22 '16 at 18:32
  • I don't think so. You'd probably want to split the list elements into different rows, i.e. http://stackoverflow.com/questions/38428796/how-to-do-lateral-view-explode-in-pandas – root Aug 22 '16 at 19:09
  • does that mean converting all rows to columns first? at the moment each data record is represented by a row – archienorman Aug 22 '16 at 19:40
  • I think I'm confused as to exactly what you're asking. I see you've written a new question regarding this, which is probably the best course of action. – root Aug 22 '16 at 22:53