I am preparing some data for k-means clustering. At the moment the ID is stored as a string in 160-bit hash format (the format used for Bitcoin addresses).
import pandas as pd

d = {'Hash': pd.Series(['1HYKGGzRHDskth2ecKZ2HYvxSvQ1p87m6', '3DndG5HuyP8Ep8p3V1i394AUxG4gtgsvoj', '1HYKGGzRHDskth2ecKZ2HYvxSvQ1p87m6']),
     'X1': pd.Series([111, 222, 333]),
     'X2': pd.Series([111, 222, 333]),
     'X3': pd.Series([111, 222, 333])
     }
df1 = pd.DataFrame(d)
print(df1)
                                 Hash   X1   X2   X3
0   1HYKGGzRHDskth2ecKZ2HYvxSvQ1p87m6  111  111  111
1  3DndG5HuyP8Ep8p3V1i394AUxG4gtgsvoj  222  222  222
2   1HYKGGzRHDskth2ecKZ2HYvxSvQ1p87m6  333  333  333
In order to pass this data to the sklearn.cluster.KMeans algorithm I need to convert it to np.float or np.array (I think).
Therefore I want to convert the hashes to integer values, so that identical hashes map to the same integer across all rows.
This is my attempt:
# REPLACE HASH WITH INT
look_up = {}
count = 0
for index, row in df1.iterrows():
    count += 1
    if row['Hash'] not in look_up:
        look_up[row['Hash']] = count
    else:
        continue
print(look_up)
{'3DndG5HuyP8Ep8p3V1i394AUxG4gtgsvoj': 2, '1HYKGGzRHDskth2ecKZ2HYvxSvQ1p87m6': 1}
At this point I loop over each entry of the dictionary and try to replace the hash value with the new integer value.
for index, row in df1.iterrows():
    for address, id_int in look_up.iteritems():
        if address == row['Hash']:
            df1.set_value(index, row['Hash'], id_int)
print(df1)
Output:
                                 Hash   X1   X2   X3  \
0   1HYKGGzRHDskth2ecKZ2HYvxSvQ1p87m6  111  111  111
1  3DndG5HuyP8Ep8p3V1i394AUxG4gtgsvoj  222  222  222
2   1HYKGGzRHDskth2ecKZ2HYvxSvQ1p87m6  333  333  333

   1HYKGGzRHDskth2ecKZ2HYvxSvQ1p87m6  3DndG5HuyP8Ep8p3V1i394AUxG4gtgsvoj
0                                1.0                                 NaN
1                                NaN                                 2.0
2                                1.0                                 NaN
The hashed address is not replaced with the integer value; instead, a new column is added for each distinct hash. How can I get the output below?
Expected output:
d = {'ID': pd.Series([1, 2, 1]),
     'X1': pd.Series([111, 222, 333]),
     'X2': pd.Series([111, 222, 333]),
     'X3': pd.Series([111, 222, 333])
     }
df3 = pd.DataFrame(d)
print(df3)
   ID   X1   X2   X3
0   1  111  111  111
1   2  222  222  222
2   1  333  333  333
As the hash is the same in rows 0 and 2, the same integer ID should replace the hash in both rows.
Is there a more efficient way of generating these unique IDs? At the moment this code takes a long time to run.
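For reference, here is a minimal sketch of one vectorised way to build such IDs without looping over rows. It assumes a reasonably recent pandas (for `pd.factorize` and `DataFrame.to_numpy`); the names `codes`, `uniques` and `X` are illustrative and do not appear in the code above. It may not be the best fit for every dataset.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Hash': ['1HYKGGzRHDskth2ecKZ2HYvxSvQ1p87m6',
             '3DndG5HuyP8Ep8p3V1i394AUxG4gtgsvoj',
             '1HYKGGzRHDskth2ecKZ2HYvxSvQ1p87m6'],
    'X1': [111, 222, 333],
    'X2': [111, 222, 333],
    'X3': [111, 222, 333],
})

# pd.factorize returns (codes, uniques): codes are integers starting at 0,
# assigned in order of first appearance, so equal hashes get equal codes.
codes, uniques = pd.factorize(df1['Hash'])

# Drop the string column and put a 1-based integer ID in its place.
df3 = df1.drop(columns='Hash')
df3.insert(0, 'ID', codes + 1)
print(df3)

# All columns are numeric now, so the frame converts to a float array.
X = df3.to_numpy(dtype=float)
```

`uniques` keeps the original hashes in code order, so `uniques[id - 1]` recovers the address for a given ID if it is needed later.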