I'm trying to drop duplicates. It works with normal pandas columns, but I get an error when I try it on a column whose values are NumPy arrays:
new_df = new_df.drop_duplicates(subset=['ticker', 'year', 'embedding'])
I get this error:
4 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/algorithms.py in _factorize_array(values, na_sentinel, size_hint, na_value, mask)
509 table = hash_klass(size_hint or len(values))
510 uniques, codes = table.factorize(
--> 511 values, na_sentinel=na_sentinel, na_value=na_value, mask=mask
512 )
513
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.factorize()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable._unique()
TypeError: unhashable type: 'numpy.ndarray'
Also, if it helps, here's what my data looks like:
ticker year embedding
0 a.us 2020.0 [0.0, 0.0, 0.0, 0.62235785, 0.0, 0.27049118, 0...
1 a.us 2020.0 [0.0, 0.0, 0.0, 0.62235785, 0.0, 0.27049118, 0..
I thought about casting to string, but I need the values in the embedding column to stay as NumPy arrays, so I'm not sure how to remove duplicates cleanly here.
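One direction that might work is to build a temporary hashable key from each array, use it only to find the duplicates, and then filter the original frame so the embedding column is never converted. The sketch below is a rough illustration, not a definitive fix. The sample data is made up, `_emb_key` is a hypothetical helper column name, and it assumes every array has the same dtype and shape, since `tobytes()` compares raw bytes and therefore only treats bit-for-bit identical arrays as equal.

```python
import numpy as np
import pandas as pd

# Made-up sample data shaped like the frame in the question.
new_df = pd.DataFrame({
    "ticker": ["a.us", "a.us", "b.us"],
    "year": [2020.0, 2020.0, 2020.0],
    "embedding": [
        np.array([0.0, 0.62235785, 0.27049118]),
        np.array([0.0, 0.62235785, 0.27049118]),
        np.array([0.0, 0.62235785, 0.27049118]),
    ],
})

# Hashable stand-in for each array: its raw bytes.
# Assumption: all arrays share one dtype and shape.
key = new_df["embedding"].apply(lambda arr: arr.tobytes())

# Flag duplicates on the scalar columns plus the temporary key.
# "_emb_key" is a hypothetical name and exists only in this throwaway frame.
dup_mask = new_df[["ticker", "year"]].assign(_emb_key=key).duplicated()

# Filter the original frame; the embedding values are untouched ndarrays.
deduped = new_df[~dup_mask]
```

Because the key lives only in a throwaway frame, `deduped` keeps the original three columns and each embedding is still a `numpy.ndarray`. If the arrays should count as equal within a tolerance rather than exactly, a bytes key would not be enough and the values would need rounding first.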