I'm trying to drop duplicates. This works with normal pandas columns, but I get an error when one of the subset columns holds numpy arrays:

new_df = new_df.drop_duplicates(subset=['ticker', 'year', 'embedding'])

I get this error:

4 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/algorithms.py in _factorize_array(values, na_sentinel, size_hint, na_value, mask)
    509     table = hash_klass(size_hint or len(values))
    510     uniques, codes = table.factorize(
--> 511         values, na_sentinel=na_sentinel, na_value=na_value, mask=mask
    512     )
    513 

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.factorize()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable._unique()

TypeError: unhashable type: 'numpy.ndarray'

Also if it helps here's how my data looks:

  ticker    year                                          embedding
0   a.us  2020.0  [0.0, 0.0, 0.0, 0.62235785, 0.0, 0.27049118, 0...
1   a.us  2020.0  [0.0, 0.0, 0.0, 0.62235785, 0.0, 0.27049118, 0..

I thought about casting to string, but I need the arrays in the pandas column to stay as numpy arrays, so I'm not sure how to remove duplicates cleanly here.

Lostsoul
  • [Converting to str](https://stackoverflow.com/questions/43855462/pandas-drop-duplicates-method-not-working) seems to be a solution. – jlesuffleur Mar 11 '21 at 14:06
  • Don't store numpy arrays within the cells. You can have 1000 columns, one for each component of the embeddings. – Quang Hoang Mar 11 '21 at 15:32

1 Answer

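Here is what I would do: check the hashable columns and the array column for duplicates separately, then combine the two masks.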

>>> df
  ticker  year                                     embedding
0   a.us  2020  [0.0, 0.0, 0.0, 0.62235785, 0.0, 0.27049118]
1   a.us  2020  [0.0, 0.0, 0.0, 0.62235785, 0.0, 0.27049118]

>>> cond1 = df.drop(columns="embedding").duplicated()
>>> cond1
0    False
1     True
dtype: bool

>>> cond2 = pd.DataFrame(df["embedding"].to_list()).duplicated()
>>> cond2
0    False
1     True
dtype: bool

To remove the duplicate rows, keep only the rows where the two conditions are not both true:

>>> df[~(cond1 & cond2)]
  ticker  year                                     embedding
0   a.us  2020  [0.0, 0.0, 0.0, 0.62235785, 0.0, 0.27049118]
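
One caveat with `cond1 & cond2`: the two conditions are checked independently, so a row can be flagged when its ticker/year matches one earlier row and its embedding matches a different earlier row. A possible alternative, sketched below, builds a hashable tuple key from each array and checks all three columns together, while the arrays themselves stay as numpy. The sample data and the `_key` column name are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: row 3 repeats row 0; row 2 shares its
# ticker/year with row 0 and its embedding with row 1, but is unique.
df = pd.DataFrame({
    "ticker": ["a.us", "b.us", "a.us", "a.us"],
    "year": [2020, 2020, 2020, 2020],
    "embedding": [
        np.array([0.0, 0.5]),
        np.array([1.0, 0.0]),
        np.array([1.0, 0.0]),
        np.array([0.0, 0.5]),
    ],
})

# Hashable stand-in for each array; the "embedding" column is not modified.
key = df["embedding"].map(lambda arr: tuple(arr))

# Compare ticker, year and the tuple key together, row by row.
# "_key" is a temporary helper column that never reaches the result.
is_dup = df.assign(_key=key).duplicated(subset=["ticker", "year", "_key"])
deduped = df[~is_dup]

# For comparison, the independent masks from the approach above.
cond1 = df.drop(columns="embedding").duplicated()
cond2 = pd.DataFrame(df["embedding"].to_list()).duplicated()
independent = df[~(cond1 & cond2)]
```

With this sample, `deduped` keeps rows 0, 1 and 2, whereas `independent` also drops row 2. If the data never mixes tickers and embeddings this way, both approaches give the same result.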
Corralien