How can I drop duplicates within a dataframe that has a column that's a numpy array?

Question

I'm trying to drop duplicates. It works with normal pandas columns, but I get an error when I try it on a column that holds numpy arrays:

new_df = new_df.drop_duplicates(subset=['ticker', 'year', 'embedding'])

I get this error:

4 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/algorithms.py in _factorize_array(values, na_sentinel, size_hint, na_value, mask)
    509     table = hash_klass(size_hint or len(values))
    510     uniques, codes = table.factorize(
--> 511         values, na_sentinel=na_sentinel, na_value=na_value, mask=mask
    512     )
    513 

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.factorize()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable._unique()

TypeError: unhashable type: 'numpy.ndarray'

Also if it helps here's how my data looks:

  ticker    year                                          embedding
0   a.us  2020.0  [0.0, 0.0, 0.0, 0.62235785, 0.0, 0.27049118, 0...
1   a.us  2020.0  [0.0, 0.0, 0.0, 0.62235785, 0.0, 0.27049118, 0..

I thought about casting to string, but I need the arrays in the pandas column to stay as numpy arrays, so I'm not sure how to remove duplicates cleanly here.

[Converting to str](https://stackoverflow.com/questions/43855462/pandas-drop-duplicates-method-not-working) seems to be a solution — jlesuffleur, Mar 11 '21 at 14:06
don't store numpy arrays within the cells. You can have 1000 columns, each for one component of the embeddings. — Quang Hoang, Mar 11 '21 at 15:32
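
One way to follow the spirit of the first comment without converting the column itself is to build a temporary hashable key from each array and deduplicate on that key, so the `embedding` column keeps its numpy arrays. The sketch below is only an illustration: the sample data and the helper column name `_emb_key` are made up, and it assumes every array has the same dtype and shape (`tobytes()` compares raw bytes, so for example `0.0` and `-0.0` would count as different).

```python
import numpy as np
import pandas as pd

# Made-up sample data; rows 0 and 1 are full duplicates
df = pd.DataFrame({
    "ticker": ["a.us", "a.us", "b.us"],
    "year": [2020.0, 2020.0, 2020.0],
    "embedding": [np.array([0.0, 0.5]), np.array([0.0, 0.5]), np.array([0.0, 0.5])],
})

# Hashable stand-in for each array: its raw bytes (hypothetical helper column)
emb_key = df["embedding"].map(lambda a: np.asarray(a).tobytes())

# Deduplicate on hashable columns only; the arrays themselves are never hashed
keys = df[["ticker", "year"]].assign(_emb_key=emb_key)
mask = keys.duplicated()

deduped = df[~mask]  # the embedding column still holds numpy arrays
```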

score 1 · Accepted Answer · answered Mar 11 '21 at 14:35

Here is what I would do:

>>> df
  ticker  year                                     embedding
0   a.us  2020  [0.0, 0.0, 0.0, 0.62235785, 0.0, 0.27049118]
1   a.us  2020  [0.0, 0.0, 0.0, 0.62235785, 0.0, 0.27049118]

>>> cond1 = df.drop(columns="embedding").duplicated()
>>> cond1
0    False
1     True
dtype: bool

>>> cond2 = pd.DataFrame(df["embedding"].to_list()).duplicated()
>>> cond2
0    False
1     True
dtype: bool

To remove the duplicate rows:

>>> df[~(cond1 & cond2)]
  ticker  year                                     embedding
0   a.us  2020  [0.0, 0.0, 0.0, 0.62235785, 0.0, 0.27049118]
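
One caveat with combining the two masks: `cond1 & cond2` flags a row when its scalar columns repeat some earlier row and its embedding repeats some earlier row, even if those are two different earlier rows. A possible variant, sketched here with made-up data and assuming all embeddings have the same length, expands the arrays into columns and checks everything jointly:

```python
import numpy as np
import pandas as pd

e1 = np.array([0.0, 0.62235785, 0.27049118])
e2 = np.array([1.0, 0.0, 0.0])

# Made-up data: row 2 is unique as a combination, row 3 repeats row 0
df = pd.DataFrame({
    "ticker": ["a.us", "b.us", "a.us", "a.us"],
    "year": [2020, 2020, 2020, 2020],
    "embedding": [e1, e2, e2.copy(), e1.copy()],
})

# The two separate masks, as in the answer above
cond1 = df.drop(columns="embedding").duplicated()
cond2 = pd.DataFrame(df["embedding"].to_list(), index=df.index).duplicated()
separate_mask = cond1 & cond2  # also flags row 2

# Joint check: one column per embedding component next to the scalar columns
expanded = pd.DataFrame(df["embedding"].to_list(), index=df.index)
joint = pd.concat([df.drop(columns="embedding"), expanded], axis=1)
joint_mask = joint.duplicated()  # flags only row 3

deduped = df[~joint_mask]
```

When the scalar columns are identical across all rows, as in the two-row example above, both masks give the same result.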

Corralien

Hi there, when I run `cond1 = new_df.drop(columns="embedding").duplicated()` I get the same error as above: TypeError: unhashable type: 'numpy.ndarray' – Lostsoul Mar 11 '21 at 15:02
What is the output of `new_df.info()`? – Corralien Mar 11 '21 at 17:36
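
`info()` reports array-holding columns only as `object` dtype, so it may not show which ones are the problem. If the error persists after dropping `embedding`, one possible cause is that another column also holds arrays. A minimal sketch for checking this, where the sample frame and the `other` column are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a second, unexpected array column
new_df = pd.DataFrame({
    "ticker": ["a.us", "a.us"],
    "year": [2020.0, 2020.0],
    "embedding": [np.array([0.0, 0.5]), np.array([0.0, 0.5])],
    "other": [np.array([1, 2]), np.array([1, 2])],
})

# List every column that contains at least one numpy array
array_cols = [
    col for col in new_df.columns
    if new_df[col].map(lambda v: isinstance(v, np.ndarray)).any()
]
print(array_cols)  # ['embedding', 'other']
```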
