
Hi, this is really confusing me. I am running a single command on a large DataFrame:

df.duplicated(subset=None, keep='first')

This looks identical to the call signature shown in the documentation:

DataFrame.duplicated(subset=None, keep='first')

I'm just calling it on df instead; however, all I get back is the following traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-53-529f7b7a97fb> in <module>()
----> 1 df.duplicated(subset=None, keep='first')

/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in duplicated(self, subset, keep)
   4383         vals = (col.values for name, col in self.iteritems()
   4384                 if name in subset)
-> 4385         labels, shape = map(list, zip(*map(f, vals)))
   4386 
   4387         ids = get_group_index(labels, shape, sort=False, xnull=False)

/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in f(vals)
   4364         def f(vals):
   4365             labels, shape = algorithms.factorize(
-> 4366                 vals, size_hint=min(len(self), _SIZE_HINT_LIMIT))
   4367             return labels.astype('i8', copy=False), len(shape)
   4368 

/anaconda3/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    176                 else:
    177                     kwargs[new_arg_name] = new_arg_value
--> 178             return func(*args, **kwargs)
    179         return wrapper
    180     return _deprecate_kwarg

/anaconda3/lib/python3.7/site-packages/pandas/core/algorithms.py in factorize(values, sort, order, na_sentinel, size_hint)
    628                                            na_sentinel=na_sentinel,
    629                                            size_hint=size_hint,
--> 630                                            na_value=na_value)
    631 
    632     if sort and len(uniques) > 0:

/anaconda3/lib/python3.7/site-packages/pandas/core/algorithms.py in _factorize_array(values, na_sentinel, size_hint, na_value)
    474     uniques = vec_klass()
    475     labels = table.get_labels(values, uniques, 0, na_sentinel,
--> 476                               na_value=na_value)
    477 
    478     labels = _ensure_platform_int(labels)

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_labels()

TypeError: unhashable type: 'list'

What am I doing wrong?

iFunction
  • Not an exact duplicate, but you may be able to adapt the [answers from this question](https://stackoverflow.com/questions/50418139/pandas-drop-duplicates-on-elements-made-of-lists) – G. Anderson Aug 14 '19 at 17:45

1 Answer


From what I can understand, you have lists in your DataFrame, and Python (and therefore pandas) cannot hash lists. You may have noticed this if you have ever tried to use a list as a dictionary key. A simple workaround is to convert the lists to tuples, which are hashable.
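For illustration, a minimal sketch of that workaround (the column names device and errors are made up here, not taken from the question): convert any list values to tuples, and duplicated() no longer raises.

import pandas as pd

# Hypothetical example: one column holds lists, which are unhashable.
df = pd.DataFrame({
    "device": ["a", "a", "b"],
    "errors": [[1, 2], [1, 2], [3]],
})

# df.duplicated() would raise "TypeError: unhashable type: 'list'" here.
# Converting the lists to tuples makes the rows hashable again.
df["errors"] = df["errors"].apply(
    lambda x: tuple(x) if isinstance(x, list) else x
)

print(df.duplicated(subset=None, keep='first'))
# 0    False
# 1     True
# 2    False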

Parijat Bhatt
  • Ah, there shouldn't be any lists in the data; it's a 2.6 GB CSV file with 28 known columns. I can find duplicates if I subset the data into groups of columns, but it's the whole file I want to check, to make sure our plugin is not double reporting. – iFunction Aug 15 '19 at 07:20
  • Can you write a custom function and check whether any value in the DataFrame is a list (a sketch of such a check follows these comments)? If not, please post an example DataFrame and column values. – Parijat Bhatt Aug 15 '19 at 16:54
  • I know this data very well; there are no fields that could possibly be a list, as it is just error data from devices that our company has configured. I will, however, try to re-ingest the data, check each column for a possible list, and try again, but any rows containing a list will have to be dropped, since list data can't be used; our systems expect strings only. – iFunction Aug 16 '19 at 11:12
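For reference, a minimal sketch of the kind of check suggested in the comments above (the helper name find_list_columns is hypothetical, and the CSV path is a placeholder):

import pandas as pd

def find_list_columns(frame: pd.DataFrame) -> list:
    """Return the names of columns that contain at least one list value."""
    return [
        col for col in frame.columns
        if frame[col].map(lambda v: isinstance(v, list)).any()
    ]

# Hypothetical usage on the DataFrame loaded from the CSV:
# df = pd.read_csv("errors.csv")
# print(find_list_columns(df))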