I compared pandas.unique() with numpy.unique() and found that the latter actually surpasses the former.
I am not sure whether the advantage grows linearly with the size of the input.

Can anyone please tell me why such a difference exists, in terms of how the two are implemented? In which cases should I use each?

Tomerikoo
Songcheng Li
  • I don't have a direct answer, as I never dug deep enough, but pandas calls out the speed of its `.unique()` in the documentation itself. https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.unique.html – dozyaustin Nov 15 '18 at 00:06
  • `unique` doesn't make special use of numpy multidimensionality. It's a very different kind of operation than sum and multiply. It sorts a 1d array, and then looks for adjacent duplicates. np.lib.arraysetops._unique1d – hpaulj Nov 15 '18 at 01:32
  • Also, np.unique offers a lot more than pandas unique, such as returning the indices where the unique values were found, the inverse needed to reconstruct the original array, and the counts of each unique value. – NaN Nov 15 '18 at 02:56
  • @hpaulj - as per the documentation pointed out by @dozyaustin, the speed of `pandas` `unique()` does not come from sorting (which would cost a lot of additional memory and time before even the first unique element could be accessed). Rather, the operation uses a hash map to track the elements it has already visited as it iterates over the data. – sophros Nov 15 '18 at 07:20

1 Answer

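np.unique() treats the data as a flat array: it sorts the values and then scans for adjacent duplicates, so its cost is dominated by the sort and its result comes back sorted.

pd.unique(), by contrast, does not sort. As its documentation notes, it is hash-table based: it makes one pass over the values and records each value the first time it is seen, so the uniques come back in order of appearance. Skipping the sort is why it is usually reported as faster on large inputs. Use np.unique() when you need sorted output or its extra return values (indices, inverse, counts), and pd.unique() when you only need the distinct values.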
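
A small sketch of how the two functions differ in practice, assuming only that numpy and pandas are installed. The sample array, its size, and the variable names are made up for illustration, and the timing at the end is just a rough check whose result will depend on your data, dtype, and library versions.

```python
import timeit

import numpy as np
import pandas as pd

values = np.array([3, 1, 3, 2, 1, 2, 3])

# np.unique returns the distinct values in ascending order.
np_result = np.unique(values)

# pd.unique returns the distinct values in order of first appearance.
pd_result = pd.unique(values)

# The extras mentioned in the comments: first-occurrence indices,
# an inverse mapping, and per-value counts.
uniq, first_idx, inverse, counts = np.unique(
    values, return_index=True, return_inverse=True, return_counts=True
)
rebuilt = uniq[inverse]  # reconstructs the original array

print("np.unique:", np_result)
print("pd.unique:", pd_result)
print("first index:", first_idx, "counts:", counts)

# Rough timing on a larger, hypothetical input (no particular result implied).
rng = np.random.default_rng(0)
big = rng.integers(0, 1000, size=200_000)
t_np = timeit.timeit(lambda: np.unique(big), number=3)
t_pd = timeit.timeit(lambda: pd.unique(big), number=3)
print(f"np.unique: {t_np:.4f}s  pd.unique: {t_pd:.4f}s")
```

If you need the pandas result sorted, you can sort it afterwards with np.sort(pd.unique(values)) and time that against np.unique on your own data.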