Why is pd.unique() faster than np.unique()?

Question

I compared pandas.unique() and numpy.unique() and found that the former is actually faster than the latter.
I am not sure whether the speed advantage grows linearly with the input size.

Can anyone please explain why this difference exists, in terms of how the two are implemented? In which cases should I use which?

I don't have a direct answer, as I never dug deep enough, but pandas calls out the speed of its `.unique()` in the documentation itself. https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.unique.html — dozyaustin, Nov 15 '18 at 00:06
`unique` doesn't make special use of numpy multidimensionality. It's a very different kind of operation than sum and multiply. It sorts a 1d array, and then looks for adjacent duplicates. np.lib.arraysetops._unique1d — hpaulj, Nov 15 '18 at 01:32
Also, np.unique offers a lot more than pandas' unique, such as returning the indices where the unique values were found, the inverse indices needed to reconstruct the original array, and the counts of the unique values that were found. — NaN, Nov 15 '18 at 02:56
@hpaulj - as per the documentation pointed out by @dozyaustin, the speed of `pandas` `unique()` comes from not sorting at all (sorting would require a lot of additional memory and time before the first unique element could even be accessed). Rather, the operation uses a hash map to track the elements it has already visited as it iterates through the data. — sophros, Nov 15 '18 at 07:20
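
To make the differences mentioned in the comments concrete, here is a small sketch, assuming NumPy and pandas are installed. The sample array is made up for illustration. It shows the extra outputs np.unique can return and the different ordering of the two results.

```python
import numpy as np
import pandas as pd

data = np.array([3, 1, 3, 2, 1])  # made-up sample data

# np.unique returns the unique values in sorted order and can also
# report where each value first occurs, how to rebuild the input,
# and how often each value appears.
values, first_index, inverse, counts = np.unique(
    data, return_index=True, return_inverse=True, return_counts=True
)
print(values)       # [1 2 3]
print(first_index)  # [1 3 0]
print(counts)       # [2 1 2]

# The inverse indices reconstruct the original array.
rebuilt = values[inverse]
print(rebuilt)      # [3 1 3 2 1]

# pd.unique does not sort: values come back in order of appearance.
pandas_values = pd.unique(data)
print(pandas_values)  # [3 1 2]
```

If you need sorted output, counts, or inverse indices, np.unique gives them directly; if you only need the distinct values, pd.unique is enough.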

score 4 · Answer 1 · answered Jun 16 '20 at 06:59

np.unique() treats the data as a plain array: it sorts the values and then scans the sorted result for adjacent duplicates, so every call pays the cost of a sort.

pd.unique(), on the other hand, is hash-table based: it walks through the values once, records each value the first time it is seen, and skips values it has already recorded. It never sorts, which is why it returns the values in order of appearance and is typically faster on long inputs.
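
The two strategies described in the comments (sort, then drop adjacent duplicates, versus tracking seen values in a hash map) can be sketched in plain Python. This is only an illustration of the ideas, not the actual NumPy or pandas source; the function names `unique_by_sorting` and `unique_by_hashing` are hypothetical.

```python
def unique_by_sorting(values):
    """Sort first, then keep each element that differs from its predecessor."""
    result = []
    for item in sorted(values):  # the sort costs O(n log n)
        if not result or item != result[-1]:
            result.append(item)
    return result


def unique_by_hashing(values):
    """Single pass with a hash set; keeps values in order of first appearance."""
    seen = set()
    result = []
    for item in values:  # expected O(n) overall
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


data = [3, 1, 3, 2, 1]  # made-up sample data
print(unique_by_sorting(data))  # [1, 2, 3]
print(unique_by_hashing(data))  # [3, 1, 2]
```

The sort-based version does O(n log n) work, while the hash-based version does expected O(n) work, so the gap between them tends to widen on larger inputs rather than stay constant.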
