How does the top value of pandas describe() work when multiple elements have the highest count?

Question

Context:

I am trying to understand how the top value returned by describe() works in pandas (0.24.2) on Python (3.7.3).

What I have tried so far:

I looked at the documentation of pandas.DataFrame.describe. It states that:

If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.

I am trying to understand exactly which part of the code is responsible for the "arbitrary" output.
I stepped into the code that describe calls in turn. My traceback is as follows:

describe()  #pandas.core.generic
describe_1d()  #pandas.core.generic
describe_categorical_1d()  #pandas.core.generic
value_counts()  #pandas.core.base
value_counts()  #pandas.core.algorithms
_value_counts_arraylike()  #pandas.core.algorithms
# In the above step it uses hash-table, to find keys and their counts
# I am not able to step further, as further implementations are in C.

Sample Trial:

import pandas as pd
sample = pd.Series(["Down","Up","Up","Down"])
sample.describe()["top"]

The above code can give Down or Up randomly, as expected.

Question:

Which method in the traceback contributes to the randomness of the output?
Is the order of the keys obtained from the hash table the reason?

If yes,

-- Shouldn't the same key always have the same hash and be fetched in the same order?

-- How are keys hashed, iterated over (to fetch all keys) and fetched from the hash table?

Any pointer is much appreciated! Thanks in advance :)

@Goyo, thanks for your input. I would like to clarify my understanding of `arbitrary` and `random`. Does arbitrary mean it is a `pseudorandom` value rather than a truly random one? — Sha, Jun 05 '19 at 08:26

w-m · Accepted Answer · 2019-06-05T09:17:33.350

As pointed out above, it gives "Down" arbitrarily, but not randomly. On the same machine with the same Pandas version, running the above code should always yield the same result (although it's not guaranteed by the docs, see comments below).

Let's reproduce what's happening.

Given this series:

abc = pd.Series(list("abcdefghijklmnoppqq"))

The value_counts implementation boils down to this:

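import numpy as np
import pandas._libs.hashtable as htable
# value_count_object is a private helper as of pandas 0.24.2; it may differ in other versions
keys, counts = htable.value_count_object(np.asarray(abc), True)
result = pd.Series(counts, index=keys)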

result:

g    1
e    1
f    1
h    1
o    1
d    1
b    1
q    2
j    1
k    1
i    1
p    2
n    1
l    1
c    1
m    1
a    1
dtype: int64

The order of the result is given by the implementation of the hash table. It is the same for every call.
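
The pre-sort order can also be inspected through the public API. The following is a minimal sketch, assuming that value_counts(sort=False) skips the final sort; the exact order it prints depends on the pandas version (newer releases may report first-occurrence order rather than raw hash-table order), so treat the printed order as illustrative only.

```python
import pandas as pd

abc = pd.Series(list("abcdefghijklmnoppqq"))

# sort=False skips the final sort, so the index order is whatever the
# counting step produced (version-dependent; do not rely on it).
unsorted_counts = abc.value_counts(sort=False)
print(unsorted_counts)

# Repeating the call in the same process gives the same order again.
same_order = list(unsorted_counts.index) == list(abc.value_counts(sort=False).index)
print(same_order)
```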

For more details about the hashing, you could look into the implementation of value_count_object, which calls build_count_table_object, which in turn uses the khash implementation.

After computing the table, the value_counts implementation sorts the results with quicksort. This sort is not stable and, in this specially constructed example, reorders "p" and "q":

result.sort_values(ascending=False)

q    2
p    2
a    1
e    1
f    1
h    1
o    1
d    1
b    1
j    1
m    1
k    1
i    1
n    1
l    1
c    1
g    1
dtype: int64

Thus there are potentially two factors for the ordering: first the hashing, and second the non-stable sort.
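
The effect of sort stability on ties can be sketched with NumPy directly. This is an illustration, not the pandas code path: the key order is copied from the hash-table output shown above, and the counts array is rebuilt by hand. A stable sort keeps tied keys in their input order, while quicksort makes no such promise, so only the stable result is predictable.

```python
import numpy as np

# Key order copied from the hash-table output above (illustrative only).
key_list = list("gefhodbqjkipnlcma")
keys = np.array(key_list)
counts = np.array([2 if k in "pq" else 1 for k in key_list])

# Negate the counts to sort in descending order.
# "mergesort" requests a stable sort: ties keep their input order.
stable_keys = keys[np.argsort(-counts, kind="mergesort")]
# "quicksort" is allowed to reorder ties; its result may vary by platform.
quick_keys = keys[np.argsort(-counts, kind="quicksort")]

print(list(stable_keys[:2]))  # ['q', 'p'], because q precedes p in the input
print(list(quick_keys[:2]))   # 'p' and 'q' in some order
```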

The displayed top value is then just the first entry of the sorted list, in this case, "q".

On my machine, quicksort becomes non-stable at 17 entries, which is why I chose the example above.

We can test the non-stable sort with this direct comparison:

pd.Series(list("abcdefghijklmnoppqq")).describe().top
'q'

pd.Series(list(               "ppqq")).describe().top
'p'
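
If the tie itself matters, one option is to list every value that shares the highest count instead of relying on top. A minimal sketch follows; top_candidates is a hypothetical helper name, not part of pandas, and sorting the tied values is just one way to make the result independent of hash and sort order.

```python
import pandas as pd

def top_candidates(series):
    """Return every value that shares the highest count, sorted."""
    counts = series.value_counts()
    return sorted(counts[counts == counts.max()].index)

print(top_candidates(pd.Series(list("abcdefghijklmnoppqq"))))   # ['p', 'q']
print(top_candidates(pd.Series(["Down", "Up", "Up", "Down"])))  # ['Down', 'Up']
```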

"On the same machine with the same Pandas version, running the above code will always yield the same result." I don't think you should rely on that if the docs don't say it explicitly. Arbitrary doesn't mean deterministic either. — Stop harming Monica, Jun 04 '19 at 16:57
I believe that in practice, using the word "arbitrarily" implies a deterministic result. I'm quite certain that calling `.describe()` on the same data twice and getting different answers would be considered a bug. Also I don't see any source of nondeterminism in the actual code. You are of course correct that the doc doesn't explicitly exclude such behavior and thus you shouldn't rely on it in situations of importance. — w-m, Jun 04 '19 at 17:26
Thanks @w-m, I retried on my system and got a different order for the htable output (`result`). I also understand now that sort_values uses `quicksort`, which is an unstable algorithm. — Sha, Jun 05 '19 at 08:23
