Context:
I am trying to understand how the top attribute of describe() works in Python (3.7.3) with pandas (0.24.2).
Efforts hitherto:
I looked into the documentation of pandas.DataFrame.describe. It states that:
If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.
I am trying to understand exactly which part of the code is responsible for the "arbitrary" output.
I stepped into the code that describe calls in turn. My traceback is as follows:
describe() #pandas.core.generic
describe_1d() #pandas.core.generic
describe_categorical_1d() #pandas.core.generic
value_counts() #pandas.core.base
value_counts() #pandas.core.algorithms
_value_counts_arraylike() #pandas.core.algorithms
# In the above step it uses hash-table, to find keys and their counts
# I am not able to step further, as further implementations are in C.
Sample Trial:
import pandas as pd
sample = pd.Series(["Down","Up","Up","Down"])
sample.describe()["top"]
The above code can give Down or Up randomly, as expected.
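To make the tie explicit, here is a minimal sketch that lists every value sharing the highest count, so that whichever value describe() reports as top can be checked against the tied candidates. The variable names counts and tied are just illustrative; only Series.value_counts() and Series.describe() are pandas calls.

```python
import pandas as pd

sample = pd.Series(["Down", "Up", "Up", "Down"])

# Count the occurrences of each distinct value.
counts = sample.value_counts()

# Collect every value that shares the highest count (the tied candidates).
# Sorting makes this list independent of the order value_counts() returns.
tied = sorted(counts[counts == counts.max()].index)

# Whatever describe() reports as "top" should be one of the tied values.
top = sample.describe()["top"]
print(tied, top)
```

Both values occur twice here, so either one is a valid top according to the documentation quoted above.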
Question:
- Which method in the traceback contributes to the randomness of the output?
Is the order of the keys obtained from the hash table the reason?
If yes,
-- Doesn't the same key have the same hash every time, and so get fetched in the same order?
-- How are keys hashed, iterated over (to fetch all keys), and fetched from the hash table?
Any pointers are much appreciated! Thanks in advance :)