I have a dataset which looks like this:
val
1
1
3
4
6
6
9
...
I can't load it into a pandas DataFrame because of its huge size, so I aggregate the data using Spark to form:
val occurrences
1 2
3 1
4 1
6 2
9 1
...
and load that into a pandas DataFrame. The values in the "val" column never exceed 100, so the aggregated table doesn't take much memory.
My problem is that I can't easily work with this structure, e.g. find the mean or median using pandas, or draw a boxplot with seaborn. I can only do it with explicit formulas I write myself, not with ready-made built-in methods. Is there a pandas structure, or any other approach, that can handle data in this form?
For example:
1,1,3,4,6,6,9
would be:
df = pd.DataFrame({'val': [1,3,4,6,9], "occurrences" : [2,1,1,2,1]})
The median is 4. I'm looking for a method to extract the median directly from the given df.
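For reference, here is a minimal sketch of the kind of hand-written formulas I mean, using the example df above. The helper names weighted_mean and weighted_median are hypothetical (they are not pandas functions), and the sketch assumes every count in "occurrences" is a positive integer.

```python
import numpy as np
import pandas as pd

# Aggregated example: 1,1,3,4,6,6,9
df = pd.DataFrame({"val": [1, 3, 4, 6, 9], "occurrences": [2, 1, 1, 2, 1]})


def weighted_mean(frame):
    # Hypothetical helper: mean of the expanded data, using counts as weights.
    return np.average(frame["val"], weights=frame["occurrences"])


def weighted_median(frame):
    # Hypothetical helper: median of the expanded data without expanding it.
    ordered = frame.sort_values("val")
    vals = ordered["val"].to_numpy()
    cum = ordered["occurrences"].cumsum().to_numpy()
    n = cum[-1]  # total number of observations
    # 0-based positions of the middle element(s) in the expanded, sorted data
    lo_pos, hi_pos = (n - 1) // 2, n // 2
    # First row whose cumulative count exceeds the position holds that element
    lo = vals[np.searchsorted(cum, lo_pos, side="right")]
    hi = vals[np.searchsorted(cum, hi_pos, side="right")]
    return (lo + hi) / 2


print(weighted_mean(df))
print(weighted_median(df))
```

This works, but I'd have to write something similar for every statistic (quantiles, variance, boxplot whiskers), which is exactly what I'd like to avoid.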