How to work with aggregated data in pandas?

Question

I have a dataset which looks like this:

I can't load it into a pandas DataFrame because of its huge size, so I aggregate the data using Spark to form:

   val   occurrences
   1     2
   3     1
   4     1
   6     2
   9     1
   ...

and load that into a pandas DataFrame. The "val" column never exceeds 100, so it doesn't take much memory.

My problem is that I can't easily operate on such a structure, e.g. find the mean or median using pandas, or plot a boxplot with seaborn. I can only do it with explicit formulas I write myself, not with the ready-made built-in methods. Is there a pandas structure, or any other way, to cope with such data?

For example:

1,1,3,4,6,6,9

would be:

df = pd.DataFrame({'val': [1,3,4,6,9], "occurrences" : [2,1,1,2,1]})

The median is 4. I'm looking for a method to extract the median directly from the given df.

Could you please elaborate on what your input and output dataframes are? Additionally, you can refer to [pandas.DataFrame.mean](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html) and [pandas.DataFrame.boxplot](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.boxplot.html) — nandneo, Sep 18 '18 at 16:32

score 1 · Answer 1 · answered Sep 18 '18 at 16:36

No, pandas does not operate on such objects the way you would expect. In other answers on Stack Overflow, even computing a median for that table structure takes at least a few lines of code.

If you wanted to make your own seaborn hooks/wrappers, a good place to start would probably be an efficient percentiles(df, p) method. The median is then just percentiles(df, [50]). A box plot would just be percentiles(df, [0, 25, 50, 75, 100]), and so on. Your development time could then be fairly minimal (depending on how complicated the statistics you need are).
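As a rough illustration of that idea, here is a minimal sketch of what a percentiles(df, p) helper might look like. It is not a pandas or seaborn API; the parameter names val_col and count_col are assumptions made for this example. It also assumes every count is a positive integer, and it uses linear interpolation between neighbouring ranks (the same default as np.percentile), so it never has to expand the table back into the raw data.

```python
import numpy as np
import pandas as pd


def percentiles(df, p, val_col="val", count_col="occurrences"):
    """Percentiles (0-100) of the data that df summarises as value/count pairs.

    Hypothetical helper: assumes positive integer counts and interpolates
    linearly between neighbouring ranks, like np.percentile's default.
    """
    d = df.sort_values(val_col)
    vals = d[val_col].to_numpy(dtype=float)
    cum = d[count_col].to_numpy().cumsum()  # cum[i] = number of samples <= vals[i]
    n = cum[-1]

    # Fractional 0-based position of each percentile in the sorted raw data
    pos = (n - 1) * np.asarray(p, dtype=float) / 100.0
    lo = np.floor(pos).astype(np.int64)
    hi = np.ceil(pos).astype(np.int64)

    # The sample at 0-based rank k is the first value whose cumulative count exceeds k
    v_lo = vals[np.searchsorted(cum, lo, side="right")]
    v_hi = vals[np.searchsorted(cum, hi, side="right")]
    return v_lo + (v_hi - v_lo) * (pos - lo)


df = pd.DataFrame({"val": [1, 3, 4, 6, 9], "occurrences": [2, 1, 1, 2, 1]})
print(percentiles(df, [50]))                  # median -> [4.]
print(percentiles(df, [0, 25, 50, 75, 100]))  # box plot numbers -> [1. 2. 4. 6. 9.]
print(np.average(df["val"], weights=df["occurrences"]))  # weighted mean, 30/7
```

The mean does not need the helper at all, since np.average accepts the counts as weights. For plotting, one option is to pass the five numbers to matplotlib's Axes.bxp, which draws a box plot from precomputed statistics instead of raw samples.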
