I have a dataset which looks like this:

   val
   1
   1
   3
   4
   6
   6
   9
   ...

I can't load it into a pandas DataFrame because of its huge size, so I aggregate the data using Spark into:

   val   occurrences
   1     2
   3     1
   4     1
   6     2
   9     1
   ...

and load that into a pandas DataFrame. The values in the "val" column never exceed 100, so it doesn't take much memory.

My problem is that I can't easily operate on such a structure, e.g. find the mean or median using pandas, or plot a boxplot with seaborn. I can only do it with explicit formulas I write myself, not with ready-made built-in methods. Is there a pandas structure, or any other way, that can cope with such data?

For example:

1,1,3,4,6,6,9

would be:

df = pd.DataFrame({'val': [1,3,4,6,9], "occurrences" : [2,1,1,2,1]})

The median is 4. I'm looking for a method to extract the median directly from the given df.

kubapok
  • df.val.value_counts().reset_index() – BENY Sep 18 '18 at 15:45
  • Could you please elaborate on what your input and output dataframes are? Additionally, you can refer to [pandas.DataFrame.mean](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html) and [pandas.DataFrame.boxplot](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.boxplot.html) – nandneo Sep 18 '18 at 16:32

1 Answer

No, pandas does not operate on such objects the way you would expect. Going by other answers on StackOverflow, even computing a median for that table structure takes at least a few lines of code.
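To give a sense of what "a few lines" can look like, here is a minimal sketch for the mean and the median of the example table from the question. It assumes the column names "val" and "occurrences" used there, and it takes the median to be the first value whose running count reaches half of the total. For an even number of observations that is the lower of the two middle values, not their average.

```python
import numpy as np
import pandas as pd

# Example table from the question: 1,1,3,4,6,6,9 in aggregated form.
df = pd.DataFrame({"val": [1, 3, 4, 6, 9], "occurrences": [2, 1, 1, 2, 1]})

# Weighted mean: each value counts as many times as it occurs.
mean = np.average(df["val"], weights=df["occurrences"])

# Median: sort by value, accumulate the counts, and take the first
# value whose running count reaches half of the total count.
ordered = df.sort_values("val")
running = ordered["occurrences"].cumsum()
median = ordered.loc[running >= running.iloc[-1] / 2, "val"].iloc[0]

print(mean, median)
```

Neither step expands the table back into the original rows, so memory use stays proportional to the number of distinct values.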

If you wanted to make your own seaborn hooks/wrappers, a good place to start would probably be an efficient percentiles(df, p) function. The median is then just percentiles(df, [50]). A box plot would just need percentiles(df, [0, 25, 50, 75, 100]), and so on. Your development time could then be fairly minimal, depending on how complicated the statistics you need are.
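One possible shape for such a percentiles(df, p) helper is sketched below. This is only an illustration under stated assumptions: the parameter names val_col and count_col are hypothetical, and the percentile rule is the simple "inverse CDF" one (the smallest value whose cumulative count reaches p percent of the total), so it does not interpolate between values the way the default of numpy.percentile does.

```python
import numpy as np
import pandas as pd

def percentiles(df, p, val_col="val", count_col="occurrences"):
    """Percentiles of a value/count table without expanding it.

    Uses the inverse-CDF rule: for each p, return the smallest value
    whose cumulative count is at least p percent of the total count.
    """
    ordered = df.sort_values(val_col)
    values = ordered[val_col].to_numpy()
    cum_counts = ordered[count_col].to_numpy().cumsum()
    # Number of observations that must lie at or below each percentile.
    targets = np.asarray(p, dtype=float) / 100.0 * cum_counts[-1]
    # First position where the cumulative count reaches the target.
    idx = np.searchsorted(cum_counts, targets, side="left")
    # Guard against floating-point overshoot at p = 100.
    idx = np.clip(idx, 0, len(values) - 1)
    return values[idx]

# Example table from the question: 1,1,3,4,6,6,9 in aggregated form.
df = pd.DataFrame({"val": [1, 3, 4, 6, 9], "occurrences": [2, 1, 1, 2, 1]})

median = percentiles(df, [50])[0]
five_numbers = percentiles(df, [0, 25, 50, 75, 100])
print(median, five_numbers)
```

The five numbers are enough to describe a box plot. Matplotlib's Axes.bxp draws boxes from precomputed statistics, so it may be a more convenient target for a wrapper than seaborn's boxplot, which expects the raw observations.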

Hans Musgrave