0

Assume a big set of data like

   Height (m)  My data
0          18      5.0
1          25      6.0
2          10      1.0
3          13      1.5
4          32      8.0
5          26      6.7
6          23      5.0
7           5      2.0
8           7      2.0

And I want to plot the average (and, if possible, the standard deviation) of "My data" as a function of height, separated in the range [0,5),[5,10),[10,15) and so on.

Any idea? I've tried different approaches and none of them work

Peter Leimbigler
  • 10,775
  • 1
  • 23
  • 37
MatMorPau22
  • 346
  • 2
  • 3
  • 18
  • 1
    Take a moment to check your formatting, this is raw html, not data. After that, have you tried groupby and cut/qcut? – G. Anderson Mar 28 '19 at 21:41
  • 3
    see [this post](https://stackoverflow.com/a/20159305/2327328) about making good pandas examples – philshem Mar 28 '19 at 21:41
  • You can create a column like df[['height_bucket']] = df[['Height (m)']]//5 and then use groupby on this column? – Vasilis D Mar 28 '19 at 21:42
  • @G.Anderson I know, it's just to show the example. Otherwise, I must upload a 110 Mb flie – MatMorPau22 Mar 28 '19 at 21:43
  • I think @G.Anderson means an example of your python/pandas code – philshem Mar 28 '19 at 21:49
  • 1
    @MatMorPau22, I hope you don't mind my edit of your question: I used `pd.read_html()` to convert the raw HTML that you posted into a plaintext representation of your example data. – Peter Leimbigler Mar 28 '19 at 21:50

2 Answers2

2

If I understand you correctly:

# Precompute bins for pd.cut
bins = list(range(0, df['Height (m)'].max() + 5, 5))

# Cut Height into intervals which exclude the right endpoint, 
# with bin edges at multiples of 5
df['HeightBin'] = pd.cut(df['Height (m)'], bins=bins, right=False)

# Within each bin, get mean, stdev (normalized by N-1 by default),
# and also show sample size to explain why some std values are NaN
df.groupby('HeightBin')['My data'].agg(['mean', 'std', 'count'])
            mean       std  count
HeightBin
[0, 5)       NaN       NaN      0
[5, 10)     2.00  0.000000      2
[10, 15)    1.25  0.353553      2
[15, 20)    5.00       NaN      1
[20, 25)    5.00       NaN      1
[25, 30)    6.35  0.494975      2
[30, 35)    8.00       NaN      1
Peter Leimbigler
  • 10,775
  • 1
  • 23
  • 37
1

If I understand correctly, this is what you would like to do:

import pandas as pd
import numpy as np

bins = np.arange(0, 30, 5) # adjust as desired

df_stats = pd.DataFrame(columns=['mean', 'st_dev']) # DataFrame for the results
df_stats['mean'] = df.groupby(pd.cut(df['Height (m)'], bins, right=False)).mean()['My data']
df_stats['st_dev'] = df.groupby(pd.cut(df['Height (m)'], bins, right=False)).std()['My data']
bexi
  • 1,186
  • 5
  • 9