46

Given the following dataframe in pandas:

import numpy as np
df = pandas.DataFrame({"a": np.random.random(100), "b": np.random.random(100), "id": np.arange(100)})

where id is an id for each point consisting of an a and b value, how can I bin a and b into a specified set of bins (so that I can then take the median/average value of a and b in each bin)? df might have NaN values for a or b (or both) for any given row in df.

Here's a better example using Joe Kington's solution with a more realistic df. The thing I'm unsure about is how to access the df.b elements for each df.a group below:

a = np.random.random(20)
df = pandas.DataFrame({"a": a, "b": a + 10})
# bins for df.a
bins = np.linspace(0, 1, 10)
# bin df according to a
groups = df.groupby(np.digitize(df.a,bins))
# Get the mean of a in each group
print groups.mean()
## But how to get the mean of b for each group of a?
# ...
Milan
  • 1,743
  • 2
  • 13
  • 36

4 Answers4

63

There may be a more efficient way (I have a feeling pandas.crosstab would be useful here), but here's how I'd do it:

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100),
                       "b": np.random.random(100),
                       "id": np.arange(100)})

# Bin the data frame by "a" with 10 bins...
bins = np.linspace(df.a.min(), df.a.max(), 10)
groups = df.groupby(np.digitize(df.a, bins))

# Get the mean of each bin:
print groups.mean() # Also could do "groups.aggregate(np.mean)"

# Similarly, the median:
print groups.median()

# Apply some arbitrary function to aggregate binned data
print groups.aggregate(lambda x: np.mean(x[x > 0.5]))

Edit: As the OP was asking specifically for just the means of b binned by the values in a, just do

groups.mean().b

Also if you wanted the index to look nicer (e.g. display intervals as the index), as they do in @bdiamante's example, use pandas.cut instead of numpy.digitize. (Kudos to bidamante. I didn't realize pandas.cut existed.)

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100), 
                       "b": np.random.random(100) + 10})

# Bin the data frame by "a" with 10 bins...
bins = np.linspace(df.a.min(), df.a.max(), 10)
groups = df.groupby(pandas.cut(df.a, bins))

# Get the mean of b, binned by the values in a
print groups.mean().b

This results in:

a
(0.00186, 0.111]    10.421839
(0.111, 0.22]       10.427540
(0.22, 0.33]        10.538932
(0.33, 0.439]       10.445085
(0.439, 0.548]      10.313612
(0.548, 0.658]      10.319387
(0.658, 0.767]      10.367444
(0.767, 0.876]      10.469655
(0.876, 0.986]      10.571008
Name: b
Joe Kington
  • 275,208
  • 71
  • 604
  • 463
  • excellent and elegant! exactly what I was looking for. does not require sorting of the dataframe at all. –  Jun 05 '13 at 21:56
  • what if you wanted to access the `b` values based on the groups? `groups.mean()` gives you the means for just `a`, I believe. –  Jun 05 '13 at 22:05
  • @user248237dfsf - No, it gives the mean for both `a` and `b` (or rather, it gives the mean of `b` binned by the values in `a`, which is what I thought you were asking for). – Joe Kington Jun 05 '13 at 23:32
  • 1
    `groups.mean()` returns a `DataFrame` so you could do `groups.mean()["b"]` to access just the means for `b` binned by `a`. – bdiamante Jun 06 '13 at 13:59
27

Not 100% sure if this is what you're looking for, but here's what I think you're getting at:

In [144]: df = DataFrame({"a": np.random.random(100), "b": np.random.random(100), "id":   np.arange(100)})

In [145]: bins = [0, .25, .5, .75, 1]

In [146]: a_bins = df.a.groupby(cut(df.a,bins))

In [147]: b_bins = df.b.groupby(cut(df.b,bins))

In [148]: a_bins.agg([mean,median])
Out[148]:
                 mean    median
a
(0, 0.25]    0.124173  0.114613
(0.25, 0.5]  0.367703  0.358866
(0.5, 0.75]  0.624251  0.626730
(0.75, 1]    0.875395  0.869843

In [149]: b_bins.agg([mean,median])
Out[149]:
                 mean    median
b
(0, 0.25]    0.147936  0.166900
(0.25, 0.5]  0.394918  0.386729
(0.5, 0.75]  0.636111  0.655247
(0.75, 1]    0.851227  0.838805

Of course, I don't know what bins you had in mind, so you'll have to swap mine out for your circumstance.

bdiamante
  • 15,980
  • 6
  • 40
  • 46
  • Nice! I assumed the OP wanted to bin "b" by "a", but in retrospect, your answer is probably what they were looking for. I'll leave mine up, as our answers do slightly different things. – Joe Kington Jun 05 '13 at 20:47
  • 1
    Maybe worth to mention that it is `pandas.Dataframe({..})` and `a_bins.agg([numpy.mean,numpy.median])` – Guido Jan 25 '16 at 08:49
16

Joe Kington's answer was very helpful, however, I noticed that it does not bin all of the data. It actually leaves out the row with a = a.min(). Summing up groups.size() gave 99 instead of 100.

To guarantee that all data is binned, just pass in the number of bins to cut() and that function will automatically pad the first[last] bin by 0.1% to ensure all data is included.

df = pandas.DataFrame({"a": np.random.random(100), 
                    "b": np.random.random(100) + 10})

# Bin the data frame by "a" with 10 bins...
groups = df.groupby(pandas.cut(df.a, 10))

# Get the mean of b, binned by the values in a
print(groups.mean().b)

In this case, summing up groups.size() gave 100.

I know this is a picky point for this particular problem, but for a similar problem I was trying to solve, it was crucial to obtain the correct answer.

Perk
  • 161
  • 1
  • 2
2

If you do not have to stick to pandas grouping, you could use scipy.stats.binned_statistic:

from scipy.stats import binned_statistic

means = binned_statistic(df.a, df.b, bins=np.linspace(min(df.a), max(df.a), 10))
bio
  • 501
  • 1
  • 5
  • 16