
I'm starting to learn Pandas and am trying to find the most Pythonic (or panda-thonic?) ways to do certain tasks.

Suppose we have a DataFrame with columns A, B, and C.

  • Column A contains boolean values: each row's A value is either true or false.
  • Column B has some important values we want to plot.

What we want to discover is the subtle differences between the B values for rows where A is false and the B values for rows where A is true.

In other words, how can I group by the value of column A (either true or false), then plot the values of column B for both groups on the same graph? The two groups should be colored differently so that their points can be told apart.


Next, let's add another feature to this program: before graphing, we want to compute another value for each row and store it in column D. This value is the mean of all B values recorded in the five minutes before a given record, counting only rows that have the same boolean value in A as that record.

In other words, if I have a row where A=True and time=t, I want to compute a value for column D that is the mean of B for all records from time t-5 to t that have the same A=True.

In this case, how can we execute the groupby on values of A, then apply this computation to each individual group, and finally plot the D values for the two groups?

Maxim Zaslavsky
  • Do you have some example dataframes? It seems like you can do something like saving the groupby object in a variable: `grouped = df.groupby('A')`, then do a for-loop to plot: `for g, d in grouped: plot(d['B'], color=g)`. More or less the same thing for the second question, where you can use pandas `rolling_mean` to create the new column D. – herrfz Mar 17 '13 at 20:26

1 Answer


I think @herrfz hit all the high points. I'll just flesh out the details:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

sin = np.sin
cos = np.cos
pi = np.pi
N = 100

x = np.linspace(0, pi, N)
a = sin(x)
b = cos(x)

df = pd.DataFrame({
    'A': [True]*N + [False]*N,
    'B': np.hstack((a,b))
    })

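# Group by 'A' rather than ['A']: recent pandas yields 1-tuple keys for a list.
for key, grp in df.groupby('A'):
    plt.plot(grp['B'], label=key)
    # pd.rolling_mean was removed from pandas; use Series.rolling(...).mean().
    # Note that window=5 averages over 5 rows, not 5 minutes.
    grp = grp.assign(D=grp['B'].rolling(window=5).mean())
    plt.plot(grp['D'], label='rolling ({k})'.format(k=key))
plt.legend(loc='best')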
plt.show()
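
The example data has no timestamps, so `window=5` means five rows. The question asks for a five-minute window instead. Here is a rough sketch of one way to do that, assuming the frame also has a datetime column. The `time` column, the sample data and the helper name `add_rolling_mean` are all made up for illustration and do not come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one reading per minute, with A alternating True/False.
N = 20
df = pd.DataFrame({
    'time': pd.date_range('2013-03-17 20:00', periods=N, freq='min'),
    'A': [True, False] * (N // 2),
    'B': np.arange(N, dtype=float),
})


def add_rolling_mean(frame, window='5min'):
    """Return a copy with column D: the time-windowed mean of B within each A group."""
    # Offset-based windows such as '5min' need a sorted DatetimeIndex.
    indexed = frame.set_index('time').sort_index()
    # transform() runs the rolling mean inside each group and returns the
    # results aligned with the original rows.
    d = indexed.groupby('A')['B'].transform(lambda s: s.rolling(window).mean())
    return indexed.assign(D=d).reset_index()


result = add_rolling_mean(df)
print(result.head(12))
```

By default an offset window covers the interval (t - 5min, t], so a row exactly five minutes old is left out; passing `closed='both'` to `rolling` would include it. Plotting then works as in the loop above, for example with `grp.plot(x='time', y='D')` per group.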


unutbu
  • This is perfect! Could you touch on how to implement more customized computations for the D column, if say I wanted to make some specialized computation that isn't covered by a built-in "rolling" Pandas function? Thanks. (@herrfz) – Maxim Zaslavsky Mar 18 '13 at 07:06
  • `rolling_mean` is just one of [many rolling functions in Pandas](http://pandas.pydata.org/pandas-docs/stable/computation.html#moving-rolling-statistics-moments). To define a custom rolling function, use `rolling_apply`. There is an example on the linked page. – unutbu Mar 18 '13 at 10:42
  • Thanks. I'm having trouble adapting that example to what I'm trying to accomplish, so I asked another question here: http://stackoverflow.com/questions/15487022/customizing-rolling-apply-function-in-python-pandas – Maxim Zaslavsky Mar 18 '13 at 21:17
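
For readers on a current pandas release, where the top-level `rolling_apply` no longer exists, a custom per-group rolling computation might look like the sketch below. The statistic (`window_range`, the max minus the min of each window) and the sample values are invented purely for illustration; any function that reduces an array to a single number could take its place.

```python
import numpy as np
import pandas as pd


def window_range(values):
    """Custom statistic: spread (max - min) of the values in one window."""
    return values.max() - values.min()


# Made-up sample data: six rows per group.
df = pd.DataFrame({
    'A': [True] * 6 + [False] * 6,
    'B': [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 10.0, 10.0, 12.0, 11.0, 15.0, 13.0],
})

# Within each A group, slide a 3-row window over B and apply the custom
# function. raw=True hands the function a NumPy array instead of a Series.
df['D'] = df.groupby('A')['B'].transform(
    lambda s: s.rolling(window=3).apply(window_range, raw=True)
)
print(df)
```

The first two rows of each group come out as NaN because a full 3-row window is not yet available; `min_periods=1` on `rolling` would relax that.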