How to iterate over a pandas dataframe and compare certain columns based on a third column?

Question

I'm new to pandas and am having trouble putting its features to good use.

I have a large dataframe with experimental data for two different tests which I'd like to compare. Ideally, the data is displayed in a plot.

## what I have:
import pandas as pd

ids = [
    'Bob','Bob',
    'John', 'John',
    'Mary', 'Mary',
    ]
var = [
    'a', 'b',
    'a', 'b',
    'a', 'b',
    ]
data = [
    10,11,
    15,14,
    10,15
    ]
dataset = list(zip(ids, var, data))  # list() so the rows can be printed and reused in Python 3
print(dataset)

columns = ['ids', 'var', 'data']
df = pd.DataFrame(data=dataset, columns=columns)
print(df)

## what I want:
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator

fig = plt.figure()
ax1 = fig.add_subplot(111)
for i,ii in enumerate(ids):
    # integer division so both tests of one person share an x position
    if var[i] == 'a':
        ax1.plot(i//2, data[i], 'rs', label='var a')
    else:
        ax1.plot((i-1)//2, data[i], 'bo', label='var b')
majorLocator = MultipleLocator(1)
ax1.xaxis.set_major_locator(majorLocator)
ax1.grid()
ax1.margins(0.05)
ax1.set_xlabel('ids')
ax1.set_ylabel('data')
ax1.legend(loc='best', numpoints=1)
fig.show()

How can I do this properly without many many nested for loops? A plus would be if I could use the ids as the xlabels...

Thanks a lot in advance, Daniel

I'm confused about what exactly you want the plot to show. – abcd Apr 17 '15 at 17:53

score 1 · Answer 1 · edited May 23 '17 at 11:43

I'm not quite sure what your end goal is, but if cphlewis's suggestion to go with seaborn isn't what you were looking for, you might try building a DataFrame with a MultiIndex instead and plotting that.

mi = pd.DataFrame(data=data,index=[ids,var],columns=['data'])
f, a = plt.subplots()
mi.plot(kind='bar',ax=a)

(image: multiindex plot results)

It might also be helpful to reference this post.
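
As a possible extension of the MultiIndex idea, the inner index level can be moved into columns so that each id gets one row, with the two tests side by side. The sketch below is only one way to do it: the column name `b_minus_a` and the headless `Agg` backend are assumptions made for illustration and are not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; remove for interactive use
import pandas as pd

ids = ['Bob', 'Bob', 'John', 'John', 'Mary', 'Mary']
var = ['a', 'b', 'a', 'b', 'a', 'b']
data = [10, 11, 15, 14, 10, 15]

# Same MultiIndex frame as in the answer
mi = pd.DataFrame(data=data, index=[ids, var], columns=['data'])

# Move the inner level ('a'/'b') into columns: one row per id
wide = mi['data'].unstack()

# Compare the two tests per id without an explicit loop
# ('b_minus_a' is a hypothetical column name)
wide['b_minus_a'] = wide['b'] - wide['a']
print(wide)

# Grouped bars, with the ids used as x tick labels
ax = wide[['a', 'b']].plot(kind='bar', rot=0)
ax.set_xlabel('ids')
ax.set_ylabel('data')
```

With an interactive backend, calling `matplotlib.pyplot.show()` afterwards should display the figure.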

answered Apr 18 '15 at 00:01 by andrewgcross

This definitely looks interesting. If not now, it could come in handy later, thanks! – damada Apr 18 '15 at 19:32

cphlewis · Accepted Answer · 2015-04-18T19:47:37.713

0

seaborn does a lot of this for you, very flexibly:

import seaborn as sns
# factorplot was renamed catplot in seaborn 0.9; x and y are passed by keyword
sns.catplot(x='ids', y='data', hue='var', kind='bar', data=df)

(it also restyles the plotting defaults, which can be changed or reset).
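
A minimal sketch of changing and resetting the style, assuming seaborn and matplotlib are installed. Note that since seaborn 0.8 the import alone no longer changes the defaults, so the style is applied explicitly here; the choice of `'whitegrid'` and the headless `Agg` backend are assumptions for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; remove for interactive use
import matplotlib as mpl
import seaborn as sns

# Remember one matplotlib default that the seaborn style touches
grid_before = mpl.rcParams['axes.grid']

# Change: apply a seaborn style explicitly ('whitegrid' turns the grid on)
sns.set_style('whitegrid')
grid_styled = mpl.rcParams['axes.grid']

# Reset: restore the rc parameters that were active when seaborn was imported
sns.reset_orig()
grid_after = mpl.rcParams['axes.grid']

print(grid_before, grid_styled, grid_after)
```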

If you want to subset the data, pass the subset as the data argument:

sns.catplot(x='ids', y='data', hue='var', kind='bar',
            data=df[df['ids'].isin(['Bob', 'Mary'])])

The plot for this filtered example was made with the sns style turned off.

For any more complicated mask, you'd set up the mask separately; see the pandas docs.
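
A separately built mask might look roughly like the sketch below. The third variable `'c'`, the sample values, and the threshold of 11 are made up for illustration; only the pattern of combining boolean Series is the point.

```python
import pandas as pd

# Hypothetical data with a third variable 'c' that should be left out
df = pd.DataFrame({
    'ids':  ['Bob', 'Bob', 'Bob', 'John', 'John', 'John'],
    'var':  ['a', 'b', 'c', 'a', 'b', 'c'],
    'data': [10, 11, 12, 15, 14, 13],
})

# Build each condition as its own boolean Series
keep_vars = df['var'].isin(['a', 'b'])
big_enough = df['data'] >= 11  # hypothetical threshold

# Combine with & (and), | (or), ~ (not), then index the frame once
mask = keep_vars & big_enough
subset = df[mask]
print(subset)
```

The resulting `subset` can then be passed as the `data` argument of the plotting call in place of `df`.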

answered Apr 17 '15 at 20:34 by cphlewis; edited Apr 18 '15 at 19:47

This seems like exactly what I need and want plotwise! Now, how can I combine this with the data filtering capabilities of pandas? For example, I only want to plot vars a and b, not c. Would I need to change the dataframe before plotting? – damada Apr 18 '15 at 19:29
yes, seaborn and pandas work very well together -- see http://pandas.pydata.org/pandas-docs/stable/indexing.html for MANY MANY ways to subset `df`. – cphlewis Apr 18 '15 at 19:44
(and I put a filtered example into the answer) – cphlewis Apr 18 '15 at 21:28
Fantastic! I've just realized my Pandas module is outdated, too, thanks to your example. – damada Apr 20 '15 at 09:52
