Data

Below is the data frame I wish to represent as a histogram, with each row as a point. The result won't be interesting, since this data gives three bins of equal size, but that's OK for now, so read on!

>>> outer_df
  patient                         cell  product
0   Pat_1               22RV1_PROSTATE       12
1   Pat_1               DU145_PROSTATE       15
2   Pat_1  LN18_CENTRAL_NERVOUS_SYSTEM        9
3   Pat_2               22RV1_PROSTATE       12
4   Pat_2               DU145_PROSTATE       15
5   Pat_2  LN18_CENTRAL_NERVOUS_SYSTEM        9
6   Pat_3               22RV1_PROSTATE       12
7   Pat_3               DU145_PROSTATE       15
8   Pat_3  LN18_CENTRAL_NERVOUS_SYSTEM        9
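
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the frame printed above. The construction itself is my assumption (the real frame came from elsewhere); only the column names and values are taken from the output.

```python
import pandas as pd

# Values copied from the printed frame: three patients, three cell lines each.
cells = ["22RV1_PROSTATE", "DU145_PROSTATE", "LN18_CENTRAL_NERVOUS_SYSTEM"]
products = [12, 15, 9]
patients = ["Pat_1", "Pat_2", "Pat_3"]

outer_df = pd.DataFrame(
    {
        # Each patient name is repeated once per cell line.
        "patient": [p for p in patients for _ in cells],
        "cell": cells * len(patients),
        "product": products * len(patients),
    },
    columns=["patient", "cell", "product"],  # keep the printed column order
)
print(outer_df)
```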

Desired Result

Graph each row as a point on a histogram, but also be able to pick out a particular subset of the data and graph it as an overlaid histogram (e.g. all points from all cells would be in purple below, those belonging to just DU145_PROSTATE in red, and those from 22RV1_PROSTATE in blue). I've illustrated this with a graphic from the pandas docs:

Overlaid histogram, with three distributions (I only need 2)

Attempt 1

I first tried the hist method for DataFrames, but got an error and a blank 4x4 grid of histograms.

>>> outer_df.hist()
Traceback (most recent call last):
  File "/usr/lib/python3.3/code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/pandas/tools/plotting.py", line 1977, in hist_frame
    ax.hist(data[col].dropna().values, **kwds)
  File "/usr/lib/python3/dist-packages/matplotlib/axes.py", line 8099, in hist
    xmin = min(xmin, xi.min())
TypeError: unorderable types: str() < float()
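
The last frame of the traceback suggests that a string column (patient or cell) reached matplotlib, which then could not compare str with float. A minimal sketch of a workaround, assuming that reading is right, is to histogram only the numeric column. The Agg backend line is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this in an interactive session
import pandas as pd

cells = ["22RV1_PROSTATE", "DU145_PROSTATE", "LN18_CENTRAL_NERVOUS_SYSTEM"]
outer_df = pd.DataFrame({
    "patient": [p for p in ("Pat_1", "Pat_2", "Pat_3") for _ in cells],
    "cell": cells * 3,
    "product": [12, 15, 9] * 3,
})

# Select the numeric column first, so no strings are passed to matplotlib.
ax = outer_df["product"].hist()
ax.set_xlabel("product")
ax.set_ylabel("count")
```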

Attempt 2

Realizing that DataFrame.hist() "plots the histograms of the columns on multiple subplots", I moved away from it and tried outer_df.plot(kind='hist', stacked=True). Even though I took this directly from the docs, I'm stuck on this error:

>>> outer_df.plot(kind='hist', stacked=True)
Traceback (most recent call last):
  File "/usr/lib/python3.3/code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/pandas/tools/plotting.py", line 1612, in plot_frame
    raise ValueError('Invalid chart type given %s' % kind)
ValueError: Invalid chart type given hist

Attempt 3 -- courtesy of @816

>>> outer_df.set_index(['patient', 'cell']).unstack('cell').plot(kind='hist', stacked=True)
Traceback (most recent call last):
  File "/usr/lib/python3.3/code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/pandas/tools/plotting.py", line 1612, in plot_frame
    raise ValueError('Invalid chart type given %s' % kind)
ValueError: Invalid chart type given hist
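
As far as I can tell from the pandas docs, kind='hist' was only added to plot() in pandas 0.15.0, so "Invalid chart type" may simply mean the installed pandas is older than the docs being read. A sketch that sidesteps the question by calling matplotlib directly is below. The bin edges, colors, and alpha are my own choices for illustration, and the Agg line is only for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this in an interactive session
import matplotlib.pyplot as plt
import pandas as pd

cells = ["22RV1_PROSTATE", "DU145_PROSTATE", "LN18_CENTRAL_NERVOUS_SYSTEM"]
outer_df = pd.DataFrame({
    "patient": [p for p in ("Pat_1", "Pat_2", "Pat_3") for _ in cells],
    "cell": cells * 3,
    "product": [12, 15, 9] * 3,
})

# The two distributions to overlay: every row, and one cell line only.
all_products = outer_df["product"]
du145 = outer_df.loc[outer_df["cell"] == "DU145_PROSTATE", "product"]

# Shared bin edges (8, 9, ..., 17) so both histograms line up bar for bar.
bins = list(range(8, 18))

fig, ax = plt.subplots()
ax.hist(all_products.values, bins=bins, alpha=0.5, color="purple", label="all cells")
ax.hist(du145.values, bins=bins, alpha=0.5, color="red", label="DU145_PROSTATE")
ax.legend()
fig.savefig("overlaid_hist.png")
```

Passing the same bins to both calls matters here: with default bins each subset would get its own edges, and the overlay would be hard to read.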
Thomas Matthew

2 Answers

1

How about this using the groupby method:

hist_data = { cell: outer_df.ix[inds,'product'] for cell,inds in outer_df.groupby('cell').groups.iteritems() }

Each value in the dict is a Series, corresponding to the cell group. Next, iterate over the cell groups, plotting histograms each time:

for cell in hist_data:
    hist_data[cell].hist(label=cell)
# pylab.legend()  # uncomment (after "import pylab") so the legend shows
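
On Python 3, dicts have no iteritems(), so the comprehension above will raise an AttributeError as written. A minimal sketch of the same idea that avoids .groups entirely is to iterate over the groupby object, which yields (key, sub-frame) pairs. The alpha value and the shared axes are my own additions, and the Agg line is only for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this in an interactive session
import matplotlib.pyplot as plt
import pandas as pd

cells = ["22RV1_PROSTATE", "DU145_PROSTATE", "LN18_CENTRAL_NERVOUS_SYSTEM"]
outer_df = pd.DataFrame({
    "patient": [p for p in ("Pat_1", "Pat_2", "Pat_3") for _ in cells],
    "cell": cells * 3,
    "product": [12, 15, 9] * 3,
})

# Iterating a groupby yields (group key, sub-DataFrame) pairs.
hist_data = {cell: group["product"] for cell, group in outer_df.groupby("cell")}

# Draw every cell group on the same axes so the histograms overlay.
fig, ax = plt.subplots()
for cell, series in hist_data.items():
    series.hist(ax=ax, alpha=0.5, label=cell)
ax.legend()  # the labels only appear once legend() is called
```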
dermen
  • When I attempt the dict comprehension above, I get `AttributeError: 'dict' object has no attribute 'iteritems'` – Thomas Matthew Jul 29 '15 at 03:25
  • Did you include the parentheses, `iteritems()`? – dermen Jul 29 '15 at 03:29
  • `hist_data = {cell: outer_df.ix[inds,'product'] for cell,inds in outer_df.groupby('cell').groups.iteritems()}` That is copied from the interpreter. – Thomas Matthew Jul 29 '15 at 03:31
  • Try just `items()` – dermen Jul 29 '15 at 03:32
  • I think maybe your version of pandas groupby objects might not have these dict attributes (they do on version 0.16). In any case, the info is definitely still there; it's just a matter of reading the docs. – dermen Jul 29 '15 at 03:33
  • You could also try `gb = outer_df.groupby('cell').groups` and then do `items = zip(gb.keys(), gb.values())` and `hist_data = {cell: outer_df.ix[inds,'product'] for cell,inds in items}` – dermen Jul 29 '15 at 03:41
0

How about:

outer_df.set_index(['patient', 'cell']).unstack('cell').plot(kind='hist', stacked=True)
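
To see what that one-liner is plotting, here is a sketch of the reshaping step on its own. It is a slight variation: I select the product column before unstacking so the result has flat column names, which is my choice and not part of the line above. The plot call assumes a pandas recent enough to accept kind='hist', and the Agg line is only for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this in an interactive session
import pandas as pd

cells = ["22RV1_PROSTATE", "DU145_PROSTATE", "LN18_CENTRAL_NERVOUS_SYSTEM"]
outer_df = pd.DataFrame({
    "patient": [p for p in ("Pat_1", "Pat_2", "Pat_3") for _ in cells],
    "cell": cells * 3,
    "product": [12, 15, 9] * 3,
})

# Long to wide: one row per patient, one column per cell line.
wide = outer_df.set_index(["patient", "cell"])["product"].unstack("cell")
print(wide)

# Each column becomes its own histogram series, stacked on shared bins.
ax = wide.plot(kind="hist", stacked=True)
```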
8one6