0

I have a series of dataframes inside a dataframe.

The top level dataframe is structured like this :

    24hr   48hr   72hr
D1  x      x      x
D2  x      x      x 
D3  x      x      x

In each case x is a dataframe created with pandas.read_excel()

One of the columns in each x dataframe has the title 'Average Vessels Length' and there are three entries (i.e. rows, indices) in that column.

What I want to return is the mean value for the column 'Average Vessels Length'. I'm also interested in how to return an particular cell in that column. I know there's a .mean method for pandas dataframes, but I can't figure out the indexing syntax to use it.

Below is an example

import pandas as pd

a = {'Image name' : ['Image 1', 'Image 2', 'Image 3'], 'threshold' : [20, 25, 30], 'Average Vessels Length' : [14.2, 22.6, 15.7] }
b = pd.DataFrame(a, columns=['Image name', 'threshold', 'Average Vessels Length'])

c = pd.DataFrame(index=['D1','D2','D3'], columns=['24hr','48hr','72hr'])
c['24hr']['D1'] = a
c['48hr']['D1'] = a
c['72hr']['D1'] = a
c['24hr']['D2'] = a
c['48hr']['D2'] = a
c['72hr']['D2'] = a
c['24hr']['D3'] = a
c['48hr']['D3'] = a
c['72hr']['D3'] = a

This returns the mean of the values in the column 'Average Vessels Length' :

print b['Average Vessels Length'].mean()

This returns all the values in 24hr, D1, 'Average Vessels Length'

print c['24hr']['D1']['Average Vessels Length']

This doesn't work :

print c['24hr']['D1']['Average Vessels Length'].mean()

And I can't figure out how to access any particular value in c['24hr']['D1']['Average Vessels Length']

Ultimately I want to take the mean from each column of Dx['Average Vessels Length'].mean() and divide it by the corresponding D1['Average Vessels Length'].mean()

Any help would be greatly appreciated.

agf1997
  • 2,668
  • 4
  • 21
  • 36
  • Can you try to put together some example data? Ideally you want a few lines of code that people can copy-paste into a console and have a dataframe that they can try things out with. – Marius Mar 30 '15 at 00:14
  • @Marius not easy to give a functional example as I'm pulling data in from an external .xls file with many columns. – agf1997 Mar 30 '15 at 00:29
  • You need to create a minimal example that demonstrates the problem so that people have something to work with, not necessarily your actual data but probably something put together with some for loops and some use of `numpy.random.randn()`, see [here](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for some tips. – Marius Mar 30 '15 at 00:36
  • @Marius Example added – agf1997 Mar 30 '15 at 01:27
  • 1
    Oh gosh, maybe you have a good reason, but I think you would find it much easier to store everything in a single dataframe and then use an index to classify each row as D1, D2, ... and 24hr, 48hr, ... That is just how pandas is set up to work and will generally lead to the most convenient solutions. Don't get me wrong, it is fine to do it how you do, I just think it leads to harder solutions for the most part. – JohnE Mar 30 '15 at 01:42
  • @JohnE can you provide an example? I'm using pandas.read_excel() to get the data for each Dx / hr and each call to that returns a dataframe. – agf1997 Mar 30 '15 at 02:16
  • 1
    just use ```pd.concat()```. I would put them in a list when reading from excel then just concat that list into a single dataframe (in one line). – JohnE Mar 30 '15 at 02:32
  • @JohnE I guess I'm not following. If it's all together in one list / line how would I differentiate between the various Dx and hrs? – agf1997 Mar 30 '15 at 02:52
  • 1
    @afg1997 very briefly- read your individual dataframes into a dict keyed with tuples like `(24hr, D1)`, so `all_dict[('24hr', 'D1')] = pd.read_excel(current_file)`. Then call `pd.concat()` on the dict to create a combined dataframe with the hours and Dx vars in the index. – Marius Mar 30 '15 at 03:14
  • @Marius Wow!!! So much easier to set the data up this way! Thanks – agf1997 Mar 30 '15 at 05:02

1 Answers1

0

I'm assuming that since you said each element of your big dataframe was a dataframe, your example data should have been:

import pandas as pd

a = {'Image name' : ['Image 1', 'Image 2', 'Image 3'], 'threshold' : [20, 25, 30], 'Average Vessels Length' : [14.2, 22.6, 15.7] }
b = pd.DataFrame(a, columns=['Image name', 'threshold', 'Average Vessels Length'])

c = pd.DataFrame(index=['D1','D2','D3'], columns=['24hr','48hr','72hr'])
c['24hr']['D1'] = b
c['48hr']['D1'] = b
c['72hr']['D1'] = b
c['24hr']['D2'] = b
c['48hr']['D2'] = b
c['72hr']['D2'] = b
c['24hr']['D3'] = b
c['48hr']['D3'] = b
c['72hr']['D3'] = b

To get the mean of each individual cell you can use applymap, which maps a function to each cell of the DataFrame:

cell_means = c.applymap(lambda e: e['Average Vessels Length'].mean())
cell_means
Out[14]: 
    24hr  48hr  72hr
D1  17.5  17.5  17.5
D2  17.5  17.5  17.5
D3  17.5  17.5  17.5

And once you have those yo can get the column means etc. and go on to normalize by the mean:

col_means = cell_means.mean(axis=0)
col_means
Out[11]: 
24hr    17.5
48hr    17.5
72hr    17.5
dtype: float64
Marius
  • 58,213
  • 16
  • 107
  • 105
  • Thanks! This seems to work but I'm having trouble getting by the next step now. I want to normalize the data to the first row so I can plot a line graph for each Dx. In other words I want to divide each row of cell_means by the first row of cell_means. – agf1997 Mar 30 '15 at 02:17