
I have a large DataFrame where I store various metadata in a column MultiIndex (see also here).

Essentially my dataframe looks like this:

location    zero    A         B         C  and so on
type        zero  MUR  RHE  DUJ  RHE  RHE
name        zero  foo  bar  baz  boo  far
1930-03-01     0  2.1  3.4  9.4  5.4  5.5
1930-04-01     0  3.1  3.6  7.3  6.7  9.5
1930-05-01     0  2.5  9.1  8.0  1.1  8.1
and so on

So that I can easily select, for example, all DUJ datatypes with `mydf.xs('DUJ', level='type', axis=1)`.

But how can I access the strings in the type index, eliminate duplicates, and maybe get some statistics?

I am looking for an output like

types('MUR', 'RHE', 'DUJ')

and/or

types:
DUJ 1
MUR 1
RHE 3

giving me a list of the datatypes and how often they occur.

I can access the index with

[In]mytypes = mydf.columns.get_level_values(1)

[In]mytypes
[Out]Index([u'zero', u'MUR', u'RHE', u'DUJ', u'RHE', u'RHE'], dtype='object')

but I can't think of an easy way to do something with this information, especially considering that my real dataset will return 1500 entries. My first idea was a simple `mytypes.sort()`, but apparently I `Cannot sort an 'Index' object`.
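As an aside, the in-place `sort()` fails because an Index is immutable; a non-mutating sort does work. A minimal sketch, using the sample values above:

```python
import pandas as pd

mytypes = pd.Index(['zero', 'MUR', 'RHE', 'DUJ', 'RHE', 'RHE'])

# Index objects are immutable, so in-place .sort() raises an error;
# sort_values() returns a new, sorted Index instead.
sorted_types = mytypes.sort_values()
print(list(sorted_types))  # ['DUJ', 'MUR', 'RHE', 'RHE', 'RHE', 'zero']
```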

Being able to describe your dataset seems like a rather important thing to me, so I would expect that there is something built into pandas, but I can't seem to find it. And the MultiIndex documentation seems only to be concerned with constructing and setting indexes, not with analyzing them.

  • Won't `mytypes.value_counts()` give you what you want? – EdChum Sep 07 '15 at 10:29
  • Indeed it does. Just as I expected. There is an easy inbuilt way. I just really suck at finding those… Can you post it as answer? – JC_CL Sep 07 '15 at 10:38

1 Answer


Index objects have a method for this, `value_counts`, so you can just call:

mytypes.value_counts()

This returns a Series with the unique index values as its index and their counts as the values.

Example from your linked question:

In [3]:
import numpy as np
import pandas as pd

header = [np.array(['location','location','location','location2','location2','location2']),
          np.array(['S1','S2','S3','S1','S2','S3'])]
df = pd.DataFrame(np.random.randn(5, 6), index=['a','b','c','d','e'], columns=header)
df.columns

Out[3]:
MultiIndex(levels=[['location', 'location2'], ['S1', 'S2', 'S3']],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])

In [4]:    
df.columns.get_level_values(1).value_counts()

Out[4]:
S1    2
S2    2
S3    2
dtype: int64
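Applied to the index values from the question, this covers both desired outputs: `unique()` gives just the list of distinct type labels, and `value_counts()` gives the occurrence counts. A sketch using the sample values above:

```python
import pandas as pd

mytypes = pd.Index(['zero', 'MUR', 'RHE', 'DUJ', 'RHE', 'RHE'])

# Distinct labels, in order of first appearance
print(list(mytypes.unique()))  # ['zero', 'MUR', 'RHE', 'DUJ']

# Occurrence counts as a Series (unique values -> counts)
counts = mytypes.value_counts()
print(counts['RHE'])  # 3
print(counts['DUJ'])  # 1
```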
EdChum
  • Thanks! Seems like I can do quite a lot with the series I get from `value_counts`. Also a nice way to find errors in my preprocessing. I definitely should not have a datatype called `z)'`. – JC_CL Sep 07 '15 at 11:30