
I am iterating through a large DataFrame with a MultiIndex using iterrows. Each row comes back as a Series with a MultiIndex. Profiling showed that most of the time is spent getting cell values from that Series, so I would like to use Series.at, which is much faster. Unfortunately, I haven't found anything in the pandas documentation about using .at with a MultiIndex.

Here is a simple example:

import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

s = pd.Series(np.random.randn(8), index=index)
>>> s
first  second
bar    one      -0.761968
       two       0.670786
baz    one      -0.193843
       two      -0.251533
foo    one       1.732875
       two       0.538561
qux    one      -1.111480
       two       0.478322
dtype: float64

I have tried s.at[("bar","one")] and s.at["bar","one"], but neither of them works.

>>> s.at[("bar","one")]
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Python\lib\site-packages\pandas\core\indexing.py", line 2270, in __getitem__
    return self.obj._get_value(*key, takeable=self._takeable)
TypeError: _get_value() got multiple values for argument 'takeable'
>>> s.at["bar","one"]
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Python\lib\site-packages\pandas\core\indexing.py", line 2270, in __getitem__
    return self.obj._get_value(*key, takeable=self._takeable)
TypeError: _get_value() got multiple values for argument 'takeable'

Does anyone have any idea how to use .at in this case?

1 Answer

Use Series.loc:

print (s.loc[("bar","one")])
1.265936258705534

EDIT:

It seems to be a bug.

With a DataFrame, .at works fine:

np.random.seed(1234)
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

s = pd.Series(np.random.randn(8), index=index)
print (s)
first  second
bar    one       0.471435
       two      -1.190976
baz    one       1.432707
       two      -0.312652
foo    one      -0.720589
       two       0.887163
qux    one       0.859588
       two      -0.636524
dtype: float64

df = s.to_frame('col')
print (df)
                   col
first second          
bar   one     0.471435
      two    -1.190976
baz   one     1.432707
      two    -0.312652
foo   one    -0.720589
      two     0.887163
qux   one     0.859588
      two    -0.636524

print (df.at[("bar","one"), 'col'])
0.47143516373249306
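
Since the DataFrame lookup works, one way to avoid iterrows altogether is to loop over the index labels and read each cell with DataFrame.at. The following is only a minimal sketch on a small made-up frame; the column name 'col' and the running total are illustrative assumptions, not part of the original problem.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical data) with a two-level MultiIndex.
arrays = [['bar', 'bar', 'baz', 'baz'], ['one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
df = pd.DataFrame({'col': np.arange(4, dtype=float)}, index=index)

total = 0.0
for key in df.index:            # each key is a tuple such as ('bar', 'one')
    total += df.at[key, 'col']  # scalar lookup: (row label tuple, column label)

print(total)
```

This assumes the index labels are unique, so that each (row, column) pair identifies a single scalar.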
jezrael
  • Loc works, but it is terribly slow: https://stackoverflow.com/questions/37216485/pandas-at-versus-loc –  May 08 '19 at 11:12
  • @RLaszlo - Is it possible to work with a DataFrame instead? If yes, check the edited answer. – jezrael May 08 '19 at 11:23
  • I have achieved a massive improvement. Instead of using iterrows and trying to force .at to get the data from a Series, I am only iterating through the index of the whole DataFrame and using .at. Multi-indexing with a DataFrame works really well. Thx @jezrael –  May 08 '19 at 12:12
  • In addition to the answer above, you could also use IndexSlice and loc on the Series: idx = pd.IndexSlice; s.loc[idx['bar','one']]; s.loc[idx[['bar','foo'],['one','two']]] – run-out May 08 '19 at 12:26
  • The 2019-05-08 edit added that this is a bug. It looks like https://github.com/pandas-dev/pandas/issues/26989 has a fix, https://github.com/pandas-dev/pandas/pull/32520, that was merged in May 2020. – Tom Brown Sep 21 '20 at 18:05
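
If the fix mentioned in the last comment is present in your installed pandas, the original Series.at call with a tuple key should work directly. The sketch below hedges against older versions by falling back to .loc when the TypeError from the question is raised; the sample data is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical small Series with a two-level MultiIndex.
index = pd.MultiIndex.from_product(
    [['bar', 'baz'], ['one', 'two']], names=['first', 'second']
)
s = pd.Series(np.arange(4, dtype=float), index=index)

try:
    # Expected to work on pandas versions that include the fix.
    value = s.at[('bar', 'one')]
except TypeError:
    # Older versions raise the TypeError shown in the question;
    # .loc is slower but returns the same scalar.
    value = s.loc[('bar', 'one')]

print(value)
```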