43

This is a very simple and practical question. I have the feeling that it must be a silly detail and that there should be similar questions. I wasn't able to find them tho. If someone does I'll happily delete this one.

The closest I found were these: pandas: iterating over DataFrame index with loc

How to select rows within a pandas dataframe based on time only when index is date and time

anyway, the thing is, I have a datetime indexed panda dataframe as follows:

In[81]: y
Out[81]: 
            PETR4  CSNA3  VALE5
2008-01-01    0.0    0.0    0.0
2008-01-02    1.0    1.0    1.0
2008-01-03    7.0    7.0    7.0

In[82]: y.index
Out[82]: DatetimeIndex(['2008-01-01', '2008-01-02', '2008-01-03'], dtype='datetime64[ns]', freq=None)

Oddly enough, I can't access its values using none of the following methods:

In[83]: y[datetime.datetime(2008,1,1)]
In[84]: y['2008-1-1']
In[85]: y['1/1/2008']

I get the KeyError error.

Even more weird is that the following methods DO work:

In[86]: y['2008']
Out[86]: 
            PETR4  CSNA3  VALE5
2008-01-01    0.0    0.0    0.0
2008-01-02    1.0    1.0    1.0
2008-01-03    7.0    7.0    7.0
In[87]: y['2008-1']
Out[87]: 
            PETR4  CSNA3  VALE5
2008-01-01    0.0    0.0    0.0
2008-01-02    1.0    1.0    1.0
2008-01-03    7.0    7.0    7.0

I'm fairly new to pandas so maybe I'm missing something here?

Community
  • 1
  • 1
Pedro Braz
  • 2,261
  • 3
  • 25
  • 48

3 Answers3

63

pandas is taking what's inside the [] and deciding what it should do. If it's a subset of column names, it'll return a DataFrame with those columns. If it's a range of index values, it'll return a subset of those rows. What is does not handle is taking a single index value.

Solution

Two work around's

1.Turn the argument into something pandas interprets as a range.

df['2008-01-01':'2008-01-01']

2.Use the method designed to give you this result. loc[]

df.loc['2008-01-01']

Link to the documentation

piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • 5
    df.ix is *deprecated*, use ```df.loc['2008-01-01']``` instead – DmitrySemenov Aug 23 '17 at 22:16
  • @DmitrySemenov thx. I've updated the post. I'm assuming I have a few of these answers that can be updated (-: – piRSquared Aug 23 '17 at 22:47
  • `df.loc` works for me. Without `.loc`, it says that it doesnt accept strings – Nyxynyx Dec 03 '17 at 19:07
  • 2
    your index needs to be of type `pandas.core.indexes.datetimes.DatetimeIndex`. Run `type(df.index)` to see. If not then do `df.index = pd.to_datetime(df.index)` – piRSquared Dec 03 '17 at 19:09
  • @DmitrySemenov and how to index datewise... I mean if I have a series with DatetimeIndex of 365 days, and I want to see dates 2 to 5 of every month, then how to perform slicing / index operation ? – AAYUSH SHAH Jan 30 '21 at 08:17
  • I don't know why the 2nd more elegant solution doesn't work for me (I've even tried multiple string formats including ones that include seconds and even milliseconds, in addition to the date only ISO format), but the 1st range based one works great. – chaostheory Jul 22 '21 at 17:56
6

You can use the to_pydatetime function on your index so thus:

y[y.index.to_pydatetime() == datetime.datetime(2008,1,1)]
Scratch'N'Purr
  • 9,959
  • 2
  • 35
  • 51
6

Reversing your dataframe allows the indexing to work:

Here is your .csv datafile:

Date,PETR4,CSNA3,VALE5
2008-01-01,0.0,0.0,0.0
2008-01-02,1.0,1.0,1.0
2008-01-03,7.0,7.0,7.0

Use the following incantation to read it into a DataFrame:

>>> a = pd.read_csv('your.csv', index_col=0, parse_dates=True, infer_datetime_format=True)

Then, try to index a row:

>>> a['2008-01-01']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 1969, in __getitem__
    return self._getitem_column(key)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 1976, in _getitem_column
    return self._get_item_cache(key)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 1091, in _get_item_cache
    values = self._data.get(item)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 3211, in get
    loc = self.items.get_loc(item)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/index.py", line 1759, in get_loc
    return self._engine.get_loc(key)
  File "pandas/index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas/index.c:3979)
  File "pandas/index.pyx", line 157, in pandas.index.IndexEngine.get_loc (pandas/index.c:3843)
  File "pandas/hashtable.pyx", line 668, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12265)
  File "pandas/hashtable.pyx", line 676, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12216)
KeyError: '2008-01-01'

You end up with a traceback with KeyError

However, if you reverse it, like this:

>>> b = a[::-1]

Then try the same index, you get the proper result:

>>> b['2008-01-01']
            PETR4  CSNA3  VALE5
Date                           
2008-01-01      0      0      0

I do NOT know why this is the case. Chances are, it has something to do with being a time series one way, but not the other? Someone more knowledgeable should answer that.

Update: By RTFM, I discovered this page:

https://pandas.pydata.org/pandas-docs/stable/timeseries.html

If you find the section titled "Slice vs. Exact Match", there is a warning that explains this behavior. The problem seems to be that for a TimeSeries, an exact match is interpreted as a column name. For unsorted dataframes, this doesn't happen. See the warning box in the section referenced above. I still find this terribly confusing, but there you go...

Edit: Changed the printout of b, which was wrong in the original.

Edit1: Update with explanation in python documentation.

bob_monsen
  • 161
  • 1
  • 5