0

I'm having a problem in Pandas with the TimeStamp selections.

For what I see that is a problem that some other people have (Selecting a subset of a Pandas DataFrame indexed by DatetimeIndex with a list of TimeStamps) but the developers of Pandas unfortunately refuse to accept it as a bug (https://github.com/pydata/pandas/issues/2437).

In any case I couldn't follow the work around proposed for the SO post I quote above, since my data doesn't come in a CSV file, but in a number of lists (actually I got it from the internet trough JSON and convert that to lists).

The data I got is something like this:

the_dataTransactions
[{u'date': u'1365100630', u'tid': 240264, u'price': u'132.58', u'amount': u'1.28309000'}, {u'date': u'1365100630', u'tid': 240263, u'price': u'132.58', u'amount': u'1.20294000'}, {u'date': u'1365100629', u'tid': 240262, u'price': u'132.58', u'amount': u'0.90893940'}]

And I convert it to:

transactionsDate
[datetime.datetime(2013, 4, 4, 19, 37, 10), datetime.datetime(2013, 4, 4, 19, 37, 10), datetime.datetime(2013, 4, 4, 19, 37, 9)]

And I also tried this, but the error in the result when I try to select a data range was the same:

transactionsDate
[<Timestamp: 2013-04-04 19:37:10>, <Timestamp: 2013-04-04 19:37:10>, <Timestamp: 2013-04-04 19:37:09>]

And the tid, price and amount where also added to a data frame like:

>>> transactionsDF.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 73 entries, 2013-04-04 19:37:10 to 2013-04-04 19:22:49
Data columns:
tid       73  non-null values
price     73  non-null values
amount    73  non-null values
dtypes: float64(2), int64(1)

>>> transactionsDF.head()
                        tid   price    amount
2013-04-04 19:37:10  240264  132.58  1.283090
2013-04-04 19:37:10  240264  132.58  1.283090
2013-04-04 19:37:10  240263  132.58  1.202940
2013-04-04 19:37:09  240262  132.58  0.908939
2013-04-04 19:37:09  240261  132.59  0.213051

But, when I try to choose a data range using the normal notation, I get the same error that is reported in the other post:

>>> transactionsDF['2013-04-03 18:00:00':'2013-04-04 19:00:00']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/frame.py", line 1951, in __getitem__
    indexer = self.ix._convert_to_indexer(key, axis=0)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/indexing.py", line 478, in _convert_to_indexer
    i, j = labels.slice_locs(start, stop)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/tseries/index.py", line 1153, in slice_locs
    start_loc = self._get_string_slice(start).start
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/tseries/index.py", line 1143, in _get_string_slice
    loc = self._partial_date_slice(reso, parsed)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/tseries/index.py", line 1041, in _partial_date_slice
    raise TimeSeriesError('Partial indexing only valid for ordered '
pandas.tseries.index.TimeSeriesError: Partial indexing only valid for ordered time series.

My data seems to be in a ordered time series. Can you think about a work around for this Pandas glitch in this particular case?

UPDATE (Solved?): I found a way that is so simple that I'm not completely sure it will give the correct answer everytime, but at least for a small data frame it's working. The code it's just:

transactionsDF = transactionsDF.sort_index()

And after this is seems to be working fine and allow me to choose a data range like I used to with other data: transactionsDF['2013-04-04 19:30':'2013-04-04 19:35']

Perhaps someone more knowledgeable might validate or unvalidate this workaround.

Community
  • 1
  • 1
jbssm
  • 6,861
  • 13
  • 54
  • 81
  • 1
    this was fixed for 0.11-dev (give it a try). was a buglet! https://github.com/pydata/pandas/pull/3136 – Jeff Apr 04 '13 at 19:35
  • FYI looking at the head your data does NOT appear to be ordered – Jeff Apr 04 '13 at 20:05
  • @Jeff It's ordered, but it's ordered from the most recent to the most ancient. To my surprise, quite unexpectedly, I used a Pandas function that caught my eye: sort_index() and after this is seems to work like other routines I wrote in Pandas and allow me selection. – jbssm Apr 04 '13 at 21:00
  • great, ordered allows fast searching, but we made unordered work as well – Jeff Apr 04 '13 at 21:17
  • @Jeff Excellent that you are working on Pandas Jeff. It's a really great library and I use it a lot for my research work. – jbssm Apr 04 '13 at 21:32

1 Answers1

1

I think there is no truly elegant solution. Pandas does not like duplicate indexes. (At least the slightly old version that I have.) You can create DataFrames with duplicate indexes but you can't access their content comfortably.

Therefore you should put the dates into a separate column. Then you access the interesting rows with comparison operators on the dates, and fancy indexing:

In [1]: import pandas as pd

In [5]: import datetime

In [15]: f1 = pd.DataFrame([{u'date': u'1365100630', u'tid': 240264, u'price': u'132.58', u'amount': u'1.28309000'}, {u'date': u'1365100630', u'tid': 240263, u'price': u'132.58', u'amount': u'1.20294000'}, {u'date': u'1365100629', u'tid': 240262, u'price': u'132.58', u'amount': u'0.90893940'}])

In [16]: f1["dates"] = [datetime.datetime(2013, 4, 4, 19, 37, 10), datetime.datetime(2013, 4, 4, 19, 37, 10), datetime.datetime(2013, 4, 4, 19, 37, 9)]

In [17]: f1
Out[17]: 
       amount        date   price     tid                dates
0  1.28309000  1365100630  132.58  240264  2013-04-04 19:37:10
1  1.20294000  1365100630  132.58  240263  2013-04-04 19:37:10
2  0.90893940  1365100629  132.58  240262  2013-04-04 19:37:09

In [25]: matching = (f1["dates"] >= datetime.datetime(2013, 4, 4, 19, 37, 10)) & (f1["dates"] < datetime.datetime(2013, 4, 4, 20, 00, 00))

In [26]: f1.ix[matching]
Out[26]: 
       amount        date   price     tid                dates
0  1.28309000  1365100630  132.58  240264  2013-04-04 19:37:10
1  1.20294000  1365100630  132.58  240263  2013-04-04 19:37:10

You can also use f1[matching] to access the interesting rows, but I find it less clear, because f1["foo"] is used to access columns.

Eike
  • 2,205
  • 17
  • 10
  • Thank you for your help. I found a very simple way to do it... altough I'm not sure if it will give the correct values all the time. I just used transactionsDF = transactionsDF.sort_index() and now it all works like it used to in other examples. Like I said, I'm not sure if this gives the correct answer everytime, but so far it seems to give me the correct data when I choose a time range. – jbssm Apr 04 '13 at 21:02