
I have a large pandas DataFrame:

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3425100 entries, 2011-12-01 00:00:00 to 2011-12-31 23:59:59
Data columns:
sig_qual    3425100  non-null values
heave       3425100  non-null values
north       3425099  non-null values
west        3425097  non-null values
dtypes: float64(4)

I select a subset of that DataFrame using .ix[start_datetime:end_datetime] and pass it to a peakdetect function, which returns the index and value of the local maxima and minima in two separate lists. I extract the index positions of the maxima and, using DataFrame.index, get a list of pandas Timestamps.

I then try to extract the relevant rows of the large DataFrame by passing the list of Timestamps to .ix[], but it always seems to return an empty DataFrame. I can loop over the list of Timestamps and get the relevant rows one at a time, but this is slow, and according to the docs .ix[] should accept a list of values. (Although I see that the example for pandas 0.7 uses a numpy.ndarray of numpy.datetime64.)

Update: A small 8-second subset of the DataFrame is selected below; the lines starting with # show some of the values:

y = raw_disp['heave'].ix[datetime(2011,12,30,0,0,0):datetime(2011,12,30,0,0,8)]
#csv representation of y time-series 
2011-12-30 00:00:00,-310.0
2011-12-30 00:00:01,-238.0
2011-12-30 00:00:01.500000,-114.0
2011-12-30 00:00:02.500000,60.0
2011-12-30 00:00:03,185.0
2011-12-30 00:00:04,259.0
2011-12-30 00:00:04.500000,231.0
2011-12-30 00:00:05.500000,139.0
2011-12-30 00:00:06.500000,55.0
2011-12-30 00:00:07,-49.0
2011-12-30 00:00:08,-144.0

index = y.index
<class 'pandas.tseries.index.DatetimeIndex'>
[2011-12-30 00:00:00, ..., 2011-12-30 00:00:08]
Length: 11, Freq: None, Timezone: None

#_max returned from the peakdetect function, one local maximum for this 8-second period
_max = [[5, 259.0]]

indexes = [x[0] for x in _max]
#[5]

timestamps = [index[z] for z in indexes]
#[<Timestamp: 2011-12-30 00:00:04>]

print raw_disp.ix[timestamps]
#Empty DataFrame
#Columns: array([sig_qual, heave, north, west, extrema], dtype=object)
#Index: <class 'pandas.tseries.index.DatetimeIndex'>
#Length: 0, Freq: None, Timezone: None

for timestamp in timestamps:
    print raw_disp.ix[timestamp]
#sig_qual      0
#heave       259
#north        27
#west        132
#extrema       0
#Name: 2011-12-30 00:00:04
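
The lookup attempted above can be sketched for more recent pandas versions, where .ix is no longer available and .loc handles label-based selection. This is only an illustration under assumptions: the small raw_disp frame is a hand-built stand-in holding the 8-second heave sample plus two invented leading rows, and the peakdetect output is hard-coded. The point it shows is that the positions returned by peakdetect are relative to the slice y, so they are translated to Timestamps through y.index before being looked up in the larger frame.

```python
import pandas as pd

# Stand-in for the large frame: the 8-second sample from the question,
# preceded by two invented rows so that positions in `y` and in
# `raw_disp` differ (as they would with the real data).
base = pd.Timestamp("2011-12-30 00:00:00")
offsets = [-2, -1, 0, 1, 1.5, 2.5, 3, 4, 4.5, 5.5, 6.5, 7, 8]
heave = [-200.0, -280.0,  # invented values, not from the question
         -310.0, -238.0, -114.0, 60.0, 185.0, 259.0,
         231.0, 139.0, 55.0, -49.0, -144.0]
raw_disp = pd.DataFrame({"heave": heave},
                        index=base + pd.to_timedelta(offsets, unit="s"))

# Label-based slice (both end points are included).
y = raw_disp["heave"].loc[base:base + pd.Timedelta(seconds=8)]

# Hard-coded stand-in for the peakdetect result: [position, value].
_max = [[5, 259.0]]
positions = [p[0] for p in _max]

# Positions are relative to `y`, so go through y.index to get labels.
timestamps = [y.index[p] for p in positions]

# Select from the large frame by a list of Timestamp labels.
by_label = raw_disp.loc[timestamps]

# Alternative that avoids the list lookup: a boolean mask on the index.
by_mask = raw_disp[raw_disp.index.isin(timestamps)]

print(by_label)
```

Using the same positions directly on raw_disp would pick a different row, because position 5 there is not the same timestamp as position 5 in y.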

Update 2: I created a gist, which actually works: when the data is loaded from CSV, the index of timestamps is stored as a NumPy array of objects, which appear to be strings. In my own code, by contrast, the index is of type <class 'pandas.tseries.index.DatetimeIndex'> and each element is of type <class 'pandas.lib.Timestamp'>. I thought passing a list of pandas.lib.Timestamp would work the same as passing individual timestamps. Would this be considered a bug?

If I create the original DataFrame with the index as a list of strings, querying with a list of strings works fine. It does, however, increase the memory footprint of the DataFrame significantly.
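
The memory cost of a string index can be checked with a rough sketch like the one below. The index length and the date format are arbitrary choices made for illustration, and the exact byte counts depend on the pandas version and platform, so none are quoted here.

```python
import pandas as pd

# One-second timestamps; 100,000 rows is an arbitrary illustrative size.
dt_index = pd.date_range("2011-12-01", periods=100000, freq="s")

# The same labels stored as strings instead of timestamps.
str_index = pd.Index(dt_index.strftime("%Y-%m-%d %H:%M:%S"))

# deep=True also counts the memory held by the string objects themselves.
dt_bytes = dt_index.memory_usage(deep=True)
str_bytes = str_index.memory_usage(deep=True)

print("DatetimeIndex bytes:", dt_bytes)
print("String index bytes: ", str_bytes)
```

A DatetimeIndex stores each timestamp as a 64-bit integer, while each string label needs at least one byte per character plus overhead, which is consistent with the larger footprint described above.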

Update 3: The error appears to occur only with very large DataFrames. I reran the code on DataFrames of varying sizes (some detail in a comment below), and it appears to occur on DataFrames above roughly 2.7 million records. Using strings instead of Timestamps resolves the issue but increases memory usage.

Fixed in the latest GitHub master (18/09/2012); see the comment from Wes at the bottom of the page.

piRSquared
seumas

1 Answer

df.ix[my_list_of_dates] should work just fine.

In [193]: df
Out[193]:
            A  B  C  D
2012-08-16  2  1  1  7
2012-08-17  6  4  8  6
2012-08-18  8  3  1  1
2012-08-19  7  2  8  9
2012-08-20  6  7  5  8
2012-08-21  1  3  3  3
2012-08-22  8  2  3  8
2012-08-23  7  1  7  4
2012-08-24  2  6  0  6
2012-08-25  4  6  8  1

In [194]: row_pos = [2, 6, 9]

In [195]: df.ix[row_pos]
Out[195]:
            A  B  C  D
2012-08-18  8  3  1  1
2012-08-22  8  2  3  8
2012-08-25  4  6  8  1

In [196]: dates = [df.index[i] for i in row_pos]

In [197]: df.ix[dates]
Out[197]:
            A  B  C  D
2012-08-18  8  3  1  1
2012-08-22  8  2  3  8
2012-08-25  4  6  8  1
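
For readers on later pandas releases, where .ix was deprecated and then removed, the example above can be sketched with the explicit indexers: .iloc for integer positions and .loc for labels. The frame below is rebuilt by hand from the output shown; how the original df was created is not stated, so the constructor call is an assumption.

```python
import pandas as pd

# Rebuild the 10-row frame shown above (construction is assumed).
index = pd.date_range("2012-08-16", periods=10, freq="D")
df = pd.DataFrame(
    {
        "A": [2, 6, 8, 7, 6, 1, 8, 7, 2, 4],
        "B": [1, 4, 3, 2, 7, 3, 2, 1, 6, 6],
        "C": [1, 8, 1, 8, 5, 3, 3, 7, 0, 8],
        "D": [7, 6, 1, 9, 8, 3, 8, 4, 6, 1],
    },
    index=index,
)

row_pos = [2, 6, 9]

# Integer positions: the explicit replacement for df.ix[row_pos].
by_position = df.iloc[row_pos]

# Timestamp labels: the explicit replacement for df.ix[dates].
dates = [df.index[i] for i in row_pos]
by_label = df.loc[dates]

print(by_label)
```

Splitting the two cases removes the ambiguity of .ix, which had to guess whether a list of integers meant positions or labels.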
Wouter Overmeire
  • Thanks for the example, that was my understanding of how it is supposed to work, I have now provided an example of how it is failing in my original question. – seumas Aug 17 '12 at 09:36
  • What version of pandas are you using? Is it possible to share raw_disp? For me `update` works fine, y.ix[timestamps] (y has DateTimeIndex) gives the expected output (i can`t do raw_disp.ix[timestamps] of course since raw_disp is not available.) – Wouter Overmeire Aug 17 '12 at 11:29
  • Pandas version 0.8.1, I've been trying to reproduce the error on smaller DataFrames but it doesn't occur. When I try it on my large DataFrame of 3 million plus rows I get an Empty DataFrame. I have successfully reproduced the error on a DataFrame of 2888264 rows but it works fine on a DataFrame of 2665621 rows. I could upload the large DataFrame if others wish to reproduce it. – seumas Aug 17 '12 at 14:43
  • I tried reproducing this using `index = pandas.date_range('03/06/2000 00:00', periods=3e6, freq='s')`, and `df = pandas.DataFrame(np.random.randn(3e6, 4), columns=list('ABCD'), index=index)` this works fine. Could you post an issue on GitHub (https://github.com/pydata/pandas/issues) - with code reproducing the issue? – Wouter Overmeire Aug 20 '12 at 07:22
  • Fixed on GitHub as of today (9/18) – Wes McKinney Sep 18 '12 at 21:09
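
The reproduction attempt from the comments can be sketched as a small script for anyone who wants to check their own pandas version. Some assumptions: the helper names build_frame and list_lookup_matches are hypothetical, sizes are passed as integers (current NumPy and pandas reject floats such as 3e6 here), .loc and .iloc stand in for .ix, and the date '03/06/2000' is read as 6 March 2000. The run below uses a small frame; the size reported as failing in the question was above roughly 2.7 million rows.

```python
import numpy as np
import pandas as pd


def build_frame(n_rows):
    """Random frame with a one-second DatetimeIndex (hypothetical helper)."""
    index = pd.date_range("2000-03-06 00:00", periods=n_rows, freq="s")
    data = np.random.randn(n_rows, 4)
    return pd.DataFrame(data, columns=list("ABCD"), index=index)


def list_lookup_matches(df, positions):
    """True if selecting by a list of Timestamps returns the same rows
    as selecting the same rows by integer position."""
    stamps = [df.index[p] for p in positions]
    return df.loc[stamps].equals(df.iloc[positions])


# Small size so the check runs quickly; pass e.g. 3000000 to mirror
# the frame size discussed in the comments (needs far more memory).
small = build_frame(1000)
ok = list_lookup_matches(small, [2, 6, 9])
print("list lookup matches positional lookup:", ok)
```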