1

I am having difficulty figuring out the best way to subset a large data set in Pandas, however my experience with pandas is limited.

My data consists of 4 columns and ~400 million lines. The columns represent datetime, X position, Y position, and then some information about the point. My goal is to subset the DataFrame by X,Y pairs which have a value of 'Stuff' that exceeds a threshold value at least once. To be clear, I want all instances of the X,Y pairs identified, not just those which exceed the threshold. Obviously this needs to be done in a memory efficient manner. I should mention that the data are currently indexed by date (although the dates are not unique).

As an example, if this is my data:

In [6]: data
Out[6]: 
            Stuff   X  Y
Date                    
2004-01-01    2.3  10  3
2004-01-02    5.6  20  2
2004-01-02   14.9  30  4
2004-01-02   10.2  10  3
2004-01-03    3.1  30  4
2004-01-04    2.9  10  5
2004-01-05    3.4  20  4

And I have a threshold of 10, I would expect to get the following as a result since the pairs 10,3 and 30,4 both have an instance of 'Stuff' that exceeds the threshold:

In [7]: data
Out[7]: 
            Stuff   X  Y
Date                    
2004-01-01    2.3  10  3
2004-01-02   14.9  30  4
2004-01-02   10.2  10  3
2004-01-03    3.1  30  4

Any tips would be appreciated. I suspect I may need to change the index to MultiIndex on X,Y, but I'm not sure. I've also looked into using groupby(['X','Y']).max but I don't believe I can subset my original dataframe using these results.

QUESTION UPDATE The solution presented by @jsexauer was what I wanted, however I am suffering from a memory error due to the large size of the data. Any recommendations on how to optimize this process?

2 Answers2

1

You were right with your hunch that you need to use multi-indexing:

>>> df = df.reset_index()
>>> df = df.set_index(['X','Y'])
>>> df.ix[ df[df['Stuff']>10].index.unique() ]
            Date  Stuff
X  Y                   
30 4  2004-01-02   14.9
   4  2004-01-03    3.1
10 3  2004-01-01    2.3
   3  2004-01-02   10.2
jsexauer
  • 681
  • 2
  • 12
  • 22
  • This is a great answer. However, the size of my DataFrame seems to be problematic (i.e., memory error). Should I expect any improvement by updating my HDF5 file with the new index before proceeding with the calculation? Any other suggestions? – Cloudwalker Oct 04 '13 at 19:34
  • 1
    For any future users of this technique I would also point out that if a single pair has more than one threshold exceedence you will want to use df.ix[ df[df['Stuff']>10].index.unique() ] instead so you don't double count. – Cloudwalker Oct 04 '13 at 21:21
  • I would read the data in chunks and process it to get all the indexes you'll need. Then query only those index out of the HDF5 file. – jsexauer Oct 05 '13 at 11:50
  • @Cloudwalker You may find some of the stuff discussed here helpful: http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas – jsexauer Oct 07 '13 at 14:08
-1

You could select by defining the condition like so,

df[df.stuff > 10]

========== Edited to add screenshots of actual run=======================

idx = ['2004-01-02','2004-01-02','2004-01-02','2004-01-03','2004-01-04','2004-01-05','2004-01-05']
import pandas
df = pandas.DataFrame(index = idx)
df['x'] = [ 10,20,30,10,30,10,20]
df['y'] = [ 4,5,6,7,3,5,7]
df['stuff'] = [2.6,5.6,14.9,10.2,3.1,2.9,3.4]
print df

yields,

             x  y
2004-01-02  10  4
2004-01-02  20  5
2004-01-02  30  6
2004-01-03  10  7
2004-01-04  30  3
2004-01-05  10  5
2004-01-05  20  7

To select with conditions, do,

df[df.stuff > 10]

This yields,

                 x   y  stuff
2004-01-02   30  6   14.9
2004-01-03   10  7   10.2
nitin
  • 7,234
  • 11
  • 39
  • 53
  • I need to select all X,Y pairs where at least one instance of "Stuff" for a pair is over is > 10.The example shows two values of "Stuff" > 10 but the result needs to return all data for the corresponding X,Y pairs (4 lines in this case). – Cloudwalker Oct 04 '13 at 17:09
  • I don't think the Python `or` will work here, because that will return the first truelike argument. Something like `(data.X > 10) | (data.Y > 10)` should, IIUC. – DSM Oct 04 '13 at 17:23
  • When I try it, I get the error message deliberately designed to avoid the possibility of bugs due to thinking that `or` will work, namely `ValueError: The truth value of an array is ambiguous. Use a.empty, a.item(), a.any() or a.all().`. – DSM Oct 04 '13 at 17:34
  • -1 The question wasn't ambiguous. Try your solution. It doesn't return what the OP wanted. Also, as @DSM says, `or` is not a valid operator for numpy arrays (nor pandas data structures, which are backed by numpy arrays). – Phillip Cloud Oct 04 '13 at 17:35
  • Also the threshold is for the `Stuff` column not `X` or `Y`. – Phillip Cloud Oct 04 '13 at 17:38
  • The method here is not quite what I was looking for. @jsexauer nailed it. – Cloudwalker Oct 04 '13 at 18:18