I am having difficulty figuring out the best way to subset a large data set in Pandas, however my experience with pandas is limited.
My data consists of 4 columns and ~400 million lines. The columns represent datetime, X position, Y position, and then some information about the point. My goal is to subset the DataFrame by X,Y pairs which have a value of 'Stuff' that exceeds a threshold value at least once. To be clear, I want all instances of the X,Y pairs identified, not just those which exceed the threshold. Obviously this needs to be done in a memory efficient manner. I should mention that the data are currently indexed by date (although the dates are not unique).
As an example, if this is my data:
In [6]: data
Out[6]:
Stuff X Y
Date
2004-01-01 2.3 10 3
2004-01-02 5.6 20 2
2004-01-02 14.9 30 4
2004-01-02 10.2 10 3
2004-01-03 3.1 30 4
2004-01-04 2.9 10 5
2004-01-05 3.4 20 4
And I have a threshold of 10, I would expect to get the following as a result since the pairs 10,3 and 30,4 both have an instance of 'Stuff' that exceeds the threshold:
In [7]: data
Out[7]:
Stuff X Y
Date
2004-01-01 2.3 10 3
2004-01-02 14.9 30 4
2004-01-02 10.2 10 3
2004-01-03 3.1 30 4
Any tips would be appreciated. I suspect I may need to change the index to MultiIndex on X,Y, but I'm not sure. I've also looked into using groupby(['X','Y']).max but I don't believe I can subset my original dataframe using these results.
QUESTION UPDATE The solution presented by @jsexauer was what I wanted, however I am suffering from a memory error due to the large size of the data. Any recommendations on how to optimize this process?