How does pandas handle memory in Python? More specifically, what happens to memory when I assign the result of a DataFrame query to a variable? Under the hood, would the result just hold references to the original DataFrame's data, or would I be copying all of the data?
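
One way to check this empirically is sketched below. The column names `id`, `cycle` and `value` and the sample rows are assumptions made up for illustration; they do not come from the real dataset. Boolean-mask selection generally materializes the matching rows as new data rather than as a view, so the check is expected to report that no memory is shared.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the column names and values are assumptions.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 3],
    "cycle": [10, 10, 10, 20, 20],
    "value": [0.1, 0.2, 0.3, 0.4, 0.5],
})

# Boolean-mask selection, assigned to a variable.
subset = df[(df["id"] == 1) & (df["cycle"] == 10)]

# Does the selected column share its buffer with the original column?
shares = np.shares_memory(df["value"].to_numpy(), subset["value"].to_numpy())
print("shares memory with df:", shares)

# Bytes held by each object, index included.
print("df bytes:    ", df.memory_usage(deep=True).sum())
print("subset bytes:", subset.memory_usage(deep=True).sum())
```

The extra memory is proportional to the number of matching rows, not to the size of the whole DataFrame, although the intermediate boolean masks are as long as the DataFrame itself.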

I'm afraid of memory ballooning out of control, but I have a DataFrame whose fields are non-unique, so I can't index it by them. Querying and plotting data from it with commands like df[(df[''] == x) & (df[''] == y)] is incredibly slow.

(Both columns hold integer values. They are also not unique, which is why the query returns multiple rows.)

I'm very new to pandas, so any insight into how to retrieve the arrays of values where two conditions match would be great too. Right now I loop through the data once with an O(n) algorithm to build my own index, because even that runs faster than the search queries when I need to access the data quickly. Watching my system take twenty seconds on a dataset of only 6,000 rows is foreboding.
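
For concreteness, a one-pass index of the kind described above might look like the following minimal sketch. It is not the actual code in use; the column names `id` and `cycle`, the sample rows, and the helper `lookup` are all hypothetical.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical data; the column names and values are assumptions.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 3],
    "cycle": [10, 10, 10, 20, 20],
    "value": [0.1, 0.2, 0.3, 0.4, 0.5],
})

# One O(n) pass: map each (id, cycle) pair to the positions of its rows.
positions = defaultdict(list)
for pos, key in enumerate(zip(df["id"].tolist(), df["cycle"].tolist())):
    positions[key].append(pos)


def lookup(frame, index, id_, cycle):
    """Return the rows for one (id, cycle) pair; empty if the pair is absent."""
    return frame.iloc[index.get((id_, cycle), [])]


print(lookup(df, positions, 1, 10))
```

After the single pass, each lookup costs a dictionary access plus a positional selection of only the matching rows, instead of two full-column comparisons per query.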

Raishin
  • related: http://stackoverflow.com/questions/23296282/what-rules-does-pandas-use-to-generate-a-view-vs-a-copy – lib Jul 31 '15 at 16:10
  • Can you make your example of matching two conditions more explicit? Are you doing something like df[df[:].isin([x,y])]? – lib Jul 31 '15 at 16:14
  • I'm just trying to match an ID and a cycle. They're both integers. – Raishin Jul 31 '15 at 16:17
  • If I understood your problem, the "SQL" way to do it is to merge with another dataset containing the values you want, though I don't know whether it's faster. Something like this: ref = pd.DataFrame({'id': [x], 'cycle': [y]}); pd.merge(df, ref, on=['id', 'cycle']). I don't know the general rules for how pandas works internally; you are probably doing twice the work with two different queries. – lib Jul 31 '15 at 16:23
  • I figured as much. I just don't know pandas very well yet, and I'm also concerned about what happens when the datasets aren't nice and small. There may be millions of rows later. If querying subsets means creating new objects from those subsets of the DataFrame, I could exhaust system memory unless I'm careful. Then again, for all I know pandas might have a DataFrame capacity limit, or the operating system will get cranky, or something else will go wrong when it expands like that. I'll figure that out when it comes to it. – Raishin Jul 31 '15 at 16:30
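
The merge idea from the comments can be sketched as follows, next to the boolean mask it is meant to replace. The sample rows and the `value` column are assumptions for illustration; `id` and `cycle` are the names used in the comment. A third variant is included because pandas does allow non-unique index values: the two columns can be set as a sorted MultiIndex and queried with `.loc`. Nothing here claims which variant is fastest; that would need timing on the real data.

```python
import pandas as pd

# Hypothetical data; the values and the "value" column are assumptions.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 3],
    "cycle": [10, 10, 10, 20, 20],
    "value": [0.1, 0.2, 0.3, 0.4, 0.5],
})
x, y = 1, 10

# 1. Boolean mask, as in the question.
masked = df[(df["id"] == x) & (df["cycle"] == y)]

# 2. Inner merge against a one-row reference frame, as in the comment.
ref = pd.DataFrame({"id": [x], "cycle": [y]})
merged = pd.merge(df, ref, on=["id", "cycle"])

# 3. Non-unique, sorted MultiIndex; the list of tuples keeps the result a DataFrame.
indexed = df.set_index(["id", "cycle"]).sort_index()
by_index = indexed.loc[[(x, y)]]

print(masked["value"].tolist())
print(merged["value"].tolist())
print(by_index["value"].tolist())
```

With the third variant the index is built once and reused across queries, which is the same trade-off as a hand-built lookup table: some extra memory for the indexed copy in exchange for cheaper repeated lookups.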

0 Answers