0

I have a very large dataframe (close to 1 million rows), which has a couple of meta data columns and one single column that contains a long string of triples. One string could look like this:

0,0,123.63;10,360,2736.11;30,270,98.08;...

That is, three values separated by comma and then separated by semicolon. Let us refer to the three values as IN, OUT, MEASURE. Effectively i want to group my data by the original columns + the IN & OUT columns and then sum over the MEASURE column. Since each long string contains roughly 30 triples my dataframe would grow to be ~30 million rows if i simply unstacked the data. Obviously this is not feasible.

So given a set of columns (which may in- or exclude the IN & OUT columns) over which I want to group and then sum my MEASURE data, how would I efficiently strip out the relevant data and sum everything up without blowing up my memory?

My current solution simply loops over each row and then over each triple and keeps a running total of each group I specified. This is very slow, so I am looking for something faster, perhaps vectorised. Any help would be appreciated.

Edit: Sample data below (columns separated by pipe)

DATE|REGION|PRIORITY|PARAMETERS
10-Oct-2016|UK|High|0,0,77.82;30,90,7373.70;
10-Oct-2016|US|Low|0,30,7.82;30,90,733.70;
11-Oct-2016|UK|High|0,0,383.82;40,90,713.75;
12-Oct-2016|NA|Low|40,90,937.11;30,180,98.23;

where PARAMETERS has the form 'IN,OUT,MEASURE;IN,OUT,MEASURE;...'
I basically want to (as an example) create a pivot table where

values=MEASURE
index=DATE, IN
columns=PRIORITY
Phil-ZXX
  • 2,359
  • 2
  • 28
  • 40
  • 1
    It's not clear how do you want to group your data... Could you provide a sample data set (with 3-5 rows in a __text__ form) and desired data set? This will help us to understand what are you trying to achieve and help you... – MaxU - stand with Ukraine Oct 18 '16 at 16:38
  • 1
    Please read [How to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – root Oct 18 '16 at 16:40
  • Thanks for the sample data set - it's helpful. How would your desired data set would look like (for this particular input data set)? – MaxU - stand with Ukraine Oct 18 '16 at 16:58
  • Thanks, Max, for the formatting. As mentioned, i would ideally like to do a pivot table over the data as described. A lot of the rows are essentially the same (except for the PARAMETERS column of course) and I just want to sum everything up in order to decrease the size of the dataframe. – Phil-ZXX Oct 18 '16 at 17:05

0 Answers0