I have a very large dataframe (close to 1 million rows), which has a couple of meta data columns and one single column that contains a long string of triples. One string could look like this:
0,0,123.63;10,360,2736.11;30,270,98.08;...
That is, three values separated by comma and then separated by semicolon. Let us refer to the three values as IN, OUT, MEASURE. Effectively i want to group my data by the original columns + the IN & OUT columns and then sum over the MEASURE column. Since each long string contains roughly 30 triples my dataframe would grow to be ~30 million rows if i simply unstacked the data. Obviously this is not feasible.
So given a set of columns (which may in- or exclude the IN & OUT columns) over which I want to group and then sum my MEASURE data, how would I efficiently strip out the relevant data and sum everything up without blowing up my memory?
My current solution simply loops over each row and then over each triple and keeps a running total of each group I specified. This is very slow, so I am looking for something faster, perhaps vectorised. Any help would be appreciated.
Edit: Sample data below (columns separated by pipe)
DATE|REGION|PRIORITY|PARAMETERS
10-Oct-2016|UK|High|0,0,77.82;30,90,7373.70;
10-Oct-2016|US|Low|0,30,7.82;30,90,733.70;
11-Oct-2016|UK|High|0,0,383.82;40,90,713.75;
12-Oct-2016|NA|Low|40,90,937.11;30,180,98.23;
where PARAMETERS has the form 'IN,OUT,MEASURE;IN,OUT,MEASURE;...'
I basically want to (as an example) create a pivot table where
values=MEASURE
index=DATE, IN
columns=PRIORITY