
I have a huge panel dataset, more than 10 GB. The data looks like the following:

ID    Start_time  Factor  End_time
1001     1611      0.12   1861
1001     1612     -0.01   1862
...
1001     1861      0.57   2111
1001     1862      0.06   2112
...
1002     1200      0.01   1450
1002     1201      0.52   1451
...
1002     1450     -0.21   1700
1002     1451      0.30   1701
...

The data is sorted by ID and Start_time. For each row, I want to calculate the sum of Factor over the rows of the same ID from that row's Start_time up to its corresponding End_time.

An example of output:

ID    Start_time  Factor  End_time  Cumulative_factor
1001     1611      0.12   1861      0.12+(-0.01)+...+0.57
1001     1612     -0.01   1862      -0.01+...+0.57+0.06
...
1001     1861      0.57   2111      0.57+0.06+...
1001     1862      0.06   2112      0.06+...
...
1002     1200      0.01   1450      0.01+0.52+...+(-0.21)
1002     1201      0.52   1451      0.52+...+(-0.21)+0.30
...
1002     1450     -0.21   1700      -0.21+0.30+...
1002     1451      0.30   1701      0.30+...
...

Since I have more than 10 million observations, is there an efficient way to calculate it?
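
For reference, the straightforward row-by-row way of expressing this calculation (far too slow at this scale) would look roughly like the following sketch, assuming the data is already loaded in a pandas DataFrame `df`:

import pandas as pd

def add_cumulative(g):
    # for each row, sum Factor over the rows of the same ID whose Start_time
    # falls between that row's Start_time and End_time (inclusive)
    g = g.copy()
    g["Cumulative_factor"] = [
        g.loc[(g["Start_time"] >= s) & (g["Start_time"] <= e), "Factor"].sum()
        for s, e in zip(g["Start_time"], g["End_time"])
    ]
    return g

df = df.groupby("ID", group_keys=False).apply(add_cumulative)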

Brad
  • can you show your desired output? – Julien Spronck Apr 24 '15 at 20:45
  • the following article demonstrates how to calculate on a pandas `DataFrame` that is too big to fit into memory: https://plot.ly/ipython-notebooks/big-data-analytics-with-pandas-and-sqlite/ – aensm Apr 24 '15 at 20:52
  • Since the data is sorted and looks like `End_time-Start_time == 250`, you could try rolling window sum? – Zero Apr 24 '15 at 20:54
  • Thank you @JohnGalt, yes, but that is only part of the data; most of it still has a different window. – Brad Apr 24 '15 at 21:16
  • @JulienSpronck, The example of output is the desired output I want, could you please let me know if you still need something more? – Brad Apr 24 '15 at 21:19
  • @Brad sorry I totally missed it :-) – Julien Spronck Apr 24 '15 at 21:20
  • I don't think there is going to be an inexpensive way to do this. I guess you could reindex to normalize the windowsize, unclear how expensive that would be (depends on your data). – Andy Hayden Apr 24 '15 at 21:48
  • possible duplicate of ["Large data" work flows using pandas](http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas) – Alexander Apr 24 '15 at 21:58

2 Answers


Note: part of this is a reversed cumsum. Unfortunately, there is no keyword argument for cumsum (and friends) to do it in reverse; I think this could make a good enhancement.

You can reverse the DataFrame quite cheaply with iloc, though:

df.iloc[::-1].cumsum().iloc[::-1]
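
To do this per ID, as the question needs, one possibility (a sketch, not something from the enhancement note above) is to reverse first and let groupby keep the groups separate:

df["rev_cumsum"] = df.iloc[::-1].groupby("ID")["Factor"].cumsum().iloc[::-1]

The trailing reversal only restores the original row order; the assignment aligns on the index either way.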

This is very similar to a problem of tracking currently open/closed tickets. Rather than perform this on a row-by-row basis, the trick is to do it in two passes over the group: cumulatively sum the opened tickets and subtract the closed ones (the subtraction gets you the currently open tickets).

In [11]: df
Out[11]:
     ID  Start_time  Factor  End_time
0  1001        1611    0.12      1861
1  1001        1612   -0.01      1862
2  1001        1861    0.57      2111
3  1001        1862    0.06      2112

Now, what we can do is take a cumsum keyed by Start_time and a cumsum keyed by End_time, and subtract one from the other:

In [12]: open = df.set_index("Start_time")["Factor"].cumsum()

In [13]: closed = df.set_index("End_time")["Factor"].cumsum()

In [14]: open.sub(closed, fill_value=0)
Out[14]:
1611    0.12
1612    0.11
1861    0.56
1862    0.63
2111   -0.68
2112   -0.74
Name: Factor, dtype: float64

This isn't quite what you want, but it should point you in the right direction.
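
One way to go from here to the exact per-row window sum is to combine a forward cumsum with numpy's searchsorted within each ID group; a minimal sketch, assuming each group is sorted by Start_time:

import numpy as np

def window_sum(g):
    # g is one ID group, already sorted by Start_time
    cum = g["Factor"].cumsum().values
    start = g["Start_time"].values
    end = g["End_time"].values
    # index of the last row whose Start_time is still <= this row's End_time
    j = np.searchsorted(start, end, side="right") - 1
    g = g.copy()
    # the sum over rows i..j equals cum[j] - cum[i] + Factor[i]
    g["Cumulative_factor"] = cum[j] - cum + g["Factor"].values
    return g

result = df.groupby("ID", group_keys=False).apply(window_sum)

Each group is touched once and the heaviest step is the searchsorted, so this should still be reasonable at 10 million rows.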

Andy Hayden
  • Yes, I think your method solves part of the problem; if the time gap is the same, then I can use this method. Thanks – Brad Apr 29 '15 at 13:41
  • @Brad I thought I said above, I don't think it's clear what you actually want here. It seems to me very similar to this example. Where is the "gap" involved? – Andy Hayden Apr 30 '15 at 00:41

Sorry if I come across as a bit of a smartass here. "Big data" is, by definition, data that does not fit in memory, and at 10 GB yours arguably qualifies. By the same definition, pandas cannot handle "big data", since it does everything in memory.

The proper way to handle a big-data problem is to use map/reduce, and the problem you are having can easily be solved that way. Since your data is already sorted by Start_time, as you say, you can simply stream it through a reducer to get the cumulative sum for each Start_time.

If that sounds like black magic to you, don't worry, it's not.

  1. Save the data as CSV if you don't already have it in that form, and gzip it so that you can stream it with zcat.
  2. Write a reducer, just a Python file, which reads stdin line by line (see the sketch after this list).
  3. Keep summing the data into a single variable.
  4. Print that variable every time the key (ID) changes.
  5. zcat the file, pipe it through | head to check that it does what you want, and keep developing it until it fits your purpose.
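
A minimal sketch of such a reducer, assuming whitespace-separated columns as in the sample above (it only demonstrates the streaming pattern of steps 2-4; the per-row windowed sums from the question would need a little more bookkeeping):

#!/usr/bin/env python
# reducer.py -- usage: zcat data.gz | python reducer.py | head
import sys

current_id = None   # key (ID) of the group currently being summed
total = 0.0         # running sum of Factor for that key

for line in sys.stdin:
    fields = line.split()
    if not fields or fields[0] == "ID":   # skip blank lines and the header row
        continue
    key, factor = fields[0], float(fields[2])
    if key != current_id:
        if current_id is not None:
            print("%s\t%.6f" % (current_id, total))   # key changed: emit the finished sum
        current_id, total = key, 0.0
    total += factor

if current_id is not None:
    print("%s\t%.6f" % (current_id, total))           # flush the last key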

That should get you the basic functionality, and this is the most memory-efficient method you can use, since it never holds the whole file in memory, only one record at a time.

If you are not yet comfortable with Unix streams, this will all blow your mind, and you will love it in no time.

TL;DR: Pandas is the wrong tool here; read up on map/reduce.

firelynx