67

What is the most recommended/pythonic way of handling live incoming data with pandas?

Every few seconds I'm receiving a data point in the format below:

{'time' :'2013-01-01 00:00:00', 'stock' : 'BLAH',
 'high' : 4.0, 'low' : 3.0, 'open' : 2.0, 'close' : 1.0}

I would like to append it to an existing DataFrame and then run some analysis on it.

The problem is that just appending rows with DataFrame.append can lead to performance issues with all that copying.
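For illustration only, a rough sketch of the copying concern (the loop and row count are made up; pd.concat is shown because it has the same per-call copying cost as append):

import pandas as pd

# Illustrative only: appending one row at a time copies the whole frame each
# time, so n single-row appends cost O(n**2) overall.
df = pd.DataFrame(columns=['high', 'low', 'open', 'close'])
for i in range(1000):
    row = pd.DataFrame([{'high': 4.0, 'low': 3.0, 'open': 2.0, 'close': 1.0}])
    df = pd.concat([df, row])   # copies all existing rows on every call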

Things I've tried:

A few people suggested preallocating a big DataFrame and updating it as data comes in:

In [1]: index = pd.date_range(start='2013-01-01 00:00:00', freq='S', periods=5)

In [2]: columns = ['high', 'low', 'open', 'close']

In [3]: df = pd.DataFrame(index=index, columns=columns)

In [4]: df
Out[4]: 
                    high  low open close
2013-01-01 00:00:00  NaN  NaN  NaN   NaN
2013-01-01 00:00:01  NaN  NaN  NaN   NaN
2013-01-01 00:00:02  NaN  NaN  NaN   NaN
2013-01-01 00:00:03  NaN  NaN  NaN   NaN
2013-01-01 00:00:04  NaN  NaN  NaN   NaN

In [5]: data = {'time' :'2013-01-01 00:00:02', 'stock' : 'BLAH', 'high' : 4.0, 'low' : 3.0, 'open' : 2.0, 'close' : 1.0}

In [6]: data_ = pd.Series(data)

In [7]: df.loc[data['time']] = data_

In [8]: df
Out[8]: 
                    high  low open close
2013-01-01 00:00:00  NaN  NaN  NaN   NaN
2013-01-01 00:00:01  NaN  NaN  NaN   NaN
2013-01-01 00:00:02    4    3    2     1
2013-01-01 00:00:03  NaN  NaN  NaN   NaN
2013-01-01 00:00:04  NaN  NaN  NaN   NaN

The other alternative is building a list of dicts: simply append the incoming data to a list and slice it into smaller DataFrames to do the work.

In [9]: ls = []

In [10]: for n in range(5):
   .....:     # Naive stuff ahead =)
   .....:     time = '2013-01-01 00:00:0' + str(n)
   .....:     d = {'time' : time, 'stock' : 'BLAH', 'high' : np.random.rand()*10, 'low' : np.random.rand()*10, 'open' : np.random.rand()*10, 'close' : np.random.rand()*10}
   .....:     ls.append(d)

In [11]: df = pd.DataFrame(ls[1:3]).set_index('time')

In [12]: df
Out[12]: 
                        close      high       low      open stock
time                                                             
2013-01-01 00:00:01  3.270078  1.008289  7.486118  2.180683  BLAH
2013-01-01 00:00:02  3.883586  2.215645  0.051799  2.310823  BLAH

or something like that, maybe processing the input a little bit more.

Marcelo MD
  • A lot of this depends on exactly what you are going to do with the data, i.e. do you need a frame of the ENTIRE series, for example. Another option is to append to an ``HDFStore``; then select what you need (you have to synchronize the read/write a bit, though). – Jeff May 24 '13 at 18:11
  • The usual: moving averages, Bollinger bands, plotting, etc. I would like to keep the whole data history mostly for plotting reasons. I'll look into HDFStores =) – Marcelo MD May 26 '13 at 22:28
  • The preallocated DataFrame should work. Quick check on the memory usage of df: 1,000,000 data points = about 50 years of second data (assuming business days and 10-hour trading days), so roughly 5 years of seconds with 1,000 stocks will use about 1 GB of memory. Depending on the application and resource constraints that might be acceptable or not. – Joop May 28 '13 at 10:06
  • I'm not really worried about total memory usage right now. I'm starting small with less than 50k points total. I'm more worried about allocating and reallocating over and over. It just seems wasteful. And this case (appending data and analyzing it live) is so common that I thought there should be a well defined pattern I was not aware of. – Marcelo MD May 28 '13 at 17:38
  • A comment on "just appending rows with DataFrame.append can lead to performance issues with all that copying": have you tried converting the required fields into tuples and then appending them to the DataFrame? Let me know if you need help with that. – LonelySoul May 30 '13 at 16:18
  • Like others have mentioned, unless you need to urgently act on things in realtime (which I doubt is the case if you're using python and only receiving OHLC data, and only every few seconds), breaking the process into two stages of 1. data retrieval/storage and 2. data processing would make life a lot easier. I have a script that receives hundreds of OHLC bars a second; saving them first to a list of dicts and periodically appending that to an HDFStore worked for me - I'm sure it could be further optimized, but already that improved CPU usage many times vs appending each row to a df or store. – fantabolous Aug 03 '14 at 04:39
  • There is a new project just heating up on [streaming dataframes](http://matthewrocklin.com/blog/work/2017/10/16/streaming-dataframes-1) by Matthew Rocklin. – Ted Petrou Nov 04 '17 at 18:30

2 Answers

27

I would use HDF5/pytables as follows:

  1. Keep the data as a python list "as long as possible".
  2. Append your results to that list.
  3. When it gets "big":
    • push to HDF5 Store using pandas io (and an appendable table).
    • clear the list.
  4. Repeat.

In fact, the function I define uses a list for each "key" so that you can store multiple DataFrames to the HDF5 Store in the same process.


We define a function which you call with each row d:

import pandas as pd

CACHE = {}
STORE = 'store.h5'   # Note: another option is to keep the actual file open

def process_row(d, key, max_len=5000, _cache=CACHE):
    """
    Append row d to the store 'key'.

    When the number of items in the key's cache reaches max_len,
    append the list of rows to the HDF5 store and clear the list.

    """
    # keep the rows for each key separate.
    lst = _cache.setdefault(key, [])
    if len(lst) >= max_len:
        store_and_clear(lst, key)
    lst.append(d)

def store_and_clear(lst, key):
    """
    Convert key's cache list to a DataFrame and append that to HDF5.
    """
    df = pd.DataFrame(lst)
    with pd.HDFStore(STORE) as store:
        store.append(key, df)
    lst.clear()

Note: we use the with statement to automatically close the store after each write. It may be faster to keep it open, but if so it's recommended that you flush regularly (closing flushes). Also note it may be more readable to use a collections.deque rather than a list, but the performance of a list will be slightly better here.
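For reference, a minimal sketch of that keep-it-open variant (the module-level store object and the explicit flush are illustrative, not part of the code above):

import pandas as pd

# Keep one store open for the whole job instead of reopening it on every write.
STORE = 'store.h5'
store = pd.HDFStore(STORE)

def store_and_clear(lst, key):
    """Variant of the function above that writes to the already-open store."""
    store.append(key, pd.DataFrame(lst))
    store.flush()   # flush regularly, since we never close between writes
    lst.clear()

# ... and close the store once, when the job is finished:
# store.close()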

Either way, to use this you call it as:

process_row({'time' :'2013-01-01 00:00:00', 'stock' : 'BLAH', 'high' : 4.0, 'low' : 3.0, 'open' : 2.0, 'close' : 1.0},
            key="df")

Note: "df" is the stored key used in the pytables store.

Once the job has finished, ensure you store_and_clear the remaining cache:

for k, lst in CACHE.items():  # you can instead use .iteritems() in python 2
    store_and_clear(lst, k)

Now your complete DataFrame is available via:

with pd.HDFStore(STORE) as store:
    df = store["df"]                    # other keys will be store[key]

Some comments:

  • 5000 can be adjusted; try some smaller/larger numbers to suit your needs.
  • List append is O(1), DataFrame append is O(len(df)).
  • Until you're doing stats or data-munging you don't need pandas; use what's fastest.
  • This code works with multiple keys (data points) coming in.
  • This is very little code, and we're staying with a vanilla python list and then a pandas DataFrame...

Additionally, to get up-to-date reads you could define a get method which stores and clears before reading. This way you would get the most up-to-date data:

def get_latest(key, _cache=CACHE):
    store_and_clear(_cache[key], key)
    with pd.HDFStore(STORE) as store:
        return store[key]

Now when you access with:

df = get_latest("df")

you'll get the latest "df" available.


Another option is slightly more involved: define a custom table in vanilla pytables, see the tutorial.

Note: You need to know the field-names to create the column descriptor.
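A minimal sketch of what that could look like (the `Tick` description, file name and column sizes below are illustrative assumptions, not code from the tutorial):

import tables as tb

class Tick(tb.IsDescription):
    # Field names and types must be declared up front.
    time  = tb.StringCol(19)    # e.g. '2013-01-01 00:00:00'
    stock = tb.StringCol(8)
    high  = tb.Float64Col()
    low   = tb.Float64Col()
    open  = tb.Float64Col()
    close = tb.Float64Col()

h5 = tb.open_file('ticks.h5', mode='w')
table = h5.create_table('/', 'ticks', Tick)

row = table.row
row['time'] = '2013-01-01 00:00:00'
row['stock'] = 'BLAH'
row['high'], row['low'], row['open'], row['close'] = 4.0, 3.0, 2.0, 1.0
row.append()      # queued in memory...
table.flush()     # ...and written to disk here
h5.close()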

Andy Hayden
  • I guess I would also say: **pandas is not a database**. If you want real-time inserts and analysis, use a database; if you can handle some delay (e.g. waiting for `max_len` rows to come in, OR running `store_and_clear` before you read df back in from the store and do your analysis)... in fact, the latter is not a bad idea. – Andy Hayden Dec 15 '15 at 06:31
  • Thanks. But with this method, you can't use pandas on the *latest* data of the last few seconds. Let's say I insert data today at 15:23:00, 15:23:01, 15:23:02, 15:23:04. Then if I want to access `df['2015-12-15 15:23:01', '2015-12-15 15:23:03']`, I can't. It means I have to write my own code to filter rows by time in `cache`, so we have to reinvent the wheel! – Basj Dec 15 '15 at 16:06
  • @Basj yes you can! See the `get_latest` method. But yes before you start analysis you have to update df using that method, there's no choice if you want something efficient for appending... DataFrame is NOT the right data-structure for appends. – Andy Hayden Dec 15 '15 at 16:55
  • @AndyHayden Hey how does this method compare against vanilla pytables approach in terms of performance? Or is it very case specific? If I am not wrong, HDFStore uses pytables internally. – Shravan Jun 11 '16 at 07:58
  • @Shravan HDFStore does use pytables internally. I have only good things to say about performance, but it could be that there are some edge cases/examples that might perform better when hand-optimized (or written in vanilla pytables). IMO the pandas api makes it super easy to get great perf with hdf5. – Andy Hayden Jun 13 '16 at 04:35
10

You are actually trying to solve two problems: capturing real-time data and analyzing that data. The first problem can be solved with Python logging, which is designed for this purpose. The second problem can then be solved by reading that same log file.
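A minimal sketch of that two-stage split, assuming the dict format from the question (the file name, CSV layout and helper names are arbitrary):

import logging
import pandas as pd

# Stage 1: capture -- write one CSV-formatted line per incoming data point.
logger = logging.getLogger('ticks')
logger.setLevel(logging.INFO)
handler = logging.FileHandler('ticks.csv')
handler.setFormatter(logging.Formatter('%(message)s'))
logger.addHandler(handler)

def capture(d):
    logger.info('%s,%s,%s,%s,%s,%s',
                d['time'], d['stock'], d['high'], d['low'], d['open'], d['close'])

# Stage 2: analysis -- read the log back whenever you want to crunch numbers.
def load():
    return pd.read_csv('ticks.csv',
                       names=['time', 'stock', 'high', 'low', 'open', 'close'],
                       parse_dates=['time'], index_col='time')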

Brent Washburne
  • I don't really understand why logging can help capture the time series data. Did you mean store the time series as strings and then parse the strings? Then why do we need logging at all? Shouldn't we just convert them into a proper data structure while they are incoming? – John Aug 26 '13 at 15:11
  • Logging is specifically designed to help capture time series data. It will handle all the buffering and file management for you. Pandas is designed to read large data files efficiently. In tandem, they can solve your data management problem. Plus, you have a copy of the data so you can double-check your work later. Here is a related post: http://stackoverflow.com/q/14262433/584846 – Brent Washburne Aug 26 '13 at 16:29
  • If you care about the data at all, then yes, use logging to build a very robust system that guarantees that everything is saved. Then separately have an analytics app that reads the logs in chunks every once in a while and appends the chunks to a DataFrame. If you don't care about the data and can always start over, then just aggregate the incoming data into chunks and append 100 or 1000 rows at a time. If you absolutely need to append the rows asap, at least append 1000 empty rows to allocate memory and then manually insert the data as if you just had numpy arrays. – Sergey Orshanskiy Apr 13 '15 at 01:31
  • Could somebody please explain the relative pros and cons of this Python logging solution, with respect to the HDF5/PyTables solution given in the other answer? The latter seems more useful for online analysis, due to its fancy indexing capabilities. Thanks. – James Paul Turner Oct 22 '17 at 17:23
  • Since real-time data is usually large and keeps growing, the ever-increasing file size will eventually slow down reading the file, which will delay the whole process. Thanks. – Upasana Mittal Aug 11 '18 at 08:46