
I have a Pandas dataframe and I am continually appending a row of data each second as below.

df.loc[time.strftime("%Y-%m-%d %H:%M:%S")] = [reading1, reading2, reading3]
>>>df
                     sensor1 sensor2 sensor3
2015-04-14 08:50:23    5.4     5.6     5.7
2015-04-14 08:50:24    5.5     5.6     5.8
2015-04-14 08:50:26    5.2     5.3     5.4

If I continue this, eventually I am going to start experiencing memory issues (each append involves the whole DataFrame).

I only need to keep the most recent X rows of data, i.e. after the operation it will be:

>>>df
                     sensor1 sensor2 sensor3
(this row is gone)
2015-04-14 08:50:24    5.5     5.6     5.8
2015-04-14 08:50:26    5.2     5.3     5.4
2015-04-14 08:50:27    5.2     5.4     5.6

Is there a way I can specify a maximum number of rows, so that when any subsequent row is added, the oldest row is deleted at the same time, WITHOUT an explicit "check length of DataFrame; if length of DataFrame > X, remove first row; append new row" step?

Like this, but for a Pandas DataFrame: https://stackoverflow.com/a/10155753/4783578

ps.george
  • Unclear why you need to keep appending but if you want to impose a limit why not just check `len(df)` and not append once the len reaches your limit? – EdChum Apr 13 '15 at 15:25
  • The other thing you could do is, after appending and if the length is larger than your limit, assign back a slice: `df = df.iloc[-500:]` would assign the last 500 rows back to df – EdChum Apr 13 '15 at 15:27
  • @EdChum Something like a buffer dataframe of `len(df)==X`: whenever a new row of data comes in, the older or first row is removed, like FIFO, to preserve the buffer size. – Zero Apr 13 '15 at 15:27
  • Instead of appending, why not replace the old row with the new row? You probably need to sort each time, but I imagine that's still going to be faster than actions that expand and shrink the dataframe. And as always, a small example dataframe with code makes it easier to get a good answer. – JohnE Apr 13 '15 at 17:19

3 Answers


One way would be to pre-allocate the rows, and replace the values cyclically.

import numpy as np
import pandas as pd

# Say we want to limit to a thousand rows
N = 1000

# Create the DataFrame with N rows and 5 columns -- all NaNs
data = pd.DataFrame(np.full((N, 5), np.nan))

# To check the length of the DataFrame, we'll need to .dropna().
len(data.dropna())              # Returns 0

# Keep a running counter of the next index to insert into
counter = 0

# Insertion always happens at that counter
data.loc[counter, :] = np.random.rand(5)

# ... and increment the counter, but when it exceeds N, set it to 0
counter = (counter + 1) % N

# Now, the DataFrame contains one row
len(data.dropna())              # Returns 1

# We can add several rows one after another. Let's add twice as many as N
for row in np.random.rand(2 * N, 5):
    data.loc[counter, :] = row
    counter = (counter + 1) % N

# Now that we added them, we still have only the last N rows
len(data)                       # Returns N

This avoids the need to delete or move any rows, and appending is fast. However, reading from the data can be slower if:

  • The order of the data matters. If you need the rows back in insertion order, you have to slice `data` using `counter` to reconstruct it (see the sketch below).
  • The number of rows is small. If you end up appending fewer rows than N, you'll need .dropna() (or count the total inserted rows) to remove the unused ones.

In most of the scenarios that I deal with where truncated append performance matters, neither of the above is true, but your scenario may be different. In that case, @Alexander has a good solution involving .shift().
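
If the order does matter, here is a minimal sketch of reconstructing it from the buffer above (reusing the `data`, `counter` and `N` names from the snippet):

# Rows from `counter` onward are the oldest, rows before `counter` are the newest,
# so concatenating the two slices restores insertion order.
ordered = pd.concat([data.iloc[counter:], data.iloc[:counter]])

# If fewer than N rows were ever inserted, the untouched slots are still all-NaN.
ordered = ordered.dropna(how='all')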

S Anand

pandas stores data in arrays. Doing the sort of operation you want inherently requires a copy for an array data structure. Since data is stored in contiguous (or strided) memory, adding something to the end and removing something from the beginning requires copying everything to a new region of memory. There is no way around this. You need to use a different data structure.

Edit: Thinking about this a bit more, I see two approaches to do this.

The simplest and most straightforward would be to use a collections.deque of tuples. You can just append a new tuple to the end, and if the deque gets too full it will drop the corresponding one from the beginning. At the end, you can just convert them into a DataFrame. I am just using the for loop as an example; I gather you get your data a different way, but it wouldn't matter:

import time

import pandas as pd
from collections import deque

maxlen = 1000

dq = deque(maxlen=maxlen)

for reading1, reading2, reading3 in readings:
    dq.append(pd.Series([reading1, reading2, reading3], 
                        index=['sensor1', 'sensor2', 'sensor3'], 
                        name=time.strftime("%Y-%m-%d %H:%M:%S")))

df = pd.concat(dq, axis=1).T
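
The deque of tuples mentioned above can also be used directly, which skips building a Series per reading and converts once at the end (a sketch; the column names are illustrative and `readings` is a placeholder iterable, as in the loop above):

import time
from collections import deque

import pandas as pd

readings = [(5.4, 5.6, 5.7), (5.5, 5.6, 5.8)]   # example data standing in for the real feed

dq = deque(maxlen=1000)

# Each reading becomes a plain tuple; the deque drops the oldest automatically.
for reading1, reading2, reading3 in readings:
    dq.append((time.strftime("%Y-%m-%d %H:%M:%S"), reading1, reading2, reading3))

df = pd.DataFrame(list(dq), columns=['time', 'sensor1', 'sensor2', 'sensor3'])
df = df.set_index('time')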

The second approach is to use a DataFrame of a fixed size, and use the modulo of the maximum length to choose the place to overwrite, while also keeping the item number in the DataFrame so you can sort by it. In your case you could conceivably sort by time, but this approach is more general. As with the previous example, I will use a for loop to demonstrate, but you probably don't have one. Further, I will assume that you don't have a real iterable you can enumerate; if you do, you don't need to keep track of the index number as I do here (see the sketch after this code):

import time

import numpy as np
import pandas as pd

maxlen = 1000

# Use object dtype so the 'time' column can hold strings alongside the numbers
df = pd.DataFrame(np.full((maxlen, 5), np.nan, dtype=object),
                  columns=['index', 'time',
                           'sensor1', 'sensor2', 'sensor3'])

i = 0
for reading1, reading2, reading3 in readings:
    df.loc[i % maxlen, :] = [i, time.strftime("%Y-%m-%d %H:%M:%S"),
                             reading1, reading2, reading3]
    i += 1

df.sort_values('index', inplace=True)
del df['index']
df.set_index('time', drop=True, inplace=True)
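
If the readings come from an iterable you can enumerate, as noted above, the insertion loop needs no manual counter (a sketch reusing `df`, `maxlen` and `readings` from the snippet):

for i, (reading1, reading2, reading3) in enumerate(readings):
    df.loc[i % maxlen, :] = [i, time.strftime("%Y-%m-%d %H:%M:%S"),
                             reading1, reading2, reading3]
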
TheBlackCat
  • I guess I understood it the other way round. Say you have 10 rows: when 1 new row comes in, you remove the first row from the top and add the new row at the bottom, so the size of the dataframe always remains 10. Hence, taking back my solution. Thanks for clarifying. – Zero Apr 13 '15 at 15:57
  • @TheBlackCat What different data structure would you suggest? It doesn't necessarily need to be pandas, pandas just seemed convenient for analysis: `df.plot()` and `to_csv` – ps.george Apr 14 '15 at 08:13

This example initializes a DataFrame equal to the max size and fills it with Nones. It then iterates over a list of new rows, first shifting the original DataFrame and then appending the new row to the end. You didn't specify how you wanted to treat the index, so I ignored it.

import numpy as np
import pandas as pd

max_rows = 5
cols = list('AB')

# Initialize empty DataFrame
df = pd.DataFrame({c: np.repeat([None], [max_rows]) for c in cols})

new_rows = [pd.DataFrame({'A': [1], 'B': [10]}), 
            pd.DataFrame({'A': [2], 'B': [11]}),
            pd.DataFrame({'A': [3], 'B': [12]}),
            pd.DataFrame({'A': [4], 'B': [13]}),
            pd.DataFrame({'A': [5], 'B': [14]}),
            pd.DataFrame({'A': [6], 'B': [15]}),
            pd.DataFrame({'A': [7], 'B': [16]})]

for row in new_rows:
    df = df.shift(-1)
    df.iloc[-1, :] = row.values

>>> df
   A   B
0  3  12
1  4  13
2  5  14
3  6  15
4  7  16

Let's use a real example with one year of stock prices for AAPL.

from datetime import timedelta
from pandas_datareader.data import DataReader  # lived in pandas.io.data when this answer was written

aapl = DataReader("AAPL", data_source="yahoo", start="2014-1-1", end="2015-1-1")
cols = aapl.columns
df = pd.DataFrame({c: np.repeat([None], [max_rows]) for c in aapl.columns})[cols]
# Initialize a datetime index
df.index = pd.date_range(end=aapl.index[0] + timedelta(days=-1), periods=max_rows, freq='D')

for timestamp, row in aapl.iterrows():
    df = df.shift(-1)
    df.iloc[-1, :] = row.values
    idx = df.index[:-1].tolist()
    idx.append(timestamp)
    df.index = idx

>>> df
              Open    High     Low   Close       Volume Adj Close
2013-12-28  112.58  112.71  112.01  112.01  1.44796e+07    111.57
2013-12-29   112.1  114.52  112.01  113.99   3.3721e+07    113.54
2013-12-30  113.79  114.77   113.7  113.91  2.75989e+07    113.46
2013-12-31  113.64  113.92  112.11  112.52  2.98815e+07    112.08
2014-12-31  112.82  113.13  110.21  110.38  4.14034e+07    109.95
Alexander
  • Best answer to the original question. Is there a way to get this to work if I am indexing with timestamps? `df = df.tail(maxlen)` is working for me. – ps.george Apr 14 '15 at 08:21
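
For reference, the append-then-truncate approach from that last comment might look roughly like this (a sketch; `reading1`..`reading3` stand in for the values from the question, and the whole frame is still copied on each append):

import time

import pandas as pd

maxlen = 1000
df = pd.DataFrame(columns=['sensor1', 'sensor2', 'sensor3'])

reading1, reading2, reading3 = 5.4, 5.6, 5.7   # example values for one reading

# Append the new reading under its timestamp, then keep only the newest maxlen rows.
df.loc[time.strftime("%Y-%m-%d %H:%M:%S")] = [reading1, reading2, reading3]
df = df.tail(maxlen)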