3

I'm using pandas.DataFrame to store 3 hours of sensor data sampled at the second interval. So each second, I'm adding a row and dropping rows older than 3 hours.

Currently, I'm doing it very inefficiently:

record = pd.DataFrame.from_records([record], index='Date')
if self.data.empty:
    #logger.debug('Creating data log')
    self.data = record
else:
    #logger.debug('Appending new record')
    self.data = self.data.append(record)
start = now - self.keepInMemory
self.data = self.data[self.data.index > start]

Namely, a new DataFrame is created, then it is appended, and then old records are removed. It is slow, inefficient, and certainly does a lot of memory reallocation.

What I'm looking for is:

  • Pre-allocated DataFrame
  • Remove old records (without reallocation)
  • Add new record

What is the most Panda-ish way to accomplish that?

Thank you.

P.s. The only relevant question on SO I managed to find was: deque in python pandas but it didn't help.

Update: Using DataFrame and not deque is a requirement, since other modules use self.data as a service for computing generic conditions, e.g. ('is std() of last 15 minutes differs from this of the first' and similar). To stress, it's not just for recording data, it's for providing ability to for other modules to compute various generic conditions efficiently and conveniently.

I suspect there might be a clever play with indices (e.g. data.index = np.roll(data.index,1)) and then replacing the last row inplace, but until now I could not figure out how to do that efficiently. New record has the same format as the rest, so it should be possible.

Community
  • 1
  • 1
fAX
  • 1,451
  • 1
  • 10
  • 11
  • 1
    Is using a pandas DataFrame a requirement? Using `deque` is an appropriate object type for your project. – James Jan 07 '17 at 22:07
  • can you post a sample data set? How does your `record` look like? Is it always only one row? Is it necessarry to have a TimeStamp as an index? – MaxU - stand with Ukraine Jan 07 '17 at 23:00
  • James, yes, this is requirement, since other modules use this DataFrame as a service for computing generic conditions, e.g. ('is std() of last 15 minutes differs from this of the first' and similar). It's not just for recording data, it's for providing ability to for other modules to compute various generic conditions efficiently and conveniently. – fAX Jan 08 '17 at 09:00
  • Max, record is always one row, but data structure may vary between invocations (during the single run it is uniform). – fAX Jan 08 '17 at 09:01

1 Answers1

0

work in progress

See comments below. I'll leave answer as it is until I can work something out. I don't want anyone to think that this solves the problem.


Consider the dataframe df with timeseries index tidx. tidx starts off with 70 days worth of dates.

tidx = pd.date_range('2011-03-01', periods=70)
df = pd.DataFrame(dict(A=np.arange(70)), tidx)

Suppose we get a new time stamp for which we will record new data. I happened to just add a day to the max of existing days, but this shouldn't matter. We can append a new row by assigning a series to df at row with index value equal to the new date. We use loc to do this.

This operation should be inplace fairly efficient.

new_index = df.index.max() + pd.offsets.Day()
df.loc[new_index] = pd.Series([99], df.columns)

Now we can define the amount of time you want to keep with a pd.offsets object. I chose 60 days for demonstration purposes. Three hours would have been pd.offsets.Hour(3). I find the index values that are too old and I drop them... again, inplace

keep = pd.offsets.Day(60)
drops = df.index[df.index < (df.index.max() - keep)]

df.drop(drops, inplace=True)

You should be able to apply this and should be more efficient then what you're doing.

piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • 1
    why don't you show if this actually is more efficient; you will find it's not – Jeff Jan 08 '17 at 07:04
  • 2
    What Jeff is getting at: In place operations in pandas aren't guaranteed to actually be in place. It really only guarantees that, semantically, the object you're calling the method on will be altered from Python's perspective. Pandas will sometimes implement `df.method(inplace=True)` as `df = df.method()` in the background. (The copy is still happening, you're just not seeing it.) See Jeff's answer here: http://stackoverflow.com/questions/22532302/pandas-peculiar-performance-drop-for-inplace-rename-after-dropna – Tyler Jan 08 '17 at 09:20