
Community,

Objective: I'm running a Pi project (in Python) that communicates with an Arduino to get data from a load cell once a second. What data structure should I use to log this data (and do real-time analysis on it) in Python?

I want to be able to do things like:

  1. Slice the data to get the value of the last logged datapoint.
  2. Slice the data to get the mean of the datapoints for the last n seconds.
  3. Perform a regression on the last n data points to get g/s.
  4. Remove from the log data points older than n seconds.

Current Attempts:

Dictionaries: I have appended new keys (rounded timestamps) to a dictionary (see below), but this makes slicing and analysis hard.

import time

log = {}

def log_data():
    # Key each reading by its rounded timestamp
    log[round(time.time(), 4)] = read_data()
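For example, just getting the readings from the last n seconds means filtering the keys by hand (a rough sketch, with `n` as a window length in seconds):

# Keep only the values seen in the last n seconds
now = time.time()
recent = [v for t, v in log.items() if now - t < n]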

Pandas DataFrame: this was the one I was hoping for, because it makes time-series slicing and analysis easy, but this (How to handle incoming real time data with python pandas) seems to say it's a bad idea. I can't follow their solution (i.e. storing in a dictionary and df.append()-ing in bulk every few seconds) because I want my rate calculations (regressions) to be in real time.
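For comparison, this is the kind of slicing a DataFrame would give me (a sketch with made-up readings; the 'grams' column name is just for illustration):

import pandas as pd

df = pd.DataFrame({'grams': [10.1, 10.4, 10.9]},
                  index=pd.to_datetime([0, 1, 2], unit='s'))

# Mean of the last 2 seconds, via the DatetimeIndex
cutoff = df.index[-1] - pd.Timedelta(seconds=2)
df.loc[df.index > cutoff, 'grams'].mean()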

This question (ECG Data Analysis on a real-time signal in Python) seems to describe the same problem as mine, but with no real solutions.

Goal:

So what is the proper way to handle and analyze real-time time-series data in Python? It seems like something everyone would need to do, so I imagine there has to be pre-built functionality for this?

Thanks,

Michael

  • Have you looked at deques? https://docs.python.org/2/library/collections.html#deque-objects Also, this post is really quite broad. You should consider narrowing the scope. – Alexander Jun 16 '16 at 19:07
  • I think deques would be better than a list, but they really only solve *problem 1*. In *problems 2* and *4* I am selecting by the last n seconds, not the last n items @Alexander – Michael Molter Jun 17 '16 at 13:57
  • By the way -- you mention that the data comes in once / second. So in your application, can't you just ignore the timestamps and just slice based on the count? That would make things a little simpler. In other words, if you want the last 5 seconds, can you just do `data[-5:]` and assume that the last 5 items are the last 5 seconds? – exp1orer Jun 17 '16 at 21:52
  • I want to stay flexible in case I decide to sample more often. – Michael Molter Jun 19 '16 at 18:03

1 Answer


To start, I would question two assumptions:

  1. You mention in your post that the data comes in once per second. If you can rely on that, you don't need the timestamps at all -- finding the last N data points is exactly the same as finding the data points from the last N seconds.
  2. You have a constraint that your summary data needs to be absolutely 100% real time. That may make life more complicated -- is it possible to relax that at all?
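If assumption 1 holds, a collections.deque with maxlen (as Alexander suggested in the comments) gives you a fixed-size window for free. A quick sketch:

import time
from collections import deque

window = deque(maxlen=60)            # keeps only the newest 60 readings (60 s at 1 Hz)

value = 10.0                         # stand-in for a fresh load-cell reading
window.append((time.time(), value))  # the oldest entry falls off automatically when full
window[-1]                           # the last logged datapoint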

Anyway, here's a very naive approach using a list. It satisfies your needs. Performance may become a problem depending on how many of the previous data points you need to store.

Also, you may not have thought of this, but do you need the full record of past data? Or can you just drop stuff?

import time
from statistics import mean

data = []

# New data comes in: store (timestamp, value) pairs
timestamp, value = time.time(), read_data()  # read_data() as in the question
data.append((timestamp, value))

current_time = time.time()

# Slice the data to get the value of the last logged datapoint.
data[-1]

# Slice the data to get the mean of the datapoints for the last n seconds.
mean(v for t, v in data if current_time - t < n)

# Perform a regression on the last n data points to get g/s.
regression_function(data[-n:])

# Remove from the log data points older than n seconds.
data = [obs for obs in data if current_time - obs[0] < n]
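`regression_function` above is a placeholder. One possible stand-in, assuming NumPy is available, is a least-squares slope of value against time:

import numpy as np

def regression_function(points):
    # Fit value = slope * time + intercept; the slope is the rate in g/s
    times, values = zip(*points)
    slope, intercept = np.polyfit(times, values, 1)
    return slope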
exp1orer
  • 100% real-time is not a necessity, because the calculations will be updating information on the GUI; however, I want it real-time 'enough' that the user wouldn't notice a delay (i.e. within 2 seconds). – Michael Molter Jun 20 '16 at 17:04