
I have a dataset of the following form (dropbox download, 23 KB CSV).

The sample rate of the data varies from second to second, from 0 Hz to over 200 Hz in some cases; the highest rate in the dataset provided is about 50 samples per second.

When samples are taken, they are always spread evenly across the second. For example,

time                   x
2012-12-06 21:12:40    128.75909883327378
2012-12-06 21:12:40     32.799224301545976
2012-12-06 21:12:40     98.932953779777989
2012-12-06 21:12:43    132.07033814856786
2012-12-06 21:12:43    132.07033814856786
2012-12-06 21:12:43     65.71691352191452
2012-12-06 21:12:44    117.1350194748169
2012-12-06 21:12:45     13.095622561808861
2012-12-06 21:12:47     61.295242676059246
2012-12-06 21:12:48     94.774064119961352
2012-12-06 21:12:49     80.169378222553533
2012-12-06 21:12:49     80.291142695702533
2012-12-06 21:12:49    136.55650749231367
2012-12-06 21:12:49    127.29790925838365

should be

time                        x
2012-12-06 21:12:40 000ms   128.75909883327378
2012-12-06 21:12:40 333ms    32.799224301545976
2012-12-06 21:12:40 666ms    98.932953779777989
2012-12-06 21:12:43 000ms   132.07033814856786
2012-12-06 21:12:43 333ms   132.07033814856786
2012-12-06 21:12:43 666ms    65.71691352191452
2012-12-06 21:12:44 000ms   117.1350194748169
2012-12-06 21:12:45 000ms    13.095622561808861
2012-12-06 21:12:47 000ms    61.295242676059246
2012-12-06 21:12:48 000ms    94.774064119961352
2012-12-06 21:12:49 000ms    80.169378222553533
2012-12-06 21:12:49 250ms    80.291142695702533
2012-12-06 21:12:49 500ms   136.55650749231367
2012-12-06 21:12:49 750ms   127.29790925838365

Is there an easy way to do this with the pandas time-series resampling functions, or is there something built into numpy or scipy that will work?

lab_notes

2 Answers


I don't think there is a built-in pandas or numpy method/function to do this.

However, I would favour using a Python generator:

def repeats(lst):
    i_0 = None
    n = -1 # will still work if lst starts with None
    for i in lst:
        if i == i_0:
            n += 1
        else:
            n = 0
        yield n
        i_0 = i
# list(repeats([1,1,1,2,2,3])) == [0,1,2,0,1,0]

Then you can put this generator into a numpy array:

import numpy as np
df['rep'] = np.array(list(repeats(df['time'])))

Count up the repeats:

from collections import Counter
count = Counter(df['time'])
df['count'] = df['time'].apply(lambda x: count[x])

and do the calculation (this is the most expensive part):

import datetime

df['time2'] = df.apply(lambda row: (row['time']
                                    + datetime.timedelta(seconds=1)
                                    * row['rep']
                                    / row['count']),
                       axis=1)

Note: to remove the calculation columns, use del df['rep'] and del df['count'].
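Putting the steps together, a minimal end-to-end sketch on a small inline frame (standing in for the CSV; the column names `time` and `x` are assumed from the question):

```python
import datetime
from collections import Counter

import numpy as np
import pandas as pd

def repeats(lst):
    # yields 0 for the first occurrence of a value, 1 for the next repeat, ...
    i_0 = None
    n = -1  # will still work if lst starts with None
    for i in lst:
        if i == i_0:
            n += 1
        else:
            n = 0
        yield n
        i_0 = i

df = pd.DataFrame({
    'time': pd.to_datetime(['2012-12-06 21:12:40'] * 3 + ['2012-12-06 21:12:44']),
    'x': [128.76, 32.80, 98.93, 117.14],
})

df['rep'] = np.array(list(repeats(df['time'])))
count = Counter(df['time'])
df['count'] = df['time'].apply(lambda t: count[t])
df['time2'] = df.apply(
    lambda row: row['time']
    + datetime.timedelta(seconds=1) * int(row['rep']) / int(row['count']),
    axis=1,
)
del df['rep'], df['count']
```

The three rows sharing `21:12:40` come out offset by 0, 1/3 and 2/3 of a second, while singleton timestamps are left unchanged.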


One "built-in" way might be to use shift twice, but I think this is going to be somewhat messier...
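For the record, here is a vectorized sketch along those lines; it uses groupby's cumcount/transform rather than shift, which I suspect comes out cleaner (column name `time` assumed from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'time': pd.to_datetime(['2012-12-06 21:12:49'] * 4),
    'x': [80.17, 80.29, 136.56, 127.30],
})

rep = df.groupby('time').cumcount()                   # 0, 1, 2, ... within each timestamp
count = df.groupby('time')['time'].transform('size')  # how many rows share the timestamp
df['time2'] = df['time'] + pd.to_timedelta(rep / count, unit='s')
```

On the four-row `21:12:49` group from the question this yields offsets of 0, 250, 500 and 750 ms.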

Andy Hayden
  • +1 for the nice generator but your last step where you calculate third will have to be a little trickier because not all his duplicates come in 3's. See the last set in his example where there are 4. – GaryBishop Dec 09 '12 at 15:54
  • @GaryBishop apologies I hadn't seen that, so made the simplification (it makes sense they wouldn't all come as thirds!) ... – Andy Hayden Dec 09 '12 at 16:36
  • @hayden I'm just going to test your solution on the data set but it looks very promising. – lab_notes Dec 09 '12 at 19:10

I found this an excellent use case for the pandas groupby mechanism, so I wanted to provide a solution for it as well. I find it slightly more readable than Andy's solution, but it's actually not that much shorter.

# First, get your data into a dataframe after having copied 
# it with the mouse into a multi-line string:

import pandas as pd
from io import StringIO

s = """2012-12-06 21:12:40    128.75909883327378
2012-12-06 21:12:40     32.799224301545976
2012-12-06 21:12:40     98.932953779777989
2012-12-06 21:12:43    132.07033814856786
2012-12-06 21:12:43    132.07033814856786
2012-12-06 21:12:43     65.71691352191452
2012-12-06 21:12:44    117.1350194748169
2012-12-06 21:12:45     13.095622561808861
2012-12-06 21:12:47     61.295242676059246
2012-12-06 21:12:48     94.774064119961352
2012-12-06 21:12:49     80.169378222553533
2012-12-06 21:12:49     80.291142695702533
2012-12-06 21:12:49    136.55650749231367
2012-12-06 21:12:49    127.29790925838365"""

sio = StringIO(s)
df = pd.read_csv(sio, parse_dates=[[0, 1]], sep=r'\s+', header=None)
df = df.set_index('0_1')
df.index.name = 'time'
df.columns = ['x']

So far, this was only data preparation, so if you want to compare the lengths of the solutions, do it from here on! ;)

# Now, groupby the same time indices:

grouped = df.groupby(df.index)

# Create yourself a second object
from datetime import timedelta
second = timedelta(seconds=1)

# loop over group elements, catch new index parts in list
l = []
for _,group in grouped:
    size = len(group)
    if size == 1:
        # go to pydatetime for later addition, so that list is all in 1 format
        l.append(group.index.to_pydatetime())
    else:
        offsets = [i * second / size for i in range(size)]
        l.append(group.index.to_pydatetime() + offsets)

# exchange index for new index
import numpy as np
df.index = pd.DatetimeIndex(np.concatenate(l))
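As a self-contained check of the loop above, run it on just the last group of the sample data (values abbreviated):

```python
from datetime import timedelta

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'x': [80.17, 80.29, 136.56, 127.30]},
    index=pd.to_datetime(['2012-12-06 21:12:49'] * 4),
)
df.index.name = 'time'

second = timedelta(seconds=1)
l = []
for _, group in df.groupby(df.index):
    size = len(group)
    if size == 1:
        # convert to pydatetime so every list entry has the same format
        l.append(group.index.to_pydatetime())
    else:
        offsets = [i * second / size for i in range(size)]
        l.append(group.index.to_pydatetime() + offsets)

df.index = pd.DatetimeIndex(np.concatenate(l))
```

The new index carries the four rows at 0, 250, 500 and 750 ms past the original second, matching the desired output in the question.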
K.-Michael Aye