1

I have a dataframe which looks like the following:

 START_TIME   END_TIME     TRIAL_No        itemnr
 2403950      2413067      Trial: 1        P14
 2413378      2422499      Trial: 2        P03
 2422814      2431931      Trial: 3        P13
 2432246      2441363      Trial: 4        P02
 2523540      2541257      Trial: 5        P11
 2541864      2560297      Trial: 6        P10
 2560916      2577249      Trial: 7        P05

The table goes on and on like that. The START_TIME and END_TIME are all in milliseconds which are the start and end time of a trial. So what I want to do is, I want to resample the START_TIME into 100milliseconds bin itme and interpolate the variables (TRIAL_No and itemnr) between each START_TIME and END_TIME. Outside of these regions, these variables should have the value "NA". For example, for the first row the START_TIME is 2403950 and the END_TIME is 2413067. The difference between them is 9117 milliseconds. So "Trial: 1" stays for 9117msecs which is for aroud 91 bin times since each bin time is 100msec apart. So I want to repeat "Trial_1" and "P14" 91 times in the resulting dataframe. The same goes for the rest. Looks like the following:

Bin_time     TRIAL_No    itemnr
2403950      Trial: 1    P14
2404050      Trial: 1    P14
2404150      Trial: 1    P14
            ...
2413050      Trial: 1    P14
2413150      Trial: 2    P03
2413250      Trial: 2    P03

and so on. I am not sure if it is possible directly in pandas or some preprocessing is needed.

jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
Same
  • 759
  • 2
  • 9
  • 15

1 Answers1

1

After creating new dataframe by concat dataframes I can group it by row and apply resample on each of these groups (with method ffill to forward fill).

print df
#   START_TIME  END_TIME  TRIAL_No itemnr
#0     2403950   2413067  Trial: 1    P14
#1     2413378   2422499  Trial: 2    P03
#2     2422814   2431931  Trial: 3    P13
#3     2432246   2441363  Trial: 4    P02
#4     2523540   2541257  Trial: 5    P11
#5     2541864   2560297  Trial: 6    P10
#6     2560916   2577249  Trial: 7    P05

#PREDPROCESSING
#helper column for matching start and end rows
df['row'] = range(len(df))

#reshape to df - every row two times repeated for each date of START_TIME and END_TIME
starts = df[['START_TIME','TRIAL_No','itemnr','row']].rename(columns={'START_TIME':'Bin_time'})
ends = df[['END_TIME','TRIAL_No','itemnr','row']].rename(columns={'END_TIME':'Bin_time'})
df = pd.concat([starts, ends])
df = df.set_index('row', drop=True)
df = df.sort_index()

#convert miliseconds to timedelta for resampling by time 100ms
df['Bin_time'] = df['Bin_time'].astype('timedelta64[ms]')
print df
#           Bin_time  TRIAL_No itemnr
#row                                 
#0   00:40:03.950000  Trial: 1    P14
#0   00:40:13.067000  Trial: 1    P14
#1   00:40:13.378000  Trial: 2    P03
#1   00:40:22.499000  Trial: 2    P03
#2   00:40:22.814000  Trial: 3    P13
#2   00:40:31.931000  Trial: 3    P13
#3   00:40:32.246000  Trial: 4    P02
#3   00:40:41.363000  Trial: 4    P02
#4   00:42:03.540000  Trial: 5    P11
#4   00:42:21.257000  Trial: 5    P11
#5   00:42:21.864000  Trial: 6    P10
#5   00:42:40.297000  Trial: 6    P10
#6   00:42:40.916000  Trial: 7    P05
#6   00:42:57.249000  Trial: 7    P05

print df.dtypes
#Bin_time    timedelta64[ms]
#TRIAL_No             object
#itemnr               object
#dtype: object
#resample and fill missing data 
df = df.groupby(df.index).apply(lambda x: x.set_index('Bin_time').resample('100ms',how='first',fill_method='ffill'))

df = df.reset_index()
df = df.drop(['row'], axis=1)

#convert timedelta to integer back
df['Bin_time'] = (df['Bin_time'] / np.timedelta64(1, 'ms')).astype(int)

print df.head()
#  Bin_time  TRIAL_No itemnr
#0  2403950  Trial: 1    P14
#1  2404050  Trial: 1    P14
#2  2404150  Trial: 1    P14
#3  2404250  Trial: 1    P14
#4  2404350  Trial: 1    P14

EDIT:

If you want get NaN outside of groups, you can change code after groupby:

#resample and fill missing data 
df = df.groupby(df.index).apply(lambda x: x.set_index('Bin_time').resample('100ms', how='first',fill_method='ffill'))

#reset only first level - drop index row
df = df.reset_index(level=0, drop=True)
#resample by 100ms, outside are NaN
df = df.resample('100ms', how='first')
df = df.reset_index()
#convert timedelta to integer back
df['Bin_time'] = (df['Bin_time'] / np.timedelta64(1, 'ms')).astype(int)

print df
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thanks. It worked. I have another question in another thread but no one seem to answer me. So if you have time you can take a look at it. here is the [link](http://stackoverflow.com/questions/33898674/python-pandas-resampling) – Same Nov 27 '15 at 12:54
  • I did accept it. Don't forget to look at my another question. Here is the [link](http://stackoverflow.com/questions/33898674/python-pandas-resampling) – Same Nov 27 '15 at 13:05
  • Another answer is done, please check it, if i understand correctly it. Thanks. – jezrael Nov 27 '15 at 13:36
  • I have closely checked the output of this answer and I saw that each start and end time are grouped and resampled in their own group. But is it possible that ever entry in the bin time are 100ms apart. and the values for the other columns outside the region will be "NA"? – Same Nov 27 '15 at 14:34
  • Perfect. Thanks a lot. – Same Nov 27 '15 at 15:05