
This is related to a previous question: Python pandas change duplicate timestamp to unique, hence the similar name of this one.

The additional requirement is to handle a variable number of duplicates per second and space them out evenly within each second, i.e. with n rows sharing a timestamp, the k-th of them (counting from 0) gets shifted by k/n of a second. For example:

....
2011/1/4    9:14:00
2011/1/4    9:14:00
2011/1/4    9:14:01
2011/1/4    9:14:01
2011/1/4    9:14:01
2011/1/4    9:14:01
2011/1/4    9:14:01
2011/1/4    9:15:02
2011/1/4    9:15:02
2011/1/4    9:15:02
2011/1/4    9:15:03
....

Should become:

....
2011/1/4    9:14:00
2011/1/4    9:14:00.500
2011/1/4    9:14:01
2011/1/4    9:14:01.200
2011/1/4    9:14:01.400
2011/1/4    9:14:01.600
2011/1/4    9:14:01.800
2011/1/4    9:15:02
2011/1/4    9:15:02.333
2011/1/4    9:15:02.666
2011/1/4    9:15:03
....

I am stumped on how to deal with the variable number of duplicates.

I thought along the lines of a groupby(), but couldn't get it right. I was thinking that this is a common enough use-case to have been solved already.
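For reference, the sample above can be reconstructed with something along these lines (a sketch only; time is just an assumed column name):

import pandas as pd

# rebuild the example data from the question
df = pd.DataFrame({'time': pd.to_datetime([
    '2011-01-04 09:14:00', '2011-01-04 09:14:00',
    '2011-01-04 09:14:01', '2011-01-04 09:14:01', '2011-01-04 09:14:01',
    '2011-01-04 09:14:01', '2011-01-04 09:14:01',
    '2011-01-04 09:15:02', '2011-01-04 09:15:02', '2011-01-04 09:15:02',
    '2011-01-04 09:15:03',
])})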


1 Answer


I converted the datetime column to timedeltas in milliseconds. The raw numbers were too large, so I first shifted the years down by 2011 - 1970, close to the epoch. Then I computed the per-group offsets and added them to the millisecond values: df['one'] = df['one'] - df['new'] + df['timedelta']. Finally the integer milliseconds were converted back to timedeltas and the removed 2011 - 1970 years were added back.

#                 time
#0  2011-01-04 09:14:00
#1  2011-01-04 09:14:00
#2  2011-01-04 09:14:01
#3  2011-01-04 09:14:01
#4  2011-01-04 09:14:01
#5  2011-01-04 09:14:01
#6  2011-01-04 09:14:01
#7  2011-01-04 09:15:02
#8  2011-01-04 09:15:02
#9  2011-01-04 09:15:02
#10 2011-01-04 09:15:03
#time    datetime64[ns]

import numpy as np
import pandas as pd

# shift the years down towards the epoch so the millisecond numbers stay small
df['time1'] = df['time'].apply(lambda x: x - pd.DateOffset(years=2011 - 1970))
# express the shifted times as milliseconds since the epoch
df['timedelta'] = (df['time1'] - pd.Timestamp(0)) / np.timedelta64(1, 'ms')
df['one'] = 1
# per-row spacing: mean/sum of a column of ones is 1/n for a group of n duplicates
m = lambda x: x.mean() / x.sum()
df['one'] = df.groupby('time')['one'].transform(m)
# spacing in whole milliseconds
df['new'] = (df['one'] * 1000).astype(int)
# cumulative sum of the spacing within each group
df['one'] = df.groupby('time')['new'].cumsum()
# offsets 0, step, 2*step, ... added to the original millisecond values
df['one'] = df['one'] - df['new'] + df['timedelta']
# back from integer milliseconds to timedeltas
df['final'] = pd.to_timedelta(df['one'], unit='ms')
# re-add the years that were removed at the start
df['final'] = df['final'].apply(lambda x: pd.Timestamp(0) + x + pd.DateOffset(years=2011 - 1970))
# drop the helper columns
df = df.drop(['time1', 'timedelta', 'one', 'new'], axis=1)
print(df)
#                  time                   final
#0  2011-01-04 09:14:00 2011-01-04 09:14:00.000
#1  2011-01-04 09:14:00 2011-01-04 09:14:00.500
#2  2011-01-04 09:14:01 2011-01-04 09:14:01.000
#3  2011-01-04 09:14:01 2011-01-04 09:14:01.200
#4  2011-01-04 09:14:01 2011-01-04 09:14:01.400
#5  2011-01-04 09:14:01 2011-01-04 09:14:01.600
#6  2011-01-04 09:14:01 2011-01-04 09:14:01.800
#7  2011-01-04 09:15:02 2011-01-04 09:15:02.000
#8  2011-01-04 09:15:02 2011-01-04 09:15:02.333
#9  2011-01-04 09:15:02 2011-01-04 09:15:02.666
#10 2011-01-04 09:15:03 2011-01-04 09:15:03.000
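
For what it's worth, on more recent pandas the same even spacing can be computed more directly with groupby's cumcount and a transformed group size, skipping the epoch shift entirely. This is only a sketch of an equivalent approach, not part of the original code, and it keeps the exact fractions (e.g. .333333 instead of the millisecond-truncated .333 above):

import pandas as pd

# position of each row within its run of identical timestamps: 0, 1, 2, ...
pos = df.groupby('time').cumcount()
# number of rows sharing that timestamp
n = df.groupby('time')['time'].transform('size')
# shift the k-th duplicate by k/n of a second
df['final'] = df['time'] + pd.to_timedelta(pos / n, unit='s')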