0

I am working with an eye-tracking data set, and for some reason after the column df['timestamp'] surpasses 1,000,000 values in the dataframe are rounded off to the next 100. This is problematic, because the eye-tracker stores a new datapoint roughly at each increase of 20.

I managed to find a solution that works for me, but I was wondering if there is a vectorized method that is more elegant?

# create a variable that tracks the difference in time
df['dt'] = (df['timestamp'] - df['timestamp'].shift(1))

# I want to keep the old timestamps, so I make a new column
df['new_timestamp'] = df['timestamp']

for i in range(1,6):
df['new_timestamp'] = np.where(df['dt'] == 0,
                              df['new_timestamp'] + 20,
                              df['new_timestamp'])
df['dt'] = (df['new_timestamp'] - df['new_timestamp'].shift(1))

Edit:

To be more precise, certain values have a pattern like this:

Current      Corrected    
5113100.0    5113100.0
5113100.0    5113120.0
5113100.0    5113140.0
5113100.0    5113160.0
5113100.0    5113180.0
5113200.0    5113200.0
Bram Zijlstra
  • 390
  • 1
  • 5
  • 13

1 Answers1

1

You can make use of .diff() method which does the difference for you (just cleaner, not faster). You can then select all rows for which the difference is 0 and add 20 to them.

df['new_timestamp'] = df['timestamp']
occurrences = df.timestamp.groupby((df.timestamp != df.timestamp.shift()).cumsum()).cumcount()
df.loc[df['timestamp'].diff() == 0, 'new_timestamp'] += 20 * occurrences

EDIT

I edited the code to account for the multiple consecutive occurrences. The trick is to count how many consecutive 0 there are and add 20 times this number. The second line is tricky but very well explained in this post

An example:

>>>      timestamp   occurences   new_timestamp
443        9860          0             9860
444        9880          0             9880
445        9880          1             9900
446        9880          2             9920
447        9880          3             9940
448        9960          0             9960
449        9980          0             9980
Prikers
  • 858
  • 1
  • 9
  • 24
  • This only changes the values after the first row every time the difference is 0, but I have to change 4 consecutive rows each time. I added an example to make it clearer. – Bram Zijlstra Jan 12 '18 at 10:18
  • 1
    oh, I see! I will try another approach. – Prikers Jan 12 '18 at 10:34
  • This works exactly as I wanted, thanks! There is a small typo in your variable timesptamp but I couldn't edit due to a character minimum. And for people with the same problem: the link explains it very well. Highly recommended, I wouldn't have thought of that. – Bram Zijlstra Jan 12 '18 at 12:34
  • 1
    Thanks. I removed the typo. – Prikers Jan 12 '18 at 12:53