
I have a .csv with ~500k rows, with timestamps that look like this: 2021-02-01 00:00:29.159 UTC

I want to resample the data to every 300 milliseconds.

I convert the 'timestamp' column to datetime:

df.timestamp = pd.to_datetime(df.timestamp)

Now they look like this: 2021-02-01 00:00:29.159000+00:00
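As a quick sanity check after a conversion like this, it's worth confirming that nothing came back as NaT. A minimal illustration on made-up values (stripping the trailing ' UTC' before parsing, with errors='coerce' so failures become NaT instead of raising):

```python
import pandas as pd

# Illustrative sample: one good timestamp and one bad value that will become NaT
s = pd.Series(['2021-02-01 00:00:29.159 UTC', 'not a date'])
parsed = pd.to_datetime(s.str.replace(' UTC', '', regex=False),
                        errors='coerce', utc=True)

# Count rows that failed to parse
print(parsed.isna().sum())  # → 1
```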

Now I resample:

df = df.set_index(['timestamp']).resample("300ms").backfill()

and get error:

ValueError: cannot reindex a non-unique index with a method or limit

Which, I assume, means there are duplicate timestamps?
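That error does point at duplicate index labels: resample(...).backfill() has to reindex, and reindexing with a fill method requires a unique index. A minimal illustration on made-up data:

```python
import pandas as pd

# Made-up frame containing one duplicated timestamp
df = pd.DataFrame({
    'timestamp': pd.to_datetime(
        ['2021-02-01 00:00:29.159',
         '2021-02-01 00:00:29.159',   # duplicate of the row above
         '2021-02-01 00:00:29.459'],
        utc=True),
    'value': [1, 2, 3],
})

# duplicated() flags every row whose timestamp already appeared earlier
print(df['timestamp'].duplicated().sum())  # → 1
```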

So I drop_duplicates:

print(df.drop_duplicates(subset=['timestamp'], keep='first').duplicated().any())

and get:

False

Which is good, right? I run the resampling again and get the same error. So I build a quick check for duplicates:

duplicatedRows = df[df.duplicated(['timestamp'])]
print(duplicatedRows)

and it prints out the 22 duplicate rows. When I check the results, none of them look like duplicates of each other at all?

So my questions are: have I done this right? And what would be a better way of achieving my goal of resampling data like this to 300 ms (one row every 300 milliseconds)?

I am an intermediate programmer but new to python so most likely some simple issue

cheers

squashler

2 Answers


df.timestamp = pd.to_datetime(df.timestamp) failed to parse the value for me (I got NaT), so I converted the strings to ISO 8601 first:

import pandas as pd
from datetime import datetime

df = pd.DataFrame({'timestamp': ['2021-02-01T00:00:29.159 UTC', '2021-02-01T00:00:35.159 UTC']})
# Rewrite ' UTC' as 'Z' and the date/time separator as 'T' to get ISO 8601 strings
df['timestamp'] = df['timestamp'].apply(lambda row: row.replace(' UTC', 'Z').replace(' ', 'T'))
df['timestamp'] = df['timestamp'].apply(lambda ts: datetime.strptime(ts, '%Y-%m-%dT%H:%M:%S.%f%z'))
df = df.set_index('timestamp')
df = df.resample('300ms')
print(*df)

output:

Name: timestamp, dtype: datetime64[ns, UTC]
(Timestamp('2021-02-01 00:00:29.100000+0000', tz='UTC', freq='300L'), Empty DataFrame Columns: [] Index: [2021-02-01 00:00:29.159000+00:00])
(Timestamp('2021-02-01 00:00:29.400000+0000', tz='UTC', freq='300L'), Empty DataFrame Columns: [] Index: [])
(Timestamp('2021-02-01 00:00:29.700000+0000', tz='UTC', freq='300L'), Empty DataFrame Columns: [] Index: [])
... (seventeen more empty 300 ms bins) ...
(Timestamp('2021-02-01 00:00:35.100000+0000', tz='UTC', freq='300L'), Empty DataFrame Columns: [] Index: [2021-02-01 00:00:35.159000+00:00])
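Note that resample('300ms') on its own only returns a Resampler object, which is why the print above shows (bin label, group) pairs. To actually get one row per 300 ms you still need a fill or aggregation step such as bfill(). A sketch of the full pipeline on made-up data, including deduplication so the reindex doesn't fail:

```python
import pandas as pd

df = pd.DataFrame({
    'timestamp': ['2021-02-01 00:00:29.159 UTC',
                  '2021-02-01 00:00:29.159 UTC',   # duplicate row
                  '2021-02-01 00:00:30.287 UTC'],
    'value': [1.0, 2.0, 3.0],
})

# Strip the trailing ' UTC' and parse as timezone-aware datetimes
df['timestamp'] = pd.to_datetime(df['timestamp'].str.replace(' UTC', '', regex=False),
                                 utc=True)

# drop_duplicates returns a new frame, so the result must be reassigned
df = df.drop_duplicates(subset=['timestamp'], keep='first')

# With a unique index, backfilling onto a 300 ms grid works
out = df.set_index('timestamp').resample('300ms').bfill()
print(out)
```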

Golden Lion

I forgot to add inplace=True to the drop_duplicates call, which is why the duplicates weren't being removed.
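In other words, drop_duplicates returns a new DataFrame by default rather than modifying the existing one, so either reassign the result or pass inplace=True. Both options on toy data:

```python
import pandas as pd

df = pd.DataFrame({'timestamp': ['a', 'a', 'b']})

# Without reassignment (or inplace=True) the original frame is unchanged
df.drop_duplicates(subset=['timestamp'])
print(len(df))  # → 3

# Option 1: reassign the result
deduped = df.drop_duplicates(subset=['timestamp'])
print(len(deduped))  # → 2

# Option 2: mutate in place
df.drop_duplicates(subset=['timestamp'], inplace=True)
print(len(df))  # → 2
```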

squashler