I am new to data science and machine learning, and I am working on a project involving time series data from wearable devices (in a Python environment). I know the sampling frequency of each sensor modality on each device. Some of the devices provide their sensor data as datapoints with timestamps. For example, the dataset for the accelerometer sensor of one device includes the following columns:
- timestamp: a UNIX timestamp for each recorded datapoint
- ax: pitch
- ay: roll
- az: yaw
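For context, this is roughly how I load one of these files and convert the raw UNIX timestamps into pandas datetimes (the file name here is just a placeholder):

```python
import pandas as pd

# 'accelerometer.csv' is a placeholder for the device's export file.
df_temp = pd.read_csv('accelerometer.csv')

# Convert raw UNIX timestamps (seconds since the epoch) into
# timezone-aware pandas datetimes in UTC.
df_temp['timestamp'] = pd.to_datetime(df_temp['timestamp'], unit='s', utc=True)
```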
I am planning to smooth the accelerometer signal by applying a 1-minute centered moving average to the time series, along the lines of the sketch below.
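A minimal sketch of what I have in mind, assuming the dataframe is indexed by its timestamps (pandas supports center=True with time-based windows from version 1.3 on):

```python
# 1-minute centered moving average over a timestamp-indexed series.
smoothed = df_temp['acc_x'].rolling('60s', center=True).mean()
```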
However, I have noticed that there are sometimes gaps in the data, where a given time window contains fewer data points than the sampling rate would predict. There are also occasional cases where a window contains more data points than expected. If I apply a central moving average to this inconsistent data, each window will cover a varying number of samples, so the results will likely be wrong. A quick check I use to spot these seconds is shown below.
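For example, this is roughly how I find seconds with missing or surplus samples (the 100 Hz rate is just the one matching the 10 ms grid in my code below; it differs per device):

```python
# Count how many datapoints fall into each one-second bucket.
counts = df_temp.groupby(df_temp['timestamp'].dt.floor('s')).size()

expected = 100                      # assumed sampling rate in Hz
gaps = counts[counts < expected]    # seconds with missing samples
extras = counts[counts > expected]  # seconds with surplus samples
```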
To handle the gaps and extra data points in my accelerometer sensor data, I have come up with the following approach:

- Create a new pandas dataframe indexed by a regular date range spanning the first and last timestamps of the original dataset.
- Round the timestamps of both dataframes to the nearest second.
- Take the starting timestamp of the inconsistent dataset (rounded to the nearest second) and select all data points that share that rounded timestamp (there may be fewer or more than expected). Copy them into the rows of the new dataframe that have the same timestamp.
- Repeat this by looping through the inconsistent dataset, adding one second to the current time at each iteration.

This is my current implementation:
```python
import datetime
import pandas as pd

# Get the start and end times from the inconsistent dataframe
start_time = df_temp.loc[0, 'timestamp'].replace(tzinfo=datetime.timezone.utc)
end_time = df_temp.loc[df_temp.index[-1], 'timestamp'].replace(tzinfo=datetime.timezone.utc)

# Create a new dataframe on a regular 10 ms grid (100 Hz)
date_range = pd.date_range(start=start_time, end=end_time, freq='10ms')
df = pd.DataFrame(index=date_range,
                  columns=['timestamp', 'acc_x_avg', 'acc_y_avg', 'acc_z_avg',
                           'gyro_x_avg', 'gyro_y_avg', 'gyro_z_avg'])

# Round both dataframes' timestamps to the nearest second for matching
df['timestamp'] = df.index
df['timestamp'] = pd.to_datetime(df['timestamp']).dt.round('s').dt.strftime('%Y-%m-%d %H:%M:%S')
df_temp['timestamp_iso'] = df_temp['timestamp']  # keep the original timestamps
df_temp['timestamp'] = pd.to_datetime(df_temp['timestamp']).dt.round('s').dt.strftime('%Y-%m-%d %H:%M:%S')

current_second = df.loc[df.index[0], 'timestamp']
current_second_iso = datetime.datetime.strptime(current_second, '%Y-%m-%d %H:%M:%S')
end_second = df.loc[df.index[-1], 'timestamp']
end_second_iso = datetime.datetime.strptime(end_second, '%Y-%m-%d %H:%M:%S')

frequency = 100  # expected samples per second (10 ms grid)
progress = 0     # rows of df covered so far, for the progress printout

while current_second_iso <= end_second_iso:  # <= so the final second is processed too
    progress += frequency
    # Select both dataframes' indices for the current second
    current_second_indices = df_temp.loc[df_temp['timestamp'] == str(current_second_iso)].index  # integers
    current_second_new_indices = df.loc[df['timestamp'] == str(current_second_iso)].index  # datetimes
    # file_path is defined elsewhere (the path of the file being processed)
    print(file_path, f'{progress / len(df) * 100:.1f}%')
    # Copy at most one datapoint into each slot of the regular grid;
    # surplus datapoints are dropped, missing ones leave NaN slots
    for i in range(len(current_second_new_indices)):
        if i < len(current_second_indices):
            df.loc[current_second_new_indices[i], 'timestamp'] = df_temp.loc[current_second_indices[i], 'timestamp']
            df.loc[current_second_new_indices[i], 'acc_x_avg'] = df_temp.loc[current_second_indices[i], 'acc_x']
            df.loc[current_second_new_indices[i], 'acc_y_avg'] = df_temp.loc[current_second_indices[i], 'acc_y']
            df.loc[current_second_new_indices[i], 'acc_z_avg'] = df_temp.loc[current_second_indices[i], 'acc_z']
            df.loc[current_second_new_indices[i], 'gyro_x_avg'] = df_temp.loc[current_second_indices[i], 'gyro_x']
            df.loc[current_second_new_indices[i], 'gyro_y_avg'] = df_temp.loc[current_second_indices[i], 'gyro_y']
            df.loc[current_second_new_indices[i], 'gyro_z_avg'] = df_temp.loc[current_second_indices[i], 'gyro_z']
    current_second_iso += datetime.timedelta(seconds=1)
```
This works, but the second-by-second loop with .loc assignments is very slow on long recordings, so I believe there must be faster ways to handle this issue. One direction I have been wondering about is sketched below.
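Purely as a sketch (untested, and assuming df_temp still has its original datetime timestamp column), I have been wondering whether pandas' resample could replace the loop entirely: it would average any surplus datapoints that land in the same 10 ms slot and leave NaNs where the device produced nothing.

```python
# Untested sketch: snap the datapoints onto the regular 10 ms grid in
# one vectorized step instead of looping second by second.
regular = (
    df_temp.set_index('timestamp')
           .sort_index()[['acc_x', 'acc_y', 'acc_z', 'gyro_x', 'gyro_y', 'gyro_z']]
           .resample('10ms')
           .mean()  # surplus points in a slot are averaged; empty slots become NaN
)

# The 1-minute centered moving average then runs on a regular grid;
# min_periods=1 lets windows that overlap a gap still produce a value.
smoothed = regular.rolling('60s', center=True, min_periods=1).mean()
```

Would something along these lines be a sound replacement for the loop, or is there a more standard way to regularize this kind of sensor data before smoothing?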