The data in test.csv
likes this:
device_id,upload_time,latitude,longitude,mileage,other_vals,speed,upload_time_add_8hour,upload_time_year_month,car_id,car_type,car_num,marketer_name
1101,2020-09-30 16:03:41+00:00,46.7242,131.140233,0,,0,2020/10/1 0:03:41,202010,18,1,,
1101,2020-09-30 16:08:41+00:00,46.7242,131.140233,0,,0,2020/10/1 0:08:41,202010,18,1,,
1101,2020-09-30 16:13:41+00:00,46.7242,131.140233,0,,0,2020/10/1 0:13:41,202010,18,1,,
1101,2020-09-30 16:18:41+00:00,46.7242,131.140233,0,,0,2020/10/1 0:18:41,202010,18,1,,
1101,2020-10-02 08:19:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:19:41,202010,18,1,,
1101,2020-10-02 08:24:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:24:41,202010,18,1,,
1101,2020-10-02 08:29:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:29:41,202010,18,1,,
1101,2020-10-02 08:34:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:34:41,202010,18,1,,
1101,2020-10-02 08:39:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:39:41,202010,18,1,,
1101,2020-10-02 08:44:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:44:41,202010,18,1,,
1101,2020-10-02 08:49:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:49:41,202010,18,1,,
1101,2020-10-06 11:11:10+00:00,46.7245,131.14015,0.1,,2.1,2020/10/6 19:11:10,202010,18,1,,
1101,2020-10-06 11:16:10+00:00,46.7245,131.14015,0.1,,2.2,2020/10/6 19:16:10,202010,18,1,,
1101,2020-10-06 11:21:10+00:00,46.7245,131.14015,0.1,,3.84,2020/10/6 19:21:10,202010,18,1,,
1101,2020-10-06 16:46:10+00:00,46.7245,131.14015,0,,0,2020/10/7 0:46:10,202010,18,1,,
1101,2020-10-07 04:44:27+00:00,46.724366,131.1402,1,,0,2020/10/7 12:44:27,202010,18,1,,
1101,2020-10-07 04:49:27+00:00,46.724366,131.1402,1,,0,2020/10/7 12:49:27,202010,18,1,,
1101,2020-10-07 04:54:27+00:00,46.724366,131.1402,1,,0,2020/10/7 12:54:27,202010,18,1,,
1101,2020-10-07 04:59:27+00:00,46.724366,131.1402,1,,0,2020/10/7 12:59,202010,18,1,,
1101,2020-10-07 05:04:27+00:00,46.724366,131.1402,1,,0,2020/10/7 13:04:27,202010,18,1,,
I use this code to get the data with the speed is 0 in the dataframe, and then group the dataframe according to latitude, longitude,year,month and day.
After grouping, get the first upload_time_add_8hour
and the last upload_time_add_8hour
of each group. If the difference more than 5 minutes between the first upload_time_add_8hour
and the last upload_time_add_8hour
, get the first row of data for each group, and finally save these data to csv.
I think my code is not concise enough.
I use df_first_row = sub_df.iloc[0:1,:]
to get the first row in the dataframe, I use upload_time_add_8hour_first = sub_df['upload_time_add_8hour'].iloc[0]
and upload_time_add_8hour_last = sub_df['upload_time_add_8hour'].iloc[-1]
to get the first element and the last element of a specific column.
Is there any more suitable way?
My code:
import pandas as pd
device_csv_name = r'E:/test.csv'
df = pd.read_csv(device_csv_name, parse_dates=[7], encoding='utf-8', low_memory=False)
df['upload_time_year_month_day'] = df['upload_time_add_8hour'].dt.strftime('%Y%m%d')
df['upload_time_year_month_day'] = df['upload_time_year_month_day'].astype(str)
df_speed0 = df[df['speed'].astype(float) == 0.0] #Get data with speed is 0.0
gb = df_speed0.groupby(['latitude', 'longitude', 'upload_time_year_month_day'])
sub_dataframe_list = []
for i in gb.indices:
sub_df = pd.DataFrame(gb.get_group(i))
sub_df = sub_df.sort_values(by=['upload_time_add_8hour'])
count_row = sub_df.shape[0] #get row count
if count_row>1: #each group must have more then 1 row
upload_time_add_8hour_first = sub_df['upload_time_add_8hour'].iloc[0] # get first upload_time_add_8hour
upload_time_add_8hour_last = sub_df['upload_time_add_8hour'].iloc[-1] # get last upload_time_add_8hour
minutes_diff = (upload_time_add_8hour_last - upload_time_add_8hour_first).total_seconds() / 60.0
if minutes_diff >= 5: # if minutes_diff>5,append the first row of dataframe to sub_dataframe_list
df_first_row = sub_df.iloc[0:1,:]
sub_dataframe_list.append(df_first_row)
if sub_dataframe_list:
result = pd.concat(sub_dataframe_list,ignore_index=True)
result = result.sort_values(by=['upload_time'])
result.to_csv(r'E:/for_test.csv', index=False, mode='w', header=True,encoding='utf-8')