0

The data in test.csv likes this:

device_id,upload_time,latitude,longitude,mileage,other_vals,speed,upload_time_add_8hour,upload_time_year_month,car_id,car_type,car_num,marketer_name
1101,2020-09-30 16:03:41+00:00,46.7242,131.140233,0,,0,2020/10/1 0:03:41,202010,18,1,,
1101,2020-09-30 16:08:41+00:00,46.7242,131.140233,0,,0,2020/10/1 0:08:41,202010,18,1,,
1101,2020-09-30 16:13:41+00:00,46.7242,131.140233,0,,0,2020/10/1 0:13:41,202010,18,1,,
1101,2020-09-30 16:18:41+00:00,46.7242,131.140233,0,,0,2020/10/1 0:18:41,202010,18,1,,
1101,2020-10-02 08:19:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:19:41,202010,18,1,,
1101,2020-10-02 08:24:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:24:41,202010,18,1,,
1101,2020-10-02 08:29:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:29:41,202010,18,1,,
1101,2020-10-02 08:34:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:34:41,202010,18,1,,
1101,2020-10-02 08:39:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:39:41,202010,18,1,,
1101,2020-10-02 08:44:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:44:41,202010,18,1,,
1101,2020-10-02 08:49:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:49:41,202010,18,1,,
1101,2020-10-06 11:11:10+00:00,46.7245,131.14015,0.1,,2.1,2020/10/6 19:11:10,202010,18,1,,
1101,2020-10-06 11:16:10+00:00,46.7245,131.14015,0.1,,2.2,2020/10/6 19:16:10,202010,18,1,,
1101,2020-10-06 11:21:10+00:00,46.7245,131.14015,0.1,,3.84,2020/10/6 19:21:10,202010,18,1,,
1101,2020-10-06 16:46:10+00:00,46.7245,131.14015,0,,0,2020/10/7 0:46:10,202010,18,1,,
1101,2020-10-07 04:44:27+00:00,46.724366,131.1402,1,,0,2020/10/7 12:44:27,202010,18,1,,
1101,2020-10-07 04:49:27+00:00,46.724366,131.1402,1,,0,2020/10/7 12:49:27,202010,18,1,,
1101,2020-10-07 04:54:27+00:00,46.724366,131.1402,1,,0,2020/10/7 12:54:27,202010,18,1,,
1101,2020-10-07 04:59:27+00:00,46.724366,131.1402,1,,0,2020/10/7 12:59,202010,18,1,,
1101,2020-10-07 05:04:27+00:00,46.724366,131.1402,1,,0,2020/10/7 13:04:27,202010,18,1,,

I use this code to get the data with the speed is 0 in the dataframe, and then group the dataframe according to latitude, longitude,year,month and day.

After grouping, get the first upload_time_add_8hour and the last upload_time_add_8hour of each group. If the difference more than 5 minutes between the first upload_time_add_8hour and the last upload_time_add_8hour, get the first row of data for each group, and finally save these data to csv.

I think my code is not concise enough.

I use df_first_row = sub_df.iloc[0:1,:] to get the first row in the dataframe, I use upload_time_add_8hour_first = sub_df['upload_time_add_8hour'].iloc[0] and upload_time_add_8hour_last = sub_df['upload_time_add_8hour'].iloc[-1] to get the first element and the last element of a specific column.

Is there any more suitable way?

My code:

import pandas as pd

device_csv_name = r'E:/test.csv'
df = pd.read_csv(device_csv_name, parse_dates=[7], encoding='utf-8', low_memory=False)
df['upload_time_year_month_day'] = df['upload_time_add_8hour'].dt.strftime('%Y%m%d')
df['upload_time_year_month_day'] = df['upload_time_year_month_day'].astype(str)
df_speed0 = df[df['speed'].astype(float) == 0.0] #Get data with speed is 0.0
gb = df_speed0.groupby(['latitude', 'longitude', 'upload_time_year_month_day'])
sub_dataframe_list = []
for i in gb.indices:
    sub_df = pd.DataFrame(gb.get_group(i))
    sub_df = sub_df.sort_values(by=['upload_time_add_8hour'])
    count_row = sub_df.shape[0] #get row count
    if count_row>1: #each group must have more then 1 row
        upload_time_add_8hour_first = sub_df['upload_time_add_8hour'].iloc[0]  # get first upload_time_add_8hour
        upload_time_add_8hour_last = sub_df['upload_time_add_8hour'].iloc[-1]  # get last upload_time_add_8hour
        minutes_diff = (upload_time_add_8hour_last - upload_time_add_8hour_first).total_seconds() / 60.0
        if minutes_diff >= 5: # if minutes_diff>5,append the first row of dataframe to sub_dataframe_list
            df_first_row  = sub_df.iloc[0:1,:]
            sub_dataframe_list.append(df_first_row)

if sub_dataframe_list:
    result = pd.concat(sub_dataframe_list,ignore_index=True)
    result = result.sort_values(by=['upload_time'])
    result.to_csv(r'E:/for_test.csv', index=False, mode='w', header=True,encoding='utf-8')

2 Answers2

1

To get the first and last element of the column, your option is already the most efficient/correct way. If you're interested in this topic, I can recommend you to read this other Stackoverflow answer: https://stackoverflow.com/a/25254087/8294752

To get the first row, I personally prefer to use DataFrame.head(1), therefore for your code something like this:

df_first_row = sub_df.head(1)

I didn't look into how the head() method is defined in Pandas and its performance implications, but in my opinion it improves readability and reduces some potential confusion with indexes.

In other examples you might also find sub_df.iloc[0], but this option will return a pandas.Series which has as indexes the DataFrame column names. sub_df.head(1) will return a 1-row DataFrame instead, which is the same result as sub_df.iloc[0:1,:]

arabinelli
  • 1,006
  • 1
  • 8
  • 19
0

Your way out is either groupby().agg or df. agg

If you need it it as per device you can

#sub_df.groupby('device_id')['upload_time_add_8hour'].agg(['first','last'])


sub_df.groupby('device_id')['upload_time_add_8hour'].agg([('upload_time_add_8hour_first','first'),('upload_time_add_8hour_last ','last')]).reset_index()


device_id upload_time_add_8hour_first    upload_time_add_8hour_last 
0       1101              10/1/2020 0:03             10/7/2020 13:04

If you do not want it as per device, maybe try

sub_df['upload_time_add_8hour'].agg({'upload_time_add_8hour_first': lambda x: x.head(1),'upload_time_add_8hour_last': lambda x: x.tail(1)})

upload_time_add_8hour_first  0      10/1/2020 0:03
upload_time_add_8hour_last   19    10/7/2020 13:04
wwnde
  • 26,119
  • 6
  • 18
  • 32