
I have a pandas data frame, df, that has second-to-second data (Longitude, Latitude, etc.) for each driver. The data frame consists of several trips. There is a feature called Event_Type that can be used to determine the start and end of trips:

ignitionOnList = df[df['Event_Type'] == 'Ignition On'].index.tolist()
ignitionOffList = df[df['Event_Type'] == 'Ignition Off'].index.tolist()

So, imagine I have 5 trips in this data frame; the length of ignitionOnList and ignitionOffList would then be 5. I'd like to analyze each trip separately and store the results in a pandas data frame. Here's what I do:

dfTrips = pd.DataFrame({'Date' : [], 'Vehicle' : [], 'Trip_Number' : [], 'Start_Time' : [], 'Finish_Time' : [],
                    'Duration' : [], 'Collision': [], 'Harsh_Steering' : [], 'Harsh_Deceleration' : [],
                    'Harsh_Acceleration' : [], 'Harsh_Preferred_Speed' : []})
for tripNumber in range(len(ignitionOnList)):
    # .loc slicing is label-inclusive, so the 'Ignition Off' row is already included (no +1 needed)
    dfTemp = df.loc[ignitionOnList[tripNumber]:ignitionOffList[tripNumber]]
    # Doing stuff to this temporary data frame and storing the results, for example:
    # (iloc is used because dfTemp keeps the original index, which no longer starts at 0)
    dfTrips.loc[tripNumber,'Start_Time'] = dfTemp['Time'].iloc[0].strftime("%H:%M:%S")
    dfTrips.loc[tripNumber,'Finish_Time'] = dfTemp['Time'].iloc[-1].strftime("%H:%M:%S")
    # Using functions I have defined, e.g. `get_steering_risk`, to get risky behaviour for each trip
    dfTrips.loc[tripNumber,'Harsh_Deceleration'] = get_deceleration_risk(dfTemp)
    dfTrips.loc[tripNumber,'Harsh_Steering'] = get_steering_risk(dfTemp)

This works, but I am guessing there are better ways to do this in Python without for loops. I am not sure I can simply use apply, because I am not applying the same function to the whole data frame.

An alternative might be to redefine the functions so that they produce a column in df, apply them to the whole data frame, and then aggregate the results for each trip. For example, the get_steering_risk function could be redefined to produce a 0 or 1 for each second in df, and the percentage of 1s within each trip would then be Harsh_Steering in dfTrips. A sketch of this idea is shown below. However, some functions cannot be applied to the whole data frame at once. For example, one function regresses velocity versus acceleration, and that has to be done trip by trip. What is the best way to approach this? Thanks.
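
Something like this sketch is what I have in mind (`Steering_Rate` and `STEERING_THRESHOLD` are made up for illustration; only `Event_Type` exists in the real data):

df['Trip_Number'] = (df['Event_Type'] == 'Ignition On').cumsum()  # label every row with its trip
# 'Steering_Rate' and STEERING_THRESHOLD are hypothetical placeholders
df['Harsh_Steering_Flag'] = (df['Steering_Rate'].abs() > STEERING_THRESHOLD).astype(int)
# fraction of risky seconds per trip, i.e. the percentage of 1s
harsh_steering = df.groupby('Trip_Number')['Harsh_Steering_Flag'].mean()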

ahoosh
  • Is your second-to-second dataframe very large? I often find it more memory efficient to open the big df once and then do all my analysis on small temporary dfs, just like in your example. – andrew May 26 '16 at 16:27
  • Also, the reading side of the loop does not look very inefficient. Each block of rows that goes into a dfTemp is only read once. – andrew May 26 '16 at 16:28
  • @andrew The `df` data frame is not large at all. It comes from a 2 MB CSV file. However, I have thousands of these files that need to be fed into my code; that's why I am trying to make the code more efficient if possible. – ahoosh May 26 '16 at 16:31
  • Per my answer below, I think you will see much worse hits to performance due to building `dfTrips` row-by-row than you will from the loop. The Pandas `read_csv` parser is also blazingly fast. I wouldn't worry about the I/O overhead. – andrew May 26 '16 at 16:36

2 Answers


I suspect that any performance issues are actually due to the way you are growing dfTrips. I have found it orders of magnitude faster to create many small, even single-row (or single-column), data frames and then join them all with pd.concat than to try to grow one df row by row.

I asked about a similar problem. Check out how much faster concat is in the accepted answer.

Creating large Pandas DataFrames: preallocation vs append vs concat
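
As a rough sketch of what that pattern looks like for your loop (reusing the names from your question; the measures shown are placeholders for whatever you compute):

import pandas as pd

rows = []  # collect one small frame per trip instead of growing dfTrips in place
for tripNumber in range(len(ignitionOnList)):
    dfTemp = df.loc[ignitionOnList[tripNumber]:ignitionOffList[tripNumber]]
    rows.append(pd.DataFrame({
        'Trip_Number': [tripNumber],
        'Start_Time': [dfTemp['Time'].iloc[0].strftime("%H:%M:%S")],
        'Harsh_Deceleration': [get_deceleration_risk(dfTemp)],
        'Harsh_Steering': [get_steering_risk(dfTemp)],
    }))
dfTrips = pd.concat(rows, ignore_index=True)  # build the output once, at the end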

[EDIT] Here is an illustration of why overwriting the temp df every iteration and appending it to a list won't work (refer to comments below):

import numpy as np
import pandas as pd

df = pd.DataFrame(columns=np.arange(5))
df_list = []
for i in np.arange(5):
    df.loc[0, :] = i
    df_list.append(df)  # appends a reference to the *same* object every time
for d in df_list:
    print(d)  # every entry shows the final state of df (all values 4)
    print()
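
If you do want to reuse a single temp df, appending a copy breaks the aliasing (a minimal fix, though building a fresh frame each iteration is cleaner):

df_list.append(df.copy())  # snapshot the current state, not the shared reference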
andrew
  • That makes sense. I will give it a try. So, in each step of the for loop, I would define `dfTripsStep = pd.DataFrame({})`, make one row in it using something like `dfTripsStep.loc[0,'Start_Time'] = '3:00'`, then at the end of each step do `dfTrips.append(dfTripsStep)`, and finally outside the for loop do `pd.concat(dfTrips)`. Does it sound reasonable to make these 1-row `dfTripsStep` frames this way at each step, or is there a better way to do it? – ahoosh May 26 '16 at 16:49
  • Using the method above, the time for processing 13 files decreases by `2.9%`, which is good but not very significant. I assume the difference will be more significant if I have more files and `dfTrips` has more rows. – ahoosh May 26 '16 at 17:05
  • Is `dfTrips` a list in your version? It should not be a df (there is a `df.append` method, which is expensive). – andrew May 26 '16 at 17:09
  • I would also avoid growing `dfTripsStep` column-by-column. I would write a helper to initialize it all at once: `temp_dat = generate_measures(dfTemp)` # produces all your measures, same length as cols – andrew May 26 '16 at 17:11
  • `dfTripsStep = pd.DataFrame(data=temp_dat, columns=cols)` – andrew May 26 '16 at 17:11
  • Yes, `dfTrips` is now an empty list: `dfTrips = []` outside of the for loop. – ahoosh May 26 '16 at 17:22
  • I define an empty `dfTripsStep` with defined columns once outside of the for loop. Then, in the for loop, I replace the values in the `0th` row instead of defining a new one each time. That's faster, I assume? – ahoosh May 26 '16 at 17:23
  • I see your thinking, but the problem is that when you append dfTripsStep to your list, you are simply adding the same pointer over and over, while the object that pointer references changes with each iteration. Thus, in the end your list holds many references to the same df. – andrew May 26 '16 at 17:32
  • I posted an illustration in the answer above. – andrew May 26 '16 at 17:34
  • Great. I made an empty list `dfTrips = []` outside the for loop. This time I made a list `temp_dat = [1,2,3 ...]` from my features and then used `dfTripsStep = pd.DataFrame(data=[temp_dat], columns=cols)` and then `dfTrips.append(dfTripsStep)` and finally outside of the for loop, used `dfTrips = pd.concat(dfTrips)`. No improvement! – ahoosh May 26 '16 at 18:04
  • Interesting. In my past testing I have found `concat` to be about 130x faster than `append`. From the docs, expanding by `loc` seems to be equivalent to using the append method. So maybe the files are too small to notice the difference, or maybe they have improved the method's performance. – andrew May 26 '16 at 18:11

I am not sure it will save time, but you can (mostly) avoid the loop by using groupby. First, define a new column, say trip_number, to index each unique trip; a loop-free way to build it is sketched below. Then group by trip_number.
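
For example, assuming every trip begins with an 'Ignition On' row, a cumulative count of trip starts labels each row without an explicit loop:

df['trip_number'] = (df['Event_Type'] == 'Ignition On').cumsum()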

You could use apply to apply a single function to each group individually.

Finally you would use a horizontal concat to join them into your output df.

Refer to the 'Flexible apply' section of the docs.

grouped = df.groupby('trip_number')
decel_df = grouped.apply(get_deceleration_risk)
steer_df = grouped.apply(get_steering_risk)
...
dfTrips = pd.concat([decel_df, steer_df, ...], axis=1)
dfTrips.columns = ['Harsh_Deceleration', 'Harsh_Steering', ...]
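
This also covers the measures that must be computed trip by trip, like your velocity-versus-acceleration regression, since the function passed to apply sees one trip's rows at a time. A hypothetical sketch, assuming 'Velocity' and 'Acceleration' columns exist:

import numpy as np

def accel_velocity_slope(trip):
    # fit a line to one trip's velocity/acceleration points
    slope, intercept = np.polyfit(trip['Velocity'], trip['Acceleration'], 1)
    return slope

slope_df = grouped.apply(accel_velocity_slope)  # one fitted slope per trip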
andrew
  • If you try this method, please report back on the speed. I am very curious as to whether it is significantly different. – andrew May 26 '16 at 18:24
  • I tried this method and it works. It's slightly faster than the for loop. The thing is, I have 13 files with 5-6 trips in each file; I think if I had more trips per file, this method would be significantly faster. Regardless, this is a cleaner method for sure. Thanks. – ahoosh May 26 '16 at 20:46
  • Interesting. I will have to do some time testing on larger dfs. Thanks for reporting back. – andrew May 26 '16 at 22:21