How can I replace a loop to filter, splice, and process a dataframe to improve performance?

Question

I'd like to improve my code so that it's more pythonic and enhances the speed at which data is processed. The current code works, but I'm sure this can be improved somewhat. The .csv file is 702 MB, so it's taking about 7-10 minutes for me to get to the end result:

def delayed_vs_punctual(self, df):
    filtered_for_carriers = df['UniqueCarrier']
    number_of_entries_each_carrier = filtered_for_carriers.value_counts()
    carriers = number_of_entries_each_carrier.index

    percent_delayed_all = []
    for carrier in carriers:  
        total_number_of_carrier = number_of_entries_each_carrier[carrier]
        mask = df.loc[df['UniqueCarrier'] == carrier]

        d = 0
        for index, row in mask.iterrows():
            ArrDelay = row['ArrDelay']
            if ArrDelay > 0:
                d += 1
            else:
                pass
        percent_delayed = d/total_number_of_carrier
        percent_delayed_all.append(percent_delayed)

    percentage_delay_dict = dict(zip(carriers, percent_delayed_all))  

    return percent_delayed_all

I'm quite sure looping is not the best method. Anyway, sample data:

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2007,1,1,1,1232,1225,1341,1340,WN,2891,N351,69,75,54,1,7,SMF,ONT,389,4,11,0,,0,0,0,0,0,0
2007,1,1,1,1918,1905,2043,2035,WN,462,N370,85,90,74,8,13,SMF,PDX,479,5,6,0,,0,0,0,0,0,0
2007,1,1,1,2206,2130,2334,2300,WN,1229,N685,88,90,73,34,36,SMF,PDX,479,6,9,0,,0,3,0,0,0,31
2007,1,1,1,1230,1200,1356,1330,WN,1355,N364,86,90,75,26,30,SMF,PDX,479,3,8,0,,0,23,0,0,0,3

I'm filtering through and splicing the DF so that it is grouped into airlines (UniqueCarrier). Then in each of these new mini-DFs (still pretty large though), I'm checking each row for a certain condition, like if there was a delay. Then a percentage is calculated (delayed flights against total flights for that particular airline). The end result is a dict:

percent_delayed_all = {'YV': 0.42212989448366295, 'US': 0.53435287477314719, 'MQ': 0.46239551225360503, 'AA': 0.49731090766529357, \
    'FL': 0.43394297743949478, 'NW': 0.56168732479989192, 'HA': 0.25596795727636851, 'F9': 0.50444967266775775, \
    'WN': 0.41947657183726861, 'OH': 0.50945518784192445, 'OO': 0.46118130333410273, '9E': 0.41249599190267761, \
    'B6': 0.45879864194306608, 'UA': 0.47631438239027596, 'AS': 0.47851546649186877, 'CO': 0.45207967792146703, \
    'AQ': 0.27577653149266607, 'XE': 0.40724700015870352, 'EV': 0.52604861756464993, 'DL': 0.45727049795225355}

From the dict you can see the airline as the key, and the percentage of delayed flights, so FL would be 43%, for example. Delayed in this case means > 0 minutes more than the expected arrival time.

Please read [how to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and edit your post correspondingly. — MaxU - stand with Ukraine, Sep 21 '17 at 20:03

score 1 · Accepted Answer · answered Sep 21 '17 at 20:23

1

df.ArrDelay.gt(0).groupby(df.UniqueCarrier).mean().to_dict()

answered Sep 21 '17 at 20:23

piRSquared

285,575
57
475
624

I'm not sure why this question was marked down, since the fundamental reason behind my question is related to optimising the performance in pandas using a different method to just looping through. Anyway thank @piRSquared. My method yielded a processing time of 595.2 s, whereas piRSquared's method was just 0.3 s. – pymat Sep 25 '17 at 04:42
@piRSqaured: just one issue with your answer: why is there the mean()? – pymat Sep 25 '17 at 05:15
It's a convenient way to divide the count of delays by the total number – piRSquared Sep 25 '17 at 05:21
I thought mean returned the average (mean)? I'd like the percentage (count of delays for an airline / total count of flights for that airline). I understand that I'd have to perform two lines with "groupby". – pymat Sep 25 '17 at 05:27
When performing the `gt(0)` we get a series of boolean values (`True`/`False`) indicating there was a delay when `True` and not otherwise. Turns out that boolean values are a subclass of integers and `True` evaluates to `1` and `False` evaluates to `0` in an integer or float context. So when I call `mean` on a group of `True`/`False`, its as if I'm calling `mean` on a group of `1`/`0` and if I look at what this formulates to, I'll see that it is the sum of the number of times `ArrDelay` is greater than zero divided by the the total number of times the airline flew. Exactly what's needed. – piRSquared Sep 25 '17 at 05:45
Perfect! Thank you for the explanation. I'd mark you up twice if I could. – pymat Sep 25 '17 at 05:47
Glad I could help! – piRSquared Sep 25 '17 at 05:51

How can I replace a loop to filter, splice, and process a dataframe to improve performance?

1 Answers1