Python - speed up Pandas iteration

Question

Sorry for my bad english.

This is a simplified version of my dataframe:

d = {'League': {5697: 'Premier League', 5695: 'Premier League', 5694: 'Premier League', 5693: 'Premier League', 5692: 'Premier League', 5691: 'Premier League', 5696: 'Premier League', 5689: 'Premier League', 5688: 'Premier League', 5690: 'Premier League', 5684: 'Premier League', 5680: 'Premier League', 5681: 'Premier League', 5682: 'Premier League', 5686: 'Premier League', 5685: 'Premier League', 5687: 'Premier League', 5683: 'Premier League', 5678: 'Premier League', 5679: 'Premier League', 5677: 'Premier League', 5674: 'Premier League', 5676: 'Premier League', 5675: 'Premier League', 5673: 'Premier League', 5670: 'Premier League', 5672: 'Premier League', 5671: 'Premier League', 5665: 'Premier League', 5667: 'Premier League', 5666: 'Premier League', 5669: 'Premier League', 5664: 'Premier League', 5663: 'Premier League', 5662: 'Premier League', 5668: 'Premier League', 5661: 'Premier League', 5660: 'Premier League', 5654: 'Premier League', 5653: 'Premier League', 5655: 'Premier League', 5656: 'Premier League', 5657: 'Premier League', 5658: 'Premier League', 5659: 'Premier League', 5652: 'Premier League', 5651: 'Premier League', 5650: 'Premier League', 5649: 'Premier League', 5642: 'Premier League'}, 'Date': {5697: '2003-08-16 13:30', 5695: '2003-08-16 16:00', 5694: '2003-08-16 16:00', 5693: '2003-08-16 16:00', 5692: '2003-08-16 16:00', 5691: '2003-08-16 16:00', 5696: '2003-08-16 16:00', 5689: '2003-08-17 15:00', 5688: '2003-08-17 15:00', 5690: '2003-08-17 17:05', 5684: '2003-08-23 13:00', 5680: '2003-08-23 16:00', 5681: '2003-08-23 16:00', 5682: '2003-08-23 16:00', 5686: '2003-08-23 16:00', 5685: '2003-08-23 16:00', 5687: '2003-08-23 16:00', 5683: '2003-08-23 16:00', 5678: '2003-08-24 15:00', 5679: '2003-08-24 17:05', 5677: '2003-08-25 21:00', 5674: '2003-08-26 20:45', 5676: '2003-08-26 20:45', 5675: '2003-08-26 20:45', 5673: '2003-08-26 21:00', 5670: '2003-08-27 20:45', 5672: '2003-08-27 21:00', 5671: '2003-08-27 21:00', 5665: '2003-08-30 13:30', 5667: '2003-08-30 16:00', 5666: '2003-08-30 16:00', 5669: '2003-08-30 16:00', 5664: '2003-08-30 16:00', 5663: '2003-08-30 16:00', 5662: '2003-08-30 16:00', 5668: '2003-08-30 16:00', 5661: '2003-08-31 15:00', 5660: '2003-08-31 17:05', 5654: '2003-09-13 16:00', 5653: '2003-09-13 16:00', 5655: '2003-09-13 16:00', 5656: '2003-09-13 16:00', 5657: '2003-09-13 16:00', 5658: '2003-09-13 16:00', 5659: '2003-09-13 16:00', 5652: '2003-09-14 15:00', 5651: '2003-09-14 17:05', 5650: '2003-09-15 21:00', 5649: '2003-09-20 13:30', 5642: '2003-09-20 16:00'}, 'HomeTeam': {5697: 'Portsmouth', 5695: 'Leicester', 5694: 'Fulham', 5693: 'Blackburn', 5692: 'Birmingham', 5691: 'Arsenal', 5696: 'Manchester United', 5689: 'Leeds', 5688: 'Charlton', 5690: 'Liverpool', 5684: 'Newcastle', 5680: 'Bolton', 5681: 'Chelsea', 5682: 'Everton', 5686: 'Tottenham', 5685: 'Southampton', 5687: 'Wolves', 5683: 'Manchester City', 5678: 'Aston Villa', 5679: 'Middlesbrough', 5677: 'Blackburn', 5674: 'Leeds', 5676: 'Portsmouth', 5675: 'Leicester', 5673: 'Charlton', 5670: 'Arsenal', 5672: 'Manchester United', 5671: 'Liverpool', 5665: 'Everton', 5667: 'Newcastle', 5666: 'Middlesbrough', 5669: 'Wolves', 5664: 'Chelsea', 5663: 'Bolton', 5662: 'Aston Villa', 5668: 'Tottenham', 5661: 'Southampton', 5660: 'Manchester City', 5654: 'Blackburn', 5653: 'Arsenal', 5655: 'Bolton', 5656: 'Charlton', 5657: 'Chelsea', 5658: 'Everton', 5659: 'Southampton', 5652: 'Manchester City', 5651: 'Birmingham', 5650: 'Leicester', 5649: 'Wolves', 5642: 'Aston Villa'}, 'AwayTeam': {5697: 'Aston Villa', 5695: 'Southampton', 5694: 'Middlesbrough', 5693: 'Wolves', 5692: 'Tottenham', 5691: 'Everton', 5696: 'Bolton', 5689: 'Newcastle', 5688: 'Manchester City', 5690: 'Chelsea', 5684: 'Manchester United', 5680: 'Blackburn', 5681: 'Leicester', 5682: 'Fulham', 5686: 'Leeds', 5685: 'Birmingham', 5687: 'Charlton', 5683: 'Portsmouth', 5678: 'Liverpool', 5679: 'Arsenal', 5677: 'Manchester City', 5674: 'Southampton', 5676: 'Bolton', 5675: 'Middlesbrough', 5673: 'Everton', 5670: 'Aston Villa', 5672: 'Wolves', 5671: 'Tottenham', 5665: 'Liverpool', 5667: 'Birmingham', 5666: 'Leeds', 5669: 'Portsmouth', 5664: 'Blackburn', 5663: 'Charlton', 5662: 'Leicester', 5668: 'Fulham', 5661: 'Manchester United', 5660: 'Arsenal', 5654: 'Liverpool', 5653: 'Portsmouth', 5655: 'Middlesbrough', 5656: 'Manchester United', 5657: 'Tottenham', 5658: 'Newcastle', 5659: 'Wolves', 5652: 'Aston Villa', 5651: 'Fulham', 5650: 'Leeds', 5649: 'Chelsea', 5642: 'Charlton'}, 'GTOT': {5697: 3, 5695: 4, 5694: 5, 5693: 6, 5692: 1, 5691: 3, 5696: 4, 5689: 4, 5688: 3, 5690: 3, 5684: 3, 5680: 4, 5681: 3, 5682: 4, 5686: 3, 5685: 0, 5687: 4, 5683: 2, 5678: 0, 5679: 4, 5677: 5, 5674: 0, 5676: 4, 5675: 0, 5673: 4, 5670: 2, 5672: 1, 5671: 0, 5665: 3, 5667: 1, 5666: 5, 5669: 0, 5664: 4, 5663: 0, 5662: 4, 5668: 3, 5661: 1, 5660: 3, 5654: 4, 5653: 2, 5655: 2, 5656: 2, 5657: 6, 5658: 4, 5659: 2, 5652: 5, 5651: 4, 5650: 4, 5649: 5, 5642: 3}}

df = pd.DataFrame.from_dict(d)

Where GTOT is the total number of goals in the match.

For every team in the competition I want to find out the number of consecutive matches where GTOT is minor 1 and add it to a list.

I use itertuples to iterate over the dataframe, but it's very slow for a large number of matches (> 50000) and teams (> 100). This is a semplified version of my function:

def calculates_overdue():
    lstTeams = [team for team in df.HomeTeam.unique()]
    lstTeams.sort()
    lstOverdue = []
    for team in lstTeams:
        overdue = 0
        mask = (df.HomeTeam == team) | (df.AwayTeam == team)
        for row in df[mask].itertuples():
            if row.GTOT < 1:
                lstOverdue.append((row.League, team, overdue, row.Date))
                overdue = 0
            else:
                overdue += 1
    return lstOverdue

print(calculates_overdue())

Result:

[('Premier League', 'Aston Villa', 4, '2003-08-24 15:00'), ('Premier League', 'Birmingham', 2, '2003-08-23 16:00'), ('Premier League', 'Bolton', 1, '2003-08-30 16:00'), ('Premier League', 'Charlton', 2, '2003-08-30 16:00'), ('Premier League', 'Leeds', 2, '2003-08-26 20:45'), ('Premier League', 'Leicester', 2, '2003-08-26 20:45'), ('Premier League', 'Liverpool', 2, '2003-08-27 21:00'), ('Premier League', 'Liverpool', 0, '2003-08-24 15:00'), ('Premier League', 'Middlesbrough', 2, '2003-08-26 20:45'), ('Premier League', 'Portsmouth', 1, '2003-08-30 16:00'), ('Premier League', 'Southampton', 2, '2003-08-26 20:45'), ('Premier League', 'Southampton', 0, '2003-08-23 16:00'), ('Premier League', 'Tottenham', 2, '2003-08-27 21:00'), ('Premier League', 'Wolves', 2, '2003-08-30 16:00')]

I know it's bad iterate over rows. How can I speed up these loops? Is it possible vectorize this function?

more importantly, you can provide some code to construct a sample dataframe, so that people can not just look the dataframe printout, but also run your code. https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples — Evgeny, Jun 15 '18 at 13:01
@EvgenyPogrebnyak normally I would agree, but `pandas` has a nice builtin way to import dataframes from SO questions: `df = pd.read_clipboard(sep="\s\s+")` (I just had to adapt `sep` to not get fooled by single spaces in each cell) — filippo, Jun 15 '18 at 14:16
@filippo: even though `pd.read_clipboard` is a nice feature, providing runnable code seems a responsiility of the OP, with a sample data included. you would still persist the dataframe in code (eg with `.to_dict()`), if you are editing/running it locally, to make the example reproducible. I would agree a printout is better than nothing, but it does not make a perfect question that is easy to deal with. — Evgeny, Jun 15 '18 at 14:52
@EvgenyPogrebnyak completely agree! it's just I've seen a lot worse (screenshots?) here than I've grown to accept somewhat parsable printouts :-) — filippo, Jun 15 '18 at 15:00
Sorry for my bad code and thank you for your time! I added a sample of dataframe in dict form and the result. — Marco Montresor, Jun 15 '18 at 17:29
Did you want to count the number of consecutive non-scoring matches per team, or did you want a list of non-scoring matches per team? — aikramer2, Jun 15 '18 at 17:57
I want the number of consecutive non-scoring matches per team. — Marco Montresor, Jun 15 '18 at 18:54

sonus21 · Answer 1 · 2018-06-16T06:59:04.330

Use simple python call and remove some of pandas call.

def testTime():
        df = pd.DataFrame.from_dict(d)
        #print(df)
        #print(d.keys())
        lstOverdue = []
        lstTeams = [team for team in df.HomeTeam.unique()]
        for team in lstTeams:
                mask = (df.HomeTeam == team) | (df.AwayTeam == team)
                overdue = 0
                for row in df[mask].itertuples():
                        if row.GTOT < 1:
                                lstOverdue.append((row.League, team, overdue, row.Date))
                                overdue = 0
                        else:
                                overdue += 1
def testTime2():
        df = pd.DataFrame.from_dict(d)
        lstOverdue = []
        for team in df.HomeTeam.unique():
                overdue = 0
                for row in df.itertuples():
                        if row.HomeTeam == team or team == row.AwayTeam:
                                if row.GTOT < 1:
                                        lstOverdue.append((row.League, team, overdue, row.Date))
                                        overdue = 0
                                else:
                                        overdue += 1

if __name__ == '__main__':
        import timeit
        print(timeit.timeit("testTime2()", setup="from __main__ import testTime2", number=1000))
        print(timeit.timeit("testTime()", setup="from __main__ import testTime", number=1000))

see function testTime2, after running this code I observed around half of running time on my system with the data you've provided, but it can vary on your system as well.

$ python3.6 test.py
21.352469262998056
39.04840800600141
$ python3.6 test.py
22.977724283999123
39.568580347000534
$ python3.6 test.py
21.838805349998438
39.85868542999742

This can be further reduced to number of data rows but it requires some preprocessing as follows.

Convert dictionary to list of namedtuples during conversion create set of HomeTown.
Process the list similar to the function testTime2 process data frames.

@EvgenyPogrebnyak not sure why we have to do that, anyway changed to text. — sonus21, Jun 16 '18 at 06:59
Hmm...i try your funct, but without mask for df the execution time for DataFrame with 20000 rows increase x5. — Marco Montresor, Jun 16 '18 at 15:17

Python - speed up Pandas iteration

1 Answers1