3

I have a dataset of football matches as a Pandas dataframe in the form below:

    Home            Away             Result     HG  AG
0   Liverpool       Norwich City     Liverpool  4   1
1   West Ham        Man City         Man City   0   5
2   AFC Bournemouth Sheffield United Draw       1   1
3   Burnley         Southampton      Burnley    3   0
4   Crystal Palace  Everton          Draw       0   0

I'd like to track results by team in a dictionary of lists:
{'Liverpool': [W,W, ... ,W], 'West Ham': [W, D, L, ... ], ... } etc.

My approach naturally is to iterate over all the rows with conditionals (below is psuedo-code):

if df.Result == 'Draw':
    dict[df[Home]].append('D')
    dict[df[Away]].append('D')
elif df.Home == df.Result:
    dict[df[Home]].append('W')
    dict[df[Away]].append('L')
else:
    dict[df[Home]].append('L')
    dict[df[Away]].append('W')

I believe I can do this using df.iterrows() but I know this isn't the desired approach in general with Pandas. Is there a way I can do this kind of operation while taking advantage of the power of Pandas DataFrames? I have seen that df.Home == df.Result returns a series of True/False values but I don't know how to utilise this or extend it to the multiple conditions described above.

I have also seen np.where and np.select from this answer, but I don't see it being applicable to this situation where I want to do something based off the outcome of a conditional on each row, using an entry in the row as a key.

It feels as though iteration is the only solution here, I'm sure Pandas would support something like this but I have no idea how to search for it.

420fedoras
  • 315
  • 1
  • 4
  • 11

2 Answers2

3

You can naturally get a formula per team, something like this:

team_list = list(set(list(df.Home)+list(df.Away)))

d = {i:list(df.loc[(df.Home==i)|(df.Away==i),'Result'].map({i:'W','Draw':'D'}).fillna('L')) 
     for i in team_list}

Basically for each unique team name, we're getting a dataframe for games they were in, then taking the Results column and using map to turn their name into W and draw into D - anything else becomes nan so we use fillna to turn those into L and then use list().

out:

{'West Ham': ['L'],
 'Man City': ['W'],
 'Everton': ['D'],
 'Burnley': ['W'],
 'AFC Bournemouth': ['D'],
 'Liverpool': ['W'],
 'Norwich City': ['L'],
 'Southampton': ['L'],
 'Sheffield United': ['D'],
 'Crystal Palace': ['D']}

Timer Edit:

This solution seems ~10% slower than Chris A.'s, as tested on a random Home/Away/Result dataset with len 1,000,000:

%timeit d1(df) #d1 is Chris A.'s method
1.88 s ± 19.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit d2(df) #d2 is my dict comprehension
2.14 s ± 9.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Jim Eisenberg
  • 1,490
  • 1
  • 9
  • 17
  • 2
    This is a great one-liner and worked perfectly on my dataset, I could have easily accepted this answer too. I am interested how the two answers would compare speed-wise on larger data (my data is only 380 rows long because its for one season, but speed would be applicable on historic data). – 420fedoras Jan 24 '20 at 11:09
3

Here is one approach using DataFrame.melt, numpy.select and DataFrame.groupby with list aggregation.

N.B. I've added a rematch between "Liverpool" and "Norwich City" to your sample data to show how the output should look where a team appears more than once:

#Setup
df = pd.DataFrame({'Home': ['Liverpool', 'West Ham', 'AFC Bournemouth', 'Burnley', 'Crystal Palace', 'Norwich City'], 'Away': ['Norwich City', 'Man City', 'Sheffield United', 'Southampton', 'Everton', 'Liverpool'], 'Result': ['Liverpool', 'Man City', 'Draw', 'Burnley', 'Draw', 'Norwich City'], 'HG': [4, 0, 1, 3, 0, 4], 'AG': [1, 5, 1, 0, 0, 1]})

# Restructure the DataFrame into a long-form with "Melt"
df_melted = (df.reset_index()
             .melt(id_vars=['Result', 'index'],
                   value_vars=['Home', 'Away'])
             .sort_values('index')) # This maintains the match order of original df

# Use numpy.select to create your conditions and choices ('W', 'L' or 'D')
df_melted['outcome'] = np.select(
    condlist=[df_melted['Result'] == 'Draw',
              df_melted['Result'] == df_melted['value'],
              df_melted['Result'] != df_melted['value']],
    choicelist=['D', 'W', 'L'])


# Groupby team agg with list and return output as dict
df_melted.groupby('value', sort=False)['outcome'].apply(list).to_dict()

[out]

{'Liverpool': ['W', 'L'],
 'Norwich City': ['L', 'W'],
 'West Ham': ['L'],
 'Man City': ['W'],
 'AFC Bournemouth': ['D'],
 'Sheffield United': ['D'],
 'Burnley': ['W'],
 'Southampton': ['L'],
 'Crystal Palace': ['D'],
 'Everton': ['D']}
Chris Adams
  • 18,389
  • 4
  • 22
  • 39
  • 2
    Accepted answer because of the more detailed approach, I will look into melt, apply, and groupby for future applications. This method actually found a key in my (full) data that shouldn't exist, while Jim's did not, which I find quite interesting – 420fedoras Jan 24 '20 at 11:06