2

I have a dataframe of race results (where each race has 14 participants) that looks like this:

df = race_id  A0  B0  C0  A1  B1  C1  A2  B2  C2  ...  A13  B13  C13  WINNER
     1        2   3   0   9   1   3   4   5   1   ...  1    2    3    3
     2        1   5   2   7   3   2   8   6   0   ...  6    4    1    9
     .....

I want to train a multinomial logistic regression model on this data. However, as the data currently stands, the model would be sensitive to the order of the participants. For example, if the model is given the record

race_id  A0  B0  C0  A1  B1  C1  A2  B2  C2  ...  A13  B13  C13  WINNER
3        9   1   3   2   3   0   4   5   1   ...  1    2    3    3

which is just race 1 with the features of participants 0 and 1 swapped, the model would output a different prediction for the winner even though the input is effectively the same race.

So I want to generate 100 random permutations of each race in the data, with the same winner, to train the model to be robust to permutations. How can I create these 100 permuted samples for this dataframe (while preserving the A, B, C features of every racer)?

AspiringMat
  • What do you mean by *While preserving the A,B,C features of every racer*? Shouldn't they also be randomized? – OfirD Feb 10 '20 at 15:48
  • @HeyJude Meaning we're randomizing the ABC features in block, not individually. So for example, I do not swap the first racer A feature with the second racer A feature and leave B and C intact. They get permuted in blocks of *features*, so swapping A,B,C features of the first racer and the second racer is valid. – AspiringMat Feb 10 '20 at 17:14
  • Got you. another clarification, if you don't mind: *the model would output a different prediction for the winner even though the input is the same* - it's not clear to me why isn't that a valid permutation (or: why is it the *same input*? after all, it changes racer 0 and racer 1 data). bottom line, would be nice if you could give an example for a valid permutation. – OfirD Feb 10 '20 at 17:40
  • "it's not clear to me why isn't that a valid permutation". It is a valid permutation. Sorry, English is not my first language. I meant to say that this is a valid permutation (the example I showed) and I consider it the *same* input (i.e race_id 1 and 3 are the same races/inputs), but a regular regression model would treat them as different. – AspiringMat Feb 10 '20 at 17:47

2 Answers

1

Before we begin, this is not a good approach to modeling race outcomes.

However, if you want to do it anyway, you want to permute and remap the column names and then concatenate the resulting permutations. First, dynamically create a list of participants by parsing the column names:

participants = [col[1:] for col in df.columns if col.startswith('A')]
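
As a quick check of what this produces, here is a minimal sketch on a hypothetical two-racer frame that follows the question's naming scheme (the frame itself is made up for illustration). Note that the ids come out as strings, which matters when they are later compared with an integer WINNER column:

```python
import pandas as pd

# Hypothetical two-racer frame using the question's column naming scheme
df = pd.DataFrame(
    columns=["race_id", "A0", "B0", "C0", "A1", "B1", "C1", "WINNER"]
)

# Everything after the leading feature letter is the participant id
participants = [col[1:] for col in df.columns if col.startswith("A")]
print(participants)  # ['0', '1']
```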

Then loop through permutations of these participants and apply the column name remapping:

import random

import pandas as pd

# Enumerating all 14! orderings is infeasible, so sample random ones instead
n_permutations = 100

# Collect the permuted races here and concatenate once at the end
permuted = []
for _ in range(n_permutations):

  # Create the mapping of participants from a random permutation
  mapping = dict(zip(participants, random.sample(participants, len(participants))))

  # From the participant mapping, create a column mapping.
  # Split each name into feature letter + participant id so that
  # 'A13' is never mistaken for participant '3'.
  columns = {col: col[0] + mapping[col[1:]]
             for col in df.columns if col[1:] in mapping}

  # Remap column names, then restore the original column order
  race = df.rename(columns=columns)[list(df.columns)].copy()

  # Reassign the winner based on the mapping
  # (participant ids are strings; WINNER is assumed to hold integers)
  race['WINNER'] = race['WINNER'].map(lambda w: int(mapping[str(w)]))

  # Collect the races
  permuted.append(race)

races = pd.concat(permuted, ignore_index=True)
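
If a different permutation per race (rather than one permutation per copy of the whole frame) is wanted, the same idea can be sketched with NumPy. This is only an illustration under stated assumptions: the columns are ordered race_id, then one block of features per racer starting with racer 0, then WINNER holding an integer racer id. The function name `permute_races` and its parameters are hypothetical.

```python
import numpy as np
import pandas as pd


def permute_races(df, n_racers, n_features, n_copies, seed=None):
    """Return n_copies block-permuted copies of every race in df.

    Assumes columns are race_id, then n_features columns per racer
    (racer 0 first), then WINNER with the winning racer's integer id.
    """
    rng = np.random.default_rng(seed)
    feature_cols = df.columns[1:-1]

    # Shape (n_races, n_racers, n_features): one feature block per racer
    blocks = df[feature_cols].to_numpy().reshape(len(df), n_racers, n_features)
    blocks = np.repeat(blocks, n_copies, axis=0)
    winners = np.repeat(df["WINNER"].to_numpy(), n_copies)
    race_ids = np.repeat(df["race_id"].to_numpy(), n_copies)

    # order[r, j] = original racer whose block moves into slot j of row r
    order = np.argsort(rng.random((len(blocks), n_racers)), axis=1)
    shuffled = np.take_along_axis(blocks, order[:, :, None], axis=1)

    # The winner's new id is the slot where its original id ended up
    new_winners = np.argmax(order == winners[:, None], axis=1)

    out = pd.DataFrame(shuffled.reshape(len(blocks), -1), columns=feature_cols)
    out.insert(0, "race_id", race_ids)
    out["WINNER"] = new_winners
    return out
```

For the question's data this would be called as `permute_races(df, n_racers=14, n_features=3, n_copies=100)`. Blocks are moved as a unit, so each racer keeps its own A, B, C values.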
Dave
  • Thanks for the reply! Any more clarification on why you think this is not a good way of modelling the race? – AspiringMat Feb 10 '20 at 16:58
  • This model will have one observation per "race", so it will help to explain why one race has a different outcome than another race. Do you really care about this? Usually when someone models race outcomes, they build a model where there is one observation per "race performance". A model built like this will help to explain why one race performance has a different outcome than another. – Dave Feb 10 '20 at 17:19
  • Got it. So would another approach be modelling the individual racer's performance, and then applying say a softmax to the probability of them winning in a group of races? – AspiringMat Feb 10 '20 at 17:47
  • Logistic regression (or comparable) on win True/False. – Dave Feb 10 '20 at 18:20
0

Here's an option for filling your dataframe with permutations of triplets, where df is your dataframe (I left out the winner column mapping; see the chunkwise implementation).

Note that rand_row is just a random row I made up for the sake of example. It is filled with values from 1 to 9 (similar to your dataframe), and each resulting row has 40 columns (1 for the race id, plus 13*3 for the racers), but you can change that, of course:

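import random

def chunkwise(t, size=2):
    # Group a flat sequence into consecutive tuples of length `size`
    it = iter(t)
    return zip(*[it]*size)

def fill(df, size):
    # Example race: 13 racers x 3 features, random values from 1 to 9
    rand_row = [random.randrange(1, 10) for _ in range(13*3)]
    triplets = list(chunkwise(rand_row, 3))
    for i in range(size):
        # Shuffle whole triplets so each racer's features stay together
        shuffled = random.sample(triplets, len(triplets))
        flattened = [item for triplet in shuffled for item in triplet]
        # Row i gets race id i+1 followed by the permuted features
        df.loc[i] = [i+1] + flattened
    return df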
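
For the winner mapping that was left out, one possible sketch is to shuffle racer indices instead of the triplets themselves, so the winner's new position can be looked up afterwards. The helper name `permute_race` and the example winner are hypothetical, and the winner is assumed to be a 0-based racer index:

```python
import random


def permute_race(triplets, winner, rng=random):
    """Shuffle one race's per-racer triplets; return them with the new winner index."""
    order = list(range(len(triplets)))
    rng.shuffle(order)  # order[j] = original racer placed in slot j
    shuffled = [triplets[i] for i in order]
    # The winner now sits in the slot that holds its original index
    return shuffled, order.index(winner)


# First three racers of race 1 from the question; the winner index 2 is
# made up here, since the real winner (3) is not among these three racers
triplets = [(2, 3, 0), (9, 1, 3), (4, 5, 1)]
shuffled, new_winner = permute_race(triplets, winner=2)
```

Whatever order comes out, `shuffled[new_winner]` is the same triplet as `triplets[2]`, so the label keeps pointing at the same racer.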
OfirD