0

I have a dataset with 3 columns - DriverID, Race, Place. DriverID

 DriverID Race  Place
    83    1      1
    18    1      2
    20    1      3
    48    1      4
    53    1      5

For each race, I want to calculate a matrix (numpy array) of pairwise differences in the Place column across DriverID. The issue is that not all DriverID and races are represented. So I decided to first create a complete cross joined table for every combination of DriverID and Race as follows (reproducible example below):

url = "http://personal.psu.edu/drh20/code/btmatlab/nascar2002.txt"
races_trimmed = pd.read_table(url, sep=" ")

# Create a cartesian product of unique drivers and races to get every combination
unq_drivers = sorted(races_trimmed["DriverID"].unique())
unq_drivers = [x for x in unq_drivers if str(x) != 'nan']
unq_races = sorted(races_trimmed["Race"].unique())
unq_races = [x for x in unq_races if str(x) != 'nan']

# Get a dataframe 
unq_drivers_df = pd.DataFrame(unq_drivers, columns=["DriverID"])
unq_races_df = pd.DataFrame(unq_races, columns=["Race"])

# Let's cross join the columns to get all unique combinations of drivers and races
all_driver_race_combs = unq_drivers_df.assign(foo=1).merge(unq_races_df.assign(foo=1)).drop('foo', 1)
all_driver_race_combs = all_driver_race_combs.sort_values(by=['Race', 'DriverID'])
all_driver_race_mg = pd.merge(all_driver_race_combs, races_trimmed,  how='left', 
                              left_on=['DriverID','Race'], right_on = ['DriverID','Race'])

Now to get the pairwise differences I proceed as follows (using the approach from here:

# Now let's do a pairwise difference in finish across drivers for a 
# single race
# based on https://stackoverflow.com/questions/46266633/pandas-creating-difference-matrix-from-data-frame
race_num = 2.0
race_res = all_driver_race_mg[all_driver_race_mg["Race"] == race_num]
race_res = race_res.sort_values(by=['DriverID'])

arr = (race_res['Place'].values - race_res['Place'].values[:, None])
new_race_1 = pd.concat((race_res['DriverID'], pd.DataFrame(arr, columns=race_res['DriverID'])), axis=1)

# Remove the first column - it has the DriverID in the pairwise matrix
new_race_1 = new_race_1.values[:, 1:]
new_race_1.shape

You can see that this outputs (166, 83) array instead of (83, 83) array for race_num = 2.0. For race_num = 1.0 it works, but for all other races it does not. Could anyone explain how to correct the calc i.e. to output the 83 * 83 matrix for every valid race_num? I think it is the nan values but not sure how to fix this?

user4687531
  • 1,021
  • 15
  • 30
  • Not sure what is the problem, but what I'm sure is that there is no point of concatenating `race_res['DriverID']` with `pd.DataFrame(arr, columns=race_res['DriverID'])` if you remove the first column with `new_race_1.values[:, 1:]` after. At the end of your code, `new_race_1 == arr`, no? – Ben.T Nov 05 '18 at 15:02
  • Actually, what is wrong is a problem of index when performing the `concat`, you can solve it for example by defining the index when creating the DataFrame to be the same than `race_res['DriverID']` such as `pd.DataFrame(arr, columns=race_res['DriverID'], index=race_res['DriverID'].index)`. but again, I don't see the point of this step in the code – Ben.T Nov 05 '18 at 15:20

0 Answers0