Pairwise differences in column in Pandas

Question

I have a dataset with 3 columns - DriverID, Race, Place. DriverID

 DriverID Race  Place
    83    1      1
    18    1      2
    20    1      3
    48    1      4
    53    1      5

For each race, I want to calculate a matrix (numpy array) of pairwise differences in the Place column across DriverID. The issue is that not all DriverID and races are represented. So I decided to first create a complete cross joined table for every combination of DriverID and Race as follows (reproducible example below):

url = "http://personal.psu.edu/drh20/code/btmatlab/nascar2002.txt"
races_trimmed = pd.read_table(url, sep=" ")

# Create a cartesian product of unique drivers and races to get every combination
unq_drivers = sorted(races_trimmed["DriverID"].unique())
unq_drivers = [x for x in unq_drivers if str(x) != 'nan']
unq_races = sorted(races_trimmed["Race"].unique())
unq_races = [x for x in unq_races if str(x) != 'nan']

# Get a dataframe 
unq_drivers_df = pd.DataFrame(unq_drivers, columns=["DriverID"])
unq_races_df = pd.DataFrame(unq_races, columns=["Race"])

# Let's cross join the columns to get all unique combinations of drivers and races
all_driver_race_combs = unq_drivers_df.assign(foo=1).merge(unq_races_df.assign(foo=1)).drop('foo', 1)
all_driver_race_combs = all_driver_race_combs.sort_values(by=['Race', 'DriverID'])
all_driver_race_mg = pd.merge(all_driver_race_combs, races_trimmed,  how='left', 
                              left_on=['DriverID','Race'], right_on = ['DriverID','Race'])

Now to get the pairwise differences I proceed as follows (using the approach from here:

# Now let's do a pairwise difference in finish across drivers for a 
# single race
# based on https://stackoverflow.com/questions/46266633/pandas-creating-difference-matrix-from-data-frame
race_num = 2.0
race_res = all_driver_race_mg[all_driver_race_mg["Race"] == race_num]
race_res = race_res.sort_values(by=['DriverID'])

arr = (race_res['Place'].values - race_res['Place'].values[:, None])
new_race_1 = pd.concat((race_res['DriverID'], pd.DataFrame(arr, columns=race_res['DriverID'])), axis=1)

# Remove the first column - it has the DriverID in the pairwise matrix
new_race_1 = new_race_1.values[:, 1:]
new_race_1.shape

You can see that this outputs (166, 83) array instead of (83, 83) array for race_num = 2.0. For race_num = 1.0 it works, but for all other races it does not. Could anyone explain how to correct the calc i.e. to output the 83 * 83 matrix for every valid race_num? I think it is the nan values but not sure how to fix this?

Not sure what is the problem, but what I'm sure is that there is no point of concatenating `race_res['DriverID']` with `pd.DataFrame(arr, columns=race_res['DriverID'])` if you remove the first column with `new_race_1.values[:, 1:]` after. At the end of your code, `new_race_1 == arr`, no? — Ben.T, Nov 05 '18 at 15:02
Actually, what is wrong is a problem of index when performing the `concat`, you can solve it for example by defining the index when creating the DataFrame to be the same than `race_res['DriverID']` such as `pd.DataFrame(arr, columns=race_res['DriverID'], index=race_res['DriverID'].index)`. but again, I don't see the point of this step in the code — Ben.T, Nov 05 '18 at 15:20

Pairwise differences in column in Pandas

0 Answers0