Matching Closest Date From Same Data Frame

Question

I'm working with data about the attendance of major league baseball game attendance.

I'm trying to create a new column in my data frame that returns the closest date (but cannot be after given date) of a specified rival team's game.

For example, for a row that has data about a Los Angeles Angels game:

Game_Num      Date         Team        Win      Attendance      Net Wins
23            2010-04-05   LAA         1        43504           12

I would want to find the closest previous date for a Los Angeles Dodgers ('LAD') game and attach it in a new column.

My end goal is to create another column that shows what the rival team's net win is going into a game so that I can see if another team is having a good season if it affects ticket sales.

This is what I have tried so far:

for index, row in bbattend.iterrows():
    if row['Team'] == 'LAA':
        basedate = row['Date']
        tempdf = bbattend.loc[(bbattend['Team'] == 'LAD') & (bbattend['Date'] < basedate)]
        tempdf['Datediff'] = abs(basedate-tempdf['Date']).days
        mindiff = tempdf['Datediff'].min()
        bbattend['CloseRivalDate'] = tempdf[tempdf['Date']==mindiff]['Date']
        bbattend['RivalNetWins'] = tempdf[tempdf['Date']==mindiff]['Net_Wins']
        bbattend['RivalWinPer'] = tempdf[tempdf['Date']==mindiff]['Win_Per']

This is the error that I get from it:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-128-f2be88528772> in <module>
      3         basedate = row['Date']
      4         tempdf = bbattend.loc[(bbattend['Team'] == 'LAD') & (bbattend['Date'] < basedate)]
----> 5         tempdf['Datediff'] = abs(basedate-tempdf['Date']).days
      6         mindiff = tempdf['Datediff'].min()
      7         bbattend['CloseRivalDate'] = tempdf[tempdf['Date']==mindiff]['Date']

~/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5065             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5066                 return self[name]
-> 5067             return object.__getattribute__(self, name)
   5068 
   5069     def __setattr__(self, name, value):

AttributeError: 'Series' object has no attribute 'days'

Here is my code for my data frame in case it would help at all:

import requests
import pandas as pd
import numpy as np
from datetime import datetime
import re

Teams = ['LAA', 'LAD', 'NYY', 'NYM', 'CHC', 'CHW', 'OAK', 'SFG']
Years = []
for year in range(2010,2020):
    Years.append(str(year))

list_of_df = list()

for team in Teams:
    for year in Years:
        url = 'https://www.baseball-reference.com/teams/' + team + '/' + year +'-schedule-scores.shtml'
        dfname = team + '_' + year
        html = requests.get(url).content
        df_list = pd.read_html(html)
        df = df_list[-1]

        #Formatting data table
        df.rename(columns={"Gm#": "GM_Num", "Unnamed: 4": "Home", "Tm": "Team", "D/N": "Night"}, inplace = True)
        df['Home'] = df['Home'].apply(lambda x: 0 if x == '@' else 1)
        df['Game_Win'] = df['W/L'].astype(str).str[0]
        df['Game_Win'] = df['Game_Win'].apply(lambda x: 0 if x == 'L' else 1)
        df['Night'] = df['Night'].apply(lambda x: 1 if x == 'N' else 0)
        df['Streak'] = df['Streak'].apply(lambda x: -1*len(x) if '-' in x else len(x))
        df.drop('Unnamed: 2', axis=1, inplace = True)
        df.drop('Orig. Scheduled', axis=1, inplace = True)
        df.drop('Win', axis=1, inplace = True)
        df.drop('Loss', axis=1, inplace = True)
        df.drop('Save', axis=1, inplace = True)
        #Drop rows that do not have data
        df = df[df['GM_Num'].str.isdigit()]
        WL = df["W-L"].str.split("-", n = 1, expand = True)
        df["Wins"] = WL[0].astype(dtype=np.int64)
        df["Losses"] = WL[1].astype(dtype=np.int64)
        df['Net_Wins'] = df['Wins'] - df['Losses']
        df['Win_Per'] = df['Wins']/(df['Wins']+df['Losses'])
        DayDate = df['Date'].str.split(", ", n = 1, expand = True)
        df['DayOfWeek'] = DayDate[0]
        df['Date'] = DayDate[1] + ', ' + year
        df['Date'] = [re.sub("\s\(\d+\)", "", str(x)) for x in df['Date']]
        df['Date'] = pd.to_datetime(df['Date'], format='%b %d, %Y')
        list_of_df.append(df)

bbattend = pd.concat(list_of_df)
bbattend

I know this definitely isn't the most efficient way to do this, but it gets the result that I want.

Have you happened to see this [thread1](https://stackoverflow.com/questions/17249220/getting-the-closest-date-to-a-given-date) or [thread2](https://stackoverflow.com/questions/32237862/find-the-closest-date-to-a-given-date)? — Milad Sikaroudi, Nov 10 '19 at 00:27
Yes, I added something based on those threads to my code, but I'm having errors that I still can't work out — Laurel, Nov 10 '19 at 01:29

score 0 · Accepted Answer · answered Nov 10 '19 at 14:56

The particular error you are seeing can be fixed by replacing the line

tempdf['Datediff'] = abs(basedate-tempdf['Date']).days

by

tempdf['Datediff'] = abs(basedate-tempdf['Date']).dt.days

However, your code will still produce incorrect results.

In my view, the proper way to do what you want is to perform a self-join on your table in which you merge the Team and Opp, then calculate the date difference between the Team's game date and every Opp's game in the season, filter for Opp's games that occur before the Team's game, and, finally, keep only the closest Opp's game to the Team's game.

Here's code that does that:

# Create column Year to help on self-join
bbattend["Year"] = bbattend.Date.dt.year

# Create merged table
merged = bbattend.merge(
    bbattend[["Date", "Year", "Team", "Net_Wins", "Win_Per"]],
    how="inner",
    left_on=["Year", "Opp"],
    right_on=["Year", "Team"],
    suffixes=('', '_opp')
)
merged["date_diff"] = (merged.Date - merged.Date_opp).dt.days

# Keep only closest game from Opp
def get_closest_date(g):
    row_to_keep = g.date_diff.idxmin()
    return g.loc[row_to_keep, ["Date_opp", "Team_opp", "Net_Wins_opp", "Win_Per_opp", "date_diff"]]

merged.groupby(bbattend.columns.to_list()).apply(get_closest_date).reset_index()

The resulting table will have added columns with the Opp team's Net_wins and Win_per after their previous game.

Notes:

Because of the inner join used, the table will only gave games for which both the Team and Opp are in your original table's Team column.
The resulting table has pairs of symmetric rows, where the Team and Opp switch. If you want to get rid of those, just filter for Home == 1

Finally, this is really pushing the limits of what I would do in pandas. If your tables grow larger, the self join will require quadratically more memory. I suggest you load this table into a relational database (sqlite would be the simplest to use, as it comes with Python), and do this computation using SQL.

Thank you so much. I modified the code a little bit because I want to match the info the same market rival (match Los Angeles Angels to Los Angeles Dodgers) so I created a new column called Same_Mkt_Team specifying the rival, and then replaced ```left_on=["Year", "Opp"]``` with ```left_on=["Year", "Same_Mkt_Team"]``` , however, when I'm looking at the resulting dataframe the date_diff column seems really off and the dates that get matched aren't very close together. The same thing happened when I didn't modify your code at all. — Laurel, Nov 10 '19 at 17:20
Also, when I add ```merged = merged[merged['date_diff'] > 0]``` after creating the 'date_diff' column, the final data seems wrong, because when you sort it by date, there are duplicate games — Laurel, Nov 10 '19 at 18:03
Are the duplicate games the symmetry aspect I mentioned in the answer or something else? — foglerit, Nov 10 '19 at 22:55

score 0 · Answer 2 · answered Nov 10 '19 at 19:38

This is the final code that I ended up using: It is based on @foglerit's answer

#Create game_id which will be used to delete duplicates later
bbattend['game_id'] = bbattend['Team'] + bbattend['Date'].astype(str)
#Create year variable for matching
bbattend['Year'] = bbattend.Date.dt.year

# Create merged table
# Will match all dates of games of team with dates within same year of teams from same-market team 
merged = bbattend.merge(
    bbattend[["Date", "Year", "Team", "Net_Wins", "Win_Per"]],
    how="inner",
    left_on=["Year", "Same_Mkt_Team"],
    right_on=["Year", "Team"],
    suffixes=('', '_Same_Mkt_Team')
)

merged["date_diff"] = (merged.Date - merged.Date_Same_Mkt_Team).dt.days
#Only keep the dates of same-market team that occurred before the date of home team's game
merged = merged[merged['date_diff'] > 0]

#Sort by date_diff so closest dates appear first
merged.sort_values(by='date_diff', inplace = True)

#Only keep first game_id which will include the data of the same-market team for the closest date before the game
merged.drop_duplicates(subset =['game_id'], keep = 'first', inplace = True)

merged

I added the condition that date_diff has to be positive because I wanted dates of games for the same-market team that occurred before the game.

Then I sorted the dataframe by date_diff and dropped dupicates of the game_id so that the final dataframe only had the minimum date_diff.

Matching Closest Date From Same Data Frame

2 Answers2