I am working on a healthcare dataset from Kaggle (https://www.kaggle.com/joniarroba/noshowappointments), which contains information about medical appointments in Brazil and whether or not the patient showed up. The dataset has columns for appointment ID, patient ID, appointment date and time, scheduled date and time, and several other features.
I want to add a column to the DataFrame that shows the no-show rate for each appointment based on the patient's previous appointments. For example, if a patient has had three appointments and showed up for two of them, the no-show rate for their fourth appointment would be 1/3. If the patient is a first-time appointment, the no-show rate would be 0.
# convert appointment and scheduled dates to datetime format
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay'])
df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay'])
# create a new column with the time difference between scheduled and
appointment date
df['time_diff'] = (df['AppointmentDay'].dt.date -
df['ScheduledDay'].dt.date).dt.days
# group by PatientId and calculate no-show count and appointment count
for each group
grouped = df.groupby('PatientId')['No-show'].apply(lambda x:
x.eq('Yes').cumsum().shift().fillna(0))
df['no_show_count'] = grouped
df['appointment_count'] = grouped + df.groupby('PatientId').cumcount()
# calculate no-show rate for each patient
df['no_show_rate'] = df['no_show_count'] / df['appointment_count']
# replace NaN values in 'no_show_rate' column with 0
df['no_show_rate'] = df['no_show_rate'].fillna(0)
# print first 5 rows
print(df.head())
the problem is in this code is it calculate current appointment. for
example if you
df[df['PatientId'] ==
112397157856688.0].sort_values('AppointmentDay') ,
you will understand
better what i mean