0

I have a csv file (Delimited with '|') with data something like this:

|origin|destination|class|eff_start_date|eff_end_date|price
0|A|B|1|2016-01-01|2016-12-31|37.4
1|A|B|2|2016-01-01|2016-12-31|30.4
2|A|C|1|2016-01-01|2016-12-31|94.0
3|A|C|2|2016-01-01|2016-12-31|80.0
4|A|B|1|2017-01-01|2017-12-31|38.4
5|A|B|2|2017-01-01|2017-12-31|31.4
6|A|C|1|2017-01-01|2017-12-31|95.0
7|A|C|2|2017-01-01|2017-12-31|81.0
8|A|C|2|2017-02-01|2017-05-20|82.0

This file gives the fare for a particular origin, destination. eff_start_date & eff_end_date show the period for which a particular fare is valid.

I am writing a code to validate this file in a way that there should never be 2 separate fare values for any date. in the example above, records, 7 & 8 are conflicting as for same origin, destination & class, between February, 2017 to 20-may-2017 there may be 2 separate fare records.

Below is the snippet of my code but it didn't work, i am not able to figure out whats wrong.

import pandas as pd

df_fares = pd.read_csv('input.csv', sep='|')
df_fares.eff_start_date = pd.to_datetime(df_fares.eff_start_date, format='%Y-%m-%d')
df_fares.eff_end_date = pd.to_datetime(df_fares.eff_end_date, format='%Y-%m-%d')
df_fares['overlap'] = (df_fares.groupby(['origin', 'destination', 'class'])
            .apply(lambda x: (x['eff_start_date'].shift() - x['eff_end_date']) > timedelta(0))
            .reset_index(drop=True))

elyptikus
  • 936
  • 8
  • 24
Mohan
  • 4,677
  • 7
  • 42
  • 65

1 Answers1

0

I think you put the shift at the wrong position. This would then point out the rows, where the end date from the previous row and the start date from the current row are "overlapping".

import pandas as pd
from datetime import timedelta

df_fares = pd.read_csv('input.csv', sep='|')
df_fares.eff_start_date = pd.to_datetime(df_fares.eff_start_date, format='%Y-%m-%d')
df_fares.eff_end_date = pd.to_datetime(df_fares.eff_end_date, format='%Y-%m-%d')
df_fares['overlap'] = (df_fares.groupby(['origin', 'destination', 'class'])
                           .apply(lambda x: x['eff_start_date'] - x['eff_end_date'].shift() < timedelta(0))
                            .reset_index(drop=True))

But be careful, this is not the whole story yet. Your rows are not ordered, this means, that the evaluation depends on how the dataframe was loaded. Maybe it is worth checking out some of this ideas: Efficient date range overlap calculation in python?

There is another issue:

never be 2 separate fare values

You need to include a check if the values are the same or not. You could do this maybe by dropping all the rows, that are fine for you anyways before adding the overlap column:

df_fares = df_fares.drop_duplicates(subset=['origin', 'destination', 'class', 'price'])
elyptikus
  • 936
  • 8
  • 24
  • this did not work. I am still getting false on the overlap columns – Mohan Oct 20 '21 at 10:52
  • The code should give you the output `False` for all the rows except the last one `8|A|C|2|2017-02-01|2017-05-20|82.0` should be `True`. Is this not the case for you? – elyptikus Oct 20 '21 at 12:11