0

I'm trying to work out the end date and have succeeded but code takes long to run. How can I improve the following code? Also df['end_date'] is a new variable? i tried: df['end_date'] = [], and appending it but getting a length error. I therefore wrote the below Many thanks, d

i = 0
j = 0
df['end_date'] = df['start_date']
for i in range(len(df['start_date'])):
    for j in range(len(df['term'])):
        df['end_date'] = (df['start_date'][i].date() +  dt.timedelta(df['term'][j]*365/12))
        i+=1
        j+=1

my dataset looks like :

start_date term

  1. 2010-03-01 24
  2. 2009-11-01 36
  3. 2012-08-01 24
Jienkles
  • 133
  • 9
  • 2
    can you add an example of input data? couple rows would be enough – Marat May 06 '20 at 18:18
  • 1
    If you want to do this *correctly* you will probably need to loop. That is, a month is not a well-defined unit of time and adding calendar months is not vectorized for different offsets. Sure you can add 365/12 days, but that can be sloppy and lead to issues. If you truly want to add a calendar month (such that Feb 12 + 1 month is March 12, and not like 4 PM on March 14th) then you can loop over the unique month offsets, instead of rows in the DataFrame. It's a much more manageable number of iterations – ALollz May 06 '20 at 18:24
  • I.e. See something like https://stackoverflow.com/questions/58174267/computing-age-from-to-timedelta-is-weird-and-dateoffset-is-not-scalable-over-a. That uses `years` as the offset, but you could easily change it to `months` for calendar month addition – ALollz May 06 '20 at 18:46

2 Answers2

2

Assuming 'end_date' is a datetime, something like this should work:

df['end_date'] = df['start_date'] + pd.to_timedelta(df['term']*365/12, unit='d')

However, the proper way of handling month offsets would be to use pd.DateOffset:

# if the offset was the same for all records, it would be ok to just
# offsets = pd.DateOffset(months=the_offset)
# but here, we need to handle different values
offsets = df['term'].map(lambda term: pd.DateOffset(months=term))
df['end_date'] = df['start_date'] + offsets
Marat
  • 15,215
  • 2
  • 39
  • 48
0

If I understand correctly what you want to achieve, you should be able to do this:

df['end_date'] = df['start_date'].date() +  dt.timedelta(df['term']*365/12)

You are looping over a dataframe, which you should avoid. Try to use the pandas API directly as much as possibly to use its underlying performance optimizations.

Also, you are manually incrementing i and j, but these are also the indices in your for loops and therefore incremented by the loop operator. So it should not be necessary to increment them manually.

Fabian Hertwig
  • 1,093
  • 13
  • 27
  • 1
    `dt.timedelta` won't work on pd.Series. Use pd.to_timedelta for this – Marat May 06 '20 at 18:24
  • @Fabian Hertwig thanks, though i get a AttributeError: 'Series' object has no attribute 'date' – Jienkles May 06 '20 at 18:48
  • @Marat thanks . do you mean on the start date variable, i get TypeError: dtype datetime64[ns] cannot be converted to timedelta64[ns] – Jienkles May 06 '20 at 18:50
  • @Jienkles if `start_date` is a string, use `pd.to_datetime(df['start_date']).dt.date` instead. However, it would be a lot better to use `pd.DateOffset` to work with month offsets instead – Marat May 06 '20 at 19:14
  • @Jienkles if it is a datetime column (seems to be the case), then just `df['start_date'].dt.date` – Marat May 06 '20 at 19:15
  • @Marat no it was a datetime already thanks – Jienkles May 06 '20 at 19:20
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/213264/discussion-between-jienkles-and-marat). – Jienkles May 06 '20 at 19:21
  • @ALollz thanks but i get the same as the other errors, which is AttributeError: 'DataFrame' object has no attribute 'date' – Jienkles May 06 '20 at 19:28
  • @Marat gives error PerformanceWarning, but works, so im happy thanks a million – Jienkles May 06 '20 at 19:45