1

I have a csv in which I have two columns representing start date: st_dt and end date: 'end_dt` , I have to subtract these columns to get the number of weeks. I tried iterating through columns using pandas, but it seems my output is wrong.

 st_dt                 end_dt
---------------------------------------
20100315           20100431
Sociopath
  • 13,068
  • 19
  • 47
  • 75
Quant
  • 61
  • 1
  • 7
  • show us what you tried and what's the output you getting? Also provide expected output. – Sociopath Feb 14 '18 at 05:45
  • This is what i am coding, for idx, date in data.iterrows(): data['expdate'] = datetime.strptime( str(date['exdate']), '%Y%m%d') - datetime.strptime( str(date['date']), '%Y%m%d') after this how can i update the my results in CSV file – Quant Feb 14 '18 at 06:21
  • @Quant - what is expected output from your input? – jezrael Feb 14 '18 at 06:23
  • expected output should be in number of days, which i can convert into weeks or months . Please note i have thousand of dates in csv. – Quant Feb 14 '18 at 06:26
  • @Quant - `20100431` does not exist, only `20100430` it is typo? – jezrael Feb 14 '18 at 06:29

1 Answers1

2

Use read_csv with parse_dates for datetimes and then after substract days:

df = pd.read_csv(file, parse_dates=[0,1])
print (df)
       st_dt     end_dt
0 2010-03-15 2010-04-30

df['diff'] = (df['end_dt'] - df['st_dt']).dt.days
print (df)
       st_dt     end_dt  diff
0 2010-03-15 2010-04-30    46

If some dates are wrong like 20100431 use to_datetime with parameter errors='coerce' for convert them to NaT:

df = pd.read_csv(file)
print (df)
      st_dt    end_dt
0  20100315  20100431
1  20100315  20100430

df['st_dt'] = pd.to_datetime(df['st_dt'], errors='coerce', format='%Y%m%d')
df['end_dt'] = pd.to_datetime(df['end_dt'], errors='coerce', format='%Y%m%d')

df['diff'] = (df['end_dt'] - df['st_dt']).dt.days
print (df)
       st_dt     end_dt  diff
0 2010-03-15        NaT   NaN
1 2010-03-15 2010-04-30  46.0
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thanks for your effort, but this is the way i am doing , because the values from CSV are in raw format i think we cannot directly use dt.days values = [] for idx, date in data.iterrows(): result = datetime.strptime( str(date['exdate']), '%Y%m%d') - datetime.strptime( str(date['date']), '%Y%m%d') values.append(result) – Quant Feb 14 '18 at 07:11
  • Yes, the best not use it, because `iterrows` very slow. – jezrael Feb 14 '18 at 07:12
  • check [this](https://stackoverflow.com/questions/24870953/does-iterrows-have-performance-issues/24871316#24871316) – jezrael Feb 14 '18 at 07:13