1

I have a dataframe df as follows where a Date field is present. I am reading this from excel using pd.read_excel() method.

    Company Code    Trxn_date                Company Name     Type
      20040       2019-05-11 00:00:00             ABC        Series A #<--the date is in `datetime` object only.
      20034       2019-04-26 00:00:00             XYZ        Series A
      20033       "5/15/2018\n23/4/2019"          PQR       "Series A  
                                                             Series B" # <-- In same row. 
      20040       2019-06-05 00:00:00             ABC        Series B
      20056        8/16/2019                      MNO        Series B

As you can see that for 20033 there are two entries in Trxn_date apart by \n. Same is true for Type field. Because of this, if I apply pd.to_datetime(df['Trxn_date']) I get an obvious error TypeError: invalid string coercion to datetime. I do not want to coerce option.

Please note that except 20033 and 20056 all dates are automatically converted to datetime object by pandas.

I would like to get the df as follows.

    Company Code    Trxn_date         Company Name    Type
      20040       2019-05-11            ABC         Series A   
      20034       2019-04-26            XYZ         Series A
      20033       2019-04-23            PQR         Series B  #<--Only the last date string is picked up and converted to datetime. 
      20040       2019-06-05            ABC         Series B
      20056       2019-08-16            MNO         Series B #<--The date format is changed to `yyyy-mm-dd`. 

I am not able to get any clue to achieve the same above. For 20056 I can use pd.to_datetime(df['Trxn_date'],errors='coerce').apply(lambda x : x.strftime('%Y-%m-%d') if pd.notnull(x) else ' '). This operation creates a blank in Trxn_date field for 20033.

Can anybody give any insight on this? May be I have to write a function and then use lambda for the same?

pythondumb
  • 1,187
  • 1
  • 15
  • 30

1 Answers1

2

You can split by \n and get last lists by str[-1], but separator \n depends of real data, so should be different:

df['Trxn_date'] = df['Trxn_date'].str.split('\n').str[-1]
df['Type'] = df['Type'].str.split('\n').str[-1]

df['Trxn_date'] = pd.to_datetime(df['Trxn_date'],errors='coerce').dt.strftime('%Y-%m-%d')
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252