2

I have a data frame that I split from one column into two columns, like this.

df_all_files = pd.DataFrame(df_all_files.string.str.split('.',1).tolist(), columns = ['string','the_date'])

That give me a bunch of file extensions and a bunch of dates, all in the same column. I want to coerce anything that looks like it could be a date into an actual date, and delete anything that isn't a date. Is that doable?

Here's before sample of what I have.

                                                                     string     the_date

-rw-r--r--   64 30067    10224         616 Nov 01 17:46 ASEJPN_ModelHolidays    20181101
-rw-r--r--   64 30067    10224         616 Dec 03 19:23 ASEJPN_ModelHolidays    20181201
-rw-r--r--   74 30067    10224        4938 Oct 04 03:28 AS1181003               RATE

This is what I'd like it to look like after.

                                                                     string     the_date

-rw-r--r--   64 30067    10224         616 Nov 01 17:46 ASEJPN_ModelHolidays    20181101
-rw-r--r--   64 30067    10224         616 Dec 03 19:23 ASEJPN_ModelHolidays    20181201
-rw-r--r--   74 30067    10224        4938 Oct 04 03:28 AS1181003               181003

Now, I'm running this one-liner.

df_all_files['the_date'] = df_all_files['the_date'].dt.date

I'm getting this error.

AttributeError: Can only use .dt accessor with datetimelike values

I tried this as well.

df_all_files['the_date'] = df_all_files['string'].astype('datetime64[ns]')

As soon as it encounters a non-date, it give me this error.

ValueError: ('Unknown string format:', 'ach1')
ASH
  • 20,759
  • 19
  • 87
  • 200

1 Answers1

3

Use pd.to_datetime to convert your series to datetime first, using errors='coerce' to ensure non-convertible values are replaced by NaN:

df_all_files['the_date'] = pd.to_datetime(df_all_files['the_date'], errors='coerce').dt.date

However, I strongly recommend you avoid conversion to dt.date as this converts your series to an object dtype series of datetime.date objects, as opposed to an efficient Pandas datetime series.

Here are 4 examples of problems caused by using Python datetime.date or datetime.datetime objects with Pandas:

  1. TypeError: Cannot compare type 'Timestamp' with type 'date'
  2. Parse a Pandas column to Datetime
  3. Why my code didn't select data from Pandas dataframe?
  4. Filtering pandas dataframe by day
jpp
  • 159,742
  • 34
  • 281
  • 339
  • Oh, man, I love this!! It does exactly what I want!! Thanks so much!! I have one more problem. In my original post, some dates come after the last '_' character, so I came up with this one-liner: df_all_files = pd.DataFrame(df_all_files.string.str.rsplit('_',1).tolist(), columns = ['string','the_date']) That parses the date after the '_' character, into the next column, but it over writes any dates in that column. How can I move three dates into the column named 'the_dates' (190103, 190104, and 190107) but skip 'CountryHolidays' because real dates are already in 'the_dates' column? – ASH Jan 09 '19 at 02:09
  • @ryguy72, Sorry, I'm having trouble understanding the problem. I suggest you post a new question with what you've tried. – jpp Jan 09 '19 at 09:15