
I have date columns that are of string/object type:

     Day - 2022  Day - 2021  ...
0    01/01/2022  01/01/2021  ...
1    02/01/2022  02/01/2021  ...
2    03/01/2022  03/01/2021  ...
3    04/01/2022  04/01/2021  ...
4    05/01/2022  05/01/2021  ...
..          ...         ...  ...
725  27/12/2023         NaN  ...
726  28/12/2023         NaN  ...
727  29/12/2023         NaN  ...
728  30/12/2023         NaN  ...
729  31/12/2023         NaN  ...

I can cast to date no problem like so:

pd.to_datetime(df["Day - 2022"])
0     2022-01-01
1     2022-02-01
2     2022-03-01
3     2022-04-01
4     2022-05-01
         ...    
725   2023-12-27
726   2023-12-28
727   2023-12-29
728   2023-12-30
729   2023-12-31
Name: Day - 2022, Length: 730, dtype: datetime64[ns]

But when I use np.where to detect whether a string column actually contains dates in string format, it returns big integers instead:

col = "Day - 2022"
pattern = r"^(0?[1-9]|[12][0-9]|3[01])[/\-](0?[1-9]|1[012])[/\-]\d{2,4}"
df[col] = np.where(
    df[col].str.match(pattern),
    pd.to_datetime(df[col]),
    df[col],
)
              Day - 2022  Day - 2021  ... 
0    1640995200000000000  01/01/2021  ... 
1    1643673600000000000  02/01/2021  ... 
2    1646092800000000000  03/01/2021  ... 
3    1648771200000000000  04/01/2021  ... 
4    1651363200000000000  05/01/2021  ... 
..                   ...         ...  ... 
725  1703635200000000000         NaN  ... 
726  1703721600000000000         NaN  ... 
727  1703808000000000000         NaN  ... 
728  1703894400000000000         NaN  ... 
729  1703980800000000000         NaN  ... 

I'm confused as to why this is happening. Any ideas on how to prevent it?

  • Seems like a timestamp in nanoseconds, try to play around with pd.to_datetime() parameters (unit, origin,...) – Let's try Dec 07 '22 at 18:07
  • No idea but it seems to be converting the string into UNIX time format so just change it to `pd.to_datetime(df[col],unit='ns')` or using `datetime.datetime.fromtimestamp(1640995200000000000 * 10e-10)` – PhilippB Dec 07 '22 at 18:10
  • The reason is `df[col]` is of object type and by default, the use in np.where of either a datetime type (from `pd.to_datetime(df[col])`) or an object type (from `df[col]`) will result in an array of object type. I understand that in your example, the `df[col]` is never really possible in the `np.where`, but the casting to object of the result is probably done before. Try yourself that the result is the same if you cast to object `pd.to_datetime(df["Day - 2022"]).to_numpy().astype('object')` – Ben.T Dec 07 '22 at 19:06
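Ben.T's point in the last comment can be checked directly: a `datetime64[ns]` array cast to object cannot be represented as Python `datetime` objects (which only go down to microseconds), so numpy falls back to raw nanosecond integers. This minimal sketch reproduces the integers seen in the question:

```python
import pandas as pd

# a datetime64[ns] Series, as produced by pd.to_datetime
s = pd.to_datetime(pd.Series(["2022-01-01"]))

# np.where picks a common dtype for its two branches; mixing
# datetime64[ns] with an object column resolves to object, and
# datetime64[ns] -> object falls back to nanosecond integers
as_object = s.to_numpy().astype("object")
print(as_object[0])  # 1640995200000000000
```

This matches the `1640995200000000000` shown for `01/01/2022` in the question's output.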

1 Answer


Here's what worked for me:

import numpy as np
import pandas as pd

col = "Day - 2022"
pattern = r"^(0?[1-9]|[12][0-9]|3[01])[/\-](0?[1-9]|1[012])[/\-]\d{2,4}"
df[col] = np.where(
    df[col].str.match(pattern),
    # casting down to second precision lets numpy hand back datetime
    # objects instead of raw nanosecond integers
    np.asarray(
        df[col].astype(np.datetime64),
        dtype="datetime64[s]",
    ),
    df[col],
)

So np.where was converting the date strings into a NumPy array of Unix timestamps in nanoseconds; I then used the answer from here to convert the array into datetime types.
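As an aside, a pandas-native alternative (a sketch, not part of the accepted approach) is to skip np.where entirely and assign through a boolean mask with `.loc`, so the datetimes never pass through a numpy object array. `dayfirst=True` is an assumption based on the dd/mm/yyyy dates in the question:

```python
import pandas as pd

df = pd.DataFrame({"Day - 2022": ["01/01/2022", "27/12/2023", "not a date"]})
col = "Day - 2022"
pattern = r"^(0?[1-9]|[12][0-9]|3[01])[/\-](0?[1-9]|1[012])[/\-]\d{2,4}"

# convert only the rows that look like dates; everything else is untouched
mask = df[col].str.match(pattern, na=False)
df.loc[mask, col] = pd.to_datetime(df.loc[mask, col], dayfirst=True)
```

The matching rows end up as `Timestamp` objects and the non-matching strings are left alone, with no nanosecond integers appearing.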
