6

I am currently using python trying to split a datetime column into 2, one for Date and one for time and also have the column properly formatted.

ORIGINAL DATASET

INCIDENT_DATE
12/31/2006 11:20:00 PM
12/31/2006 11:30:00 PM
01/01/2007 00:25
01/01/2007 00:10
12/31/2006 11:30:00 AM
01/01/2007 00:05
01/01/2007 00:01
12/31/2006 4:45:00 PM
12/31/2006 11:50:00 PM
**01/01/2007**

*I have used 2 codes, one to format the column and the other that splits it. However, after formatting the column, missing time values were giving 00:00:00 value, here indicating a time for 12 midnight.See below

AFTER FORMATTING

2006-12-31 23:20:00
2006-12-31 23:30:00
2007-01-01 00:25:00
2007-01-01 00:10:00
2006-12-31 11:30:00
2007-01-01 00:05:00
2007-01-01 00:01:00
2006-12-31 16:45:00
2006-12-31 23:50:00
**2007-01-01 00:00:00**

Codes used:

## Format datetime column
crimeall['INCIDENT_DATE'] = pd.DatetimeIndex(crimeall['INCIDENT_DATE'])

##Split DateTime column
crimeall['TIME'],crimeall['DATE']= crimeall['INCIDENT_DATE'].apply(lambda x:x.time()), crimeall['INCIDENT_DATE'].apply(lambda x:x.date())

Is there away to do this without having the missing time value set at 00:00:00? Is it possible to have these missing values recorded as Nan while formatting the datetime?

Any thoughts on how I can achieve a formatted datetime showing the missing time values as NaN.

WHAT I WOULD LIKE IT TO LOOK LIKE

2006-12-31 23:20:00
2006-12-31 23:30:00
2007-01-01 00:25:00
2007-01-01 00:10:00
2006-12-31 11:30:00
2007-01-01 00:05:00
2007-01-01 00:01:00
2006-12-31 16:45:00
2006-12-31 23:50:00
**2007-01-01 NaN**

Hoping that there is a way to get this done.

Ed Smith
  • 12,716
  • 2
  • 43
  • 55
Nerine
  • 63
  • 6
  • Sorry can you edit your question as I don't understand what the `**` are for. Also can you post your real raw input data if you are using a recent version of pandas then you should be able, if `INCIDENT_DATE` is already a datetime dtype to just do `crimeall['TIME'],crimeall['DATE']= crimeall['INCIDENT_DATE'].dt.time, crimeall['INCIDENT_DATE'].dt.date` – EdChum May 12 '15 at 13:10
  • If I understand your question correctly, your problem here is that your date strings are of an inconsitent format so when you construct a datetimeindex or use `pd.to_datetime` then any missing time portion (or date portion for that matter) a default value will be supplied, You could replace the `00:00:00` with `NaT` afterwards – EdChum May 12 '15 at 13:30
  • 1
    But it's problematic to assume `00:00:00` is `NaT` because it could very well be valid. Figuring out which values are valid needs to be done **before** converting to datetime. – JohnE May 12 '15 at 15:59

2 Answers2

1

Add ambiguous =‘NaT’ to pd.DatetimeIndex. If that doesn't work, you could always patch the values using something like

crimeall['TIME'] = [np.NaN if t.isoformat()=='00:00:00' else t for t in crimeall['TIME']]
Ed Smith
  • 12,716
  • 2
  • 43
  • 55
  • Hello EdChurn and Ed Smith. Thank you for your comments and sorry for been a bit unclear, I am new to python & pandas. You are right that the date strings are inconsitant and that when constructing the datetimeindex, default values (missing values) are given as 00:00:00. I cannot replace these with 'NaT's after, as I also have real time values of 00:00:00. I will also be splitting them to 'Date' and 'Time' if that might make a difference. The '**' was only to draw attention to the particular instance's. Sorry for confusion. Thank you again for your help. – Nerine May 12 '15 at 14:45
  • Also the ** refers to the datatime column where the time is missing. – Nerine May 12 '15 at 14:46
  • Not very elegant but if you start with a string column, you can create a variable in a first step with value 1 when only the date is provided e.g: `crimeall['HOUR_MISSING'] = crimeall['INCIDENT_DATE'].apply(lambda x: 1 if re.match('^[0-9]{2}/[0-9]{2}/[0-9]{4}$', x) else 0)` – etna May 12 '15 at 15:01
0

I don't believe there is any way to have a datetime-like column that is part valid and part NaN. Note that a datetime is essentially a format on top of an integer, and an integer can't be half valid and half missing (a little more on this below).

Anyway, I would just make a new column for the time than includes NaNs. Starting with the following, where 'raw_dt' is your raw data and 'formatted_dt' is a proper datetime:

                   raw_dt        formatted_dt
0  12/31/2006 11:20:00 PM 2006-12-31 23:20:00
1  12/31/2006 11:30:00 PM 2006-12-31 23:30:00
...
7   12/31/2006 4:45:00 PM 2006-12-31 16:45:00
8  12/31/2006 11:50:00 PM 2006-12-31 23:50:00
9              01/01/2007 2007-01-01 00:00:00

I would create a mask, something like this:

df['valid_time'] = df.raw_dt.str.contains(':')

which ought to work fine here and you could use a regex if you need something more sophisticated. Then create a new time column.

df['time'] = df.ix[df['valid_time'],'formatted_dt'].dt.time

                   raw_dt        formatted_dt valid_time      time
0  12/31/2006 11:20:00 PM 2006-12-31 23:20:00       True  23:20:00
1  12/31/2006 11:30:00 PM 2006-12-31 23:30:00       True  23:30:00
...
7   12/31/2006 4:45:00 PM 2006-12-31 16:45:00       True  16:45:00
8  12/31/2006 11:50:00 PM 2006-12-31 23:50:00       True  23:50:00
9              01/01/2007 2007-01-01 00:00:00      False       NaN

From there you can format however you like, for example:

df.formatted_dt.dt.date.map(str) + df.time.map(str).str.rjust(9)

0    2006-12-31 23:20:00
1    2006-12-31 23:30:00
...
7    2006-12-31 16:45:00
8    2006-12-31 23:50:00
9    2007-01-01      nan

To briefly expand on what a datetime is, take a look here and note that you can do this as a peek at what datetime really is (nano-seconds since Jan 1, 1970):

df.formatted_dt.astype(np.int64)

0    1167607200000000000
1    1167607800000000000
...
7    1167583500000000000
8    1167609000000000000
9    1167609600000000000
Community
  • 1
  • 1
JohnE
  • 29,156
  • 8
  • 79
  • 109