I have data in a csv file that contains the following fields:

  • user_id
  • date_created
  • date_edited
  • date_finalised
  • date_withdrawn

The user_id and date_created fields will never contain null values, but the other columns frequently will.

An example:

user_id, date_created, date_edited, date_finalised, date_withdrawn
1, 2013-01-31 00:17:01, null, 2013-02-02 14:11:17, null
2, 2013-01-31 01:00:15, 2013-01-31 01:00:30, null, null

I would like to end up with a DataFrame containing a count of records for each datetime column, binned by a given date period, i.e. daily, hourly, or minutely.

Using the example above with daily frequency, I would see:

date, date_created, date_edited, date_finalised, date_withdrawn
2013-01-31, 2, 1, 0, 0
2013-02-01, 0, 0, 0, 0
2013-02-02, 0, 0, 1, 0

After importing the file with

data = pd.read_csv('filename.csv')

What steps are required to achieve this?

Emil

1 Answer

First, ensure that each date column is a datetime64 column (in most cases you should be able to use parse_dates in the read_csv, but here you may have to use to_datetime with errors='coerce'):

In [11]: df
Out[11]: 
   user_id        date_created         date_edited      date_finalised date_withdrawn
0        1 2013-01-31 00:17:01                 NaT 2013-02-02 14:11:17            NaT
1        2 2013-01-31 01:00:15 2013-01-31 01:00:30                 NaT            NaT
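For this example, the whole import-and-parse step can be sketched as follows (a sketch, not the only way: `io.StringIO` stands in for the real file, `na_values` declares the literal string "null" as missing, and `errors='coerce'` is the newer-pandas spelling of coercion):

```python
import io

import pandas as pd

# Stand-in for filename.csv; skipinitialspace handles the space
# after each comma, na_values maps the literal "null" to NaN
csv = io.StringIO(
    "user_id, date_created, date_edited, date_finalised, date_withdrawn\n"
    "1, 2013-01-31 00:17:01, null, 2013-02-02 14:11:17, null\n"
    "2, 2013-01-31 01:00:15, 2013-01-31 01:00:30, null, null\n"
)
df = pd.read_csv(csv, skipinitialspace=True, na_values=["null"])

# Coerce each date column to datetime64; unparseable entries become NaT
for col in df.columns:
    if "date" in col:
        df[col] = pd.to_datetime(df[col], errors="coerce")
```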

Now you can take the value count for each date e.g. for the date_created column:

In [12]: pd.value_counts(pd.DatetimeIndex(df.date_created).normalize())
Out[12]: 
2013-01-31    2
dtype: int64
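The question also asks for hourly and minutely bins. normalize() truncates timestamps to midnight; DatetimeIndex.floor (available in newer pandas) generalizes the same idea to finer frequencies. A sketch using the two date_created values from the example:

```python
import pandas as pd

created = pd.DatetimeIndex([
    "2013-01-31 00:17:01",
    "2013-01-31 01:00:15",
])

daily = created.normalize().value_counts()      # truncate to midnight
hourly = created.floor("h").value_counts()      # truncate to the hour
minutely = created.floor("min").value_counts()  # truncate to the minute
```

Both rows fall on the same day, so `daily` has a single entry of 2, while `hourly` and `minutely` each split them into two bins of 1.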

And concat these Series together:

In [13]: pd.concat((pd.value_counts(pd.DatetimeIndex(df[col]).normalize()) for col in df.columns if 'date' in col), axis=1).fillna(0)
Out[13]: 
            0  1  2  3
NaT         0  1  1  2
2013-01-31  2  1  0  0
2013-02-02  0  0  1  0

Note: you can .drop(pd.NaT) if you don't want this row.
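To match the desired output exactly (the all-zero 2013-02-01 row and the original column names), the concatenated counts can be reindexed over a full daily range. A sketch, assuming the columns are already datetime64; note that with a recent pandas, value_counts drops NaT by default, so no explicit .drop(pd.NaT) is needed:

```python
import pandas as pd

df = pd.DataFrame({
    "date_created": pd.to_datetime(["2013-01-31 00:17:01", "2013-01-31 01:00:15"]),
    "date_edited": pd.to_datetime([None, "2013-01-31 01:00:30"]),
    "date_finalised": pd.to_datetime(["2013-02-02 14:11:17", None]),
    "date_withdrawn": pd.to_datetime([None, None]),
})

date_cols = [c for c in df.columns if "date" in c]

# One daily count per column; NaT values are dropped by value_counts
counts = pd.concat(
    (pd.DatetimeIndex(df[col]).normalize().value_counts() for col in date_cols),
    axis=1,
).fillna(0).astype(int)
counts.columns = date_cols

# Fill in days on which nothing happened at all (e.g. 2013-02-01)
full_range = pd.date_range(counts.index.min(), counts.index.max(), freq="D")
counts = counts.reindex(full_range, fill_value=0)
```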

Unfortunately, read_csv's parse_dates=['date_created', 'date_edited', 'date_finalised', 'date_withdrawn'] doesn't coerce with this small example (too many nulls), so one workaround is something like:

for col in df.columns:
    if 'date' in col:
        # errors='coerce' turns unparseable values into NaT
        # (it replaces the older coerce=True keyword)
        df[col] = pd.to_datetime(df[col], errors='coerce')
Andy Hayden
    Thanks. That worked a treat :-). I also appreciate the clarity of your answer it helped me _understand_ each stage rather than just replicate it. – Emil Jan 25 '14 at 22:15