1

I am using pandas to process some data from csv files.

I need to sort the data in my DataFrame df by the column MEETING START TIME, just sorting the time. The date is handled by another field.

But the result I get is:

MEETING START TIME
10:30 AM
12:30 PM
2:00 PM
4:00 PM
9:15 AM
9:15 AM

Any meeting time in the morning with single-digit hours goes as the end. Do I need to do something to the date format, or the sort command?

maxymoo
  • 35,286
  • 11
  • 92
  • 119
mattrweaver
  • 729
  • 4
  • 14
  • 36
  • 1
    A's the answer below points out, you cannot sort datetime when they are strings, they must be formatted as datetime objects.. – dartdog Aug 05 '15 at 01:08
  • possible duplicate of [Sort Pandas Dataframe by Date](http://stackoverflow.com/questions/28161356/sort-pandas-dataframe-by-date) – LondonRob Aug 06 '15 at 12:42

2 Answers2

3

You can use pd.datetools.parse to try and convert your date column (currently a string) into a datetime object; then you should be able to sort:

df = pd.DataFrame({'MEETING START TIME': ['10:30 AM','12:30 PM', '2:00 PM', '4:00 PM', '9:15 AM', '9:15 AM']})
df['MEETING START TIME'] = df['MEETING START TIME'].map(lambda x: pd.datetools.parse(x))
df.sort('MEETING START TIME')

Out[33]: 
   MEETING START TIME
5 2015-08-05 09:15:00
4 2015-08-05 09:15:00
0 2015-08-05 10:30:00
1 2015-08-05 12:30:00
2 2015-08-05 14:00:00
3 2015-08-05 16:00:00
maxymoo
  • 35,286
  • 11
  • 92
  • 119
  • would `pd.to_datetime` be quicker? – EdChum Aug 05 '15 at 07:56
  • ok, that works, Thanks. I need to strip out the date, which I'm doing with: df['MEETING START TIME'] = pd.DatetimeIndex(df['MEETING START TIME']).time Then I will convert it back to AM/PM as this is for signage and will be easier for people to read. – mattrweaver Aug 05 '15 at 13:54
  • 2
    @EdChum `%timeit pd.datetools.parse('10 AM') 10000 loops, best of 3: 39.7 µs per loop %timeit pd.to_datetime('10 AM') 1000 loops, best of 3: 159 µs per loop` – maxymoo Aug 05 '15 at 22:49
2

It would be interested to see if the approach of using map with datetools.parse is as scalable as the standard approach given here, here and here.

Let's make a really big Series of string-represented dates to find out:

In [11]: import datetime as dt
In [12]: format = '%d/%m/%Y %H:%M:%S'
In [13]: def random_date():
   ....:    rand_num = np.random.uniform(2e9)
   ....:    return dt.datetime.fromtimestamp(rand_num).strftime(format)
In [14]: dates = pd.Series([random_date() for i in range(100000)])
In [15]: dates.head() # Some random dates (as strings)
Out[15]: 
0    30/11/1988 15:11:08
1    08/05/2025 10:29:02
2    05/09/2017 02:24:46
3    18/03/2016 14:55:20
4    22/04/1984 04:58:06
dtype: object

Now let's time the two approaches:

In [33]: %timeit dates.map(lambda x: pd.datetools.parse(x))
1 loops, best of 3: 6.98 s per loop

In [2]: %timeit pd.to_datetime(dates,format=format)
1 loops, best of 3: 525 ms per loop

So there we have it. The unorthodox approach of @maxymoo is much slower than the accepted approach when the Series is long!

Community
  • 1
  • 1
LondonRob
  • 73,083
  • 37
  • 144
  • 201
  • 1
    you need to pass the format, e.g. ``In [36]: %timeit pd.to_datetime(dates,format=format) 1 loops, best of 3: 529 ms per loop``, otherwise its simply a dispatch to ``dateutil`` – Jeff Aug 12 '15 at 00:01
  • @Jeff That's embarrassing. Question rewritten with the correct timings (which I've double-checked myself). The `pd.to_datetime` method is blazingly quick. – LondonRob Aug 12 '15 at 16:52