9

Assume I have a timestamp column of datetime in a pandas.DataFrame. For the sake of example, the timestamp is in seconds resolution. I would like to bucket / bin the events in 10 minutes [1] buckets / bins. I understand that I can represent the datetime as an integer timestamp and then use histogram. Is there a simpler approach? Something built in into pandas?

[1] 10 minutes is only an example. Ultimately, I would like to use different resolutions.

Dror
  • 12,174
  • 21
  • 90
  • 160
  • 1
    This will probably get you close: `df.groupby(pd.TimeGrouper(freq='10Min')).mean().plot(kind="bar")` You can replace "bar" with "hist" but I'm not sure if that makes a lot of sense. I'm guessing the y-axis is supposed to be frequency but what is the x-axis supposed to be? Do you have an example of the original data and an example of what the plot should look like (even if its just a verbal description) – johnchase Jan 15 '16 at 21:32

1 Answers1

21

To use custom frequency like "10Min" you have to use a TimeGrouper -- as suggested by @johnchase -- that operates on the index.

# Generating a sample of 10000 timestamps and selecting 500 to randomize them
df = pd.DataFrame(np.random.choice(pd.date_range(start=pd.to_datetime('2015-01-14'),periods = 10000, freq='S'), 500),  columns=['date'])
# Setting the date as the index since the TimeGrouper works on Index, the date column is not dropped to be able to count
df.set_index('date', drop=False, inplace=True)
# Getting the histogram
df.groupby(pd.TimeGrouper(freq='10Min')).count().plot(kind='bar')

enter image description here

Using to_period

It is also possible to use the to_period method but it does not work -- as far as I know -- with custom period like "10Min". This example take an additional column to simulate the category of an item.

# The number of sample
nb_sample = 500
# Generating a sample and selecting a subset to randomize them
df = pd.DataFrame({'date': np.random.choice(pd.date_range(start=pd.to_datetime('2015-01-14'),periods = nb_sample*30, freq='S'), nb_sample),
                  'type': np.random.choice(['foo','bar','xxx'],nb_sample)})

# Grouping per hour and type
df = df.groupby([df['date'].dt.to_period('H'), 'type']).count().unstack()
# Droping unnecessary column level
df.columns = df.columns.droplevel()
df.plot(kind='bar')

enter image description here

Romain
  • 19,910
  • 6
  • 56
  • 65
  • This is getting me closer. Thanks. Two issues I still have: 1) the x-axis ticks are not related to the datetime nature of the data and 2) shouldn't the "sum of bars" be 500? – Dror Jan 18 '16 at 08:15
  • Shouldn't it be `.plot(kind='bar')` instead of `.hist()` as suggested by @johnchase ? – Dror Jan 18 '16 at 08:42
  • Sorry I made a big mistake in my first answer (going too fast is not the solution). I've just edited it and think that it solves your problem now. And the ``sum`` is 500 now :-) – Romain Jan 18 '16 at 13:24
  • I actually like the solution with `dt.to_period` better. Forcing the index to be the timestamp is a big constrain. – Dror Jan 18 '16 at 13:56
  • Here is a notebook with some examples http://nbviewer.jupyter.org/gist/drorata/e58b673fd87edfc92960 – Dror Jan 18 '16 at 13:58
  • I just saw your example and modified my answer to propose a clever way to plot histogram with several series. – Romain Jan 18 '16 at 15:26
  • This answer is really useful. Thank you. – stackoverflowuser2010 Feb 03 '18 at 20:05