
I want to break a list of datetimes down into 15-minute (or maybe 10- or 30-minute) buckets, and count how many objects fall into each bucket.

The ideal output is a list of integers, each item being the count for one 15-minute bucket, ordered from the earliest bucket to the latest.

The actual dates and times themselves are not important in this application.

The datetimes are Tweet creation datetimes, and are in Twitter's native string format ("%a %b %d %H:%M:%S +0000 %Y"), as seen in the data snippet below.

(It's no problem to convert them to Unix time or whatever's most convenient, if it helps)
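(For reference, a single one of those strings parses fine with the standard library; a quick sketch using the first string from the snippet below:)

from datetime import datetime

s = 'Wed Jul 07 07:39:41 +0000 2021'
dt = datetime.strptime(s, "%a %b %d %H:%M:%S +0000 %Y")  # naive datetime; the literal "+0000" is matched as-is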

data snippet:

['Wed Jul 07 07:39:41 +0000 2021',
 'Wed Jul 07 09:25:06 +0000 2021',
 'Wed Jul 07 10:12:24 +0000 2021',
 'Wed Jul 07 12:03:36 +0000 2021',
 'Wed Jul 07 12:51:56 +0000 2021',
 'Thu Jul 08 18:01:02 +0000 2021',
 'Thu Jul 08 18:02:01 +0000 2021',
 'Thu Jul 08 18:02:40 +0000 2021',
 'Thu Jul 08 18:03:45 +0000 2021',
 'Thu Jul 08 18:04:10 +0000 2021',
 'Thu Jul 08 18:16:05 +0000 2021',
 'Thu Jul 08 18:17:40 +0000 2021',
 'Thu Jul 08 18:22:04 +0000 2021',
 'Thu Jul 08 18:23:02 +0000 2021',
 'Thu Jul 08 18:24:34 +0000 2021',
 'Thu Jul 08 21:07:36 +0000 2021',
 'Fri Jul 09 07:31:41 +0000 2021',
 'Fri Jul 09 07:45:14 +0000 2021',
 'Fri Jul 09 08:37:09 +0000 2021',
 'Fri Jul 09 09:32:22 +0000 2021',
 'Fri Jul 09 10:49:53 +0000 2021',
 'Fri Jul 09 11:33:48 +0000 2021',
 'Fri Jul 09 11:35:02 +0000 2021',
 'Fri Jul 09 11:35:43 +0000 2021',
 'Fri Jul 09 12:41:08 +0000 2021',
 'Fri Jul 09 12:41:37 +0000 2021',
 'Fri Jul 09 12:42:38 +0000 2021',
 'Fri Jul 09 13:26:51 +0000 2021',
 'Fri Jul 09 13:41:18 +0000 2021',
 'Fri Jul 09 13:45:51 +0000 2021',
 'Fri Jul 09 14:03:37 +0000 2021',
 'Fri Jul 09 17:59:09 +0000 2021',
 'Fri Jul 09 19:36:01 +0000 2021',
 'Fri Jul 09 19:40:46 +0000 2021',
 'Sat Jul 10 08:34:06 +0000 2021',
 ...
 ]

I suppose I could convert all the datetimes to Unix time and write a loop to chunk them into 900-second buckets, but that seems clunky when pandas appears to have built-ins for this sort of thing.
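(For the record, the clunky version I had in mind would look roughly like this; a sketch rather than tested code, where `x` is the list of strings above:)

from datetime import datetime, timezone

FMT = "%a %b %d %H:%M:%S +0000 %Y"
BUCKET = 900  # seconds, i.e. 15 minutes

# parse each string, mark it as UTC, and convert to a Unix timestamp
stamps = sorted(
    datetime.strptime(s, FMT).replace(tzinfo=timezone.utc).timestamp()
    for s in x
)

start = stamps[0] - (stamps[0] % BUCKET)             # align the first bucket to a 15-minute boundary
n_buckets = int((stamps[-1] - start) // BUCKET) + 1
counts = [0] * n_buckets
for t in stamps:
    counts[int((t - start) // BUCKET)] += 1          # increment the bucket this tweet falls into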

(I've seen e.g. "Pandas resample timeseries data to 15 mins and 45 mins - using multi-index or column", and the pandas docs themselves, e.g. https://pandas.pydata.org/docs/reference/api/pandas.Series.resample.html?highlight=resample#pandas.Series.resample)

So I've had a go and so far I've got what's below, but now I'm stuck and need some help.

(I'm not a professional programmer and this isn't coursework or homework, though I have written quite a lot of simple Python over the past couple of years. For completeness' sake: the purpose here is to create data that can drive synths (soft or hard) to produce sonic representations of Twitter user timelines, and I'm just tinkering with the most basic thing I can think of to start with.)


# where "x" is a list of datetimes as above

df = pd.DataFrame(x, columns=["created_at"])

df["cti"] = pd.to_datetime(df["created_at"])

dfrs = df.set_index("cti")

qbert = dfrs["created_at"].resample("15T").sum()

print(qbert)

I thought from my reading of the pandas docs etc. that this would give me an output with summary counts for each bucket (but I'm likely to have misunderstood or misinterpreted: I'm not a "natural" coder).

But the output I get is this:

cti
2021-07-07 07:30:00+00:00                       Wed Jul 07 07:39:41 +0000 2021
2021-07-07 07:45:00+00:00                                                    0
2021-07-07 08:00:00+00:00                                                    0
2021-07-07 08:15:00+00:00                                                    0
2021-07-07 08:30:00+00:00                                                    0
                                                   ...                        
2021-08-05 13:45:00+00:00                       Thu Aug 05 13:58:07 +0000 2021
2021-08-05 14:00:00+00:00    Thu Aug 05 14:02:32 +0000 2021Thu Aug 05 14:05...
2021-08-05 14:15:00+00:00    Thu Aug 05 14:20:49 +0000 2021Thu Aug 05 14:23...
2021-08-05 14:30:00+00:00    Thu Aug 05 14:30:59 +0000 2021Thu Aug 05 14:31...
2021-08-05 14:45:00+00:00    Thu Aug 05 14:45:56 +0000 2021Thu Aug 05 14:52...
Freq: 15T, Name: created_at, Length: 2814, dtype: object

So this isn't what I was expecting, but I'm unsure where I've gone wrong, or whether I've even selected an appropriate approach for what I want to do.


1 Answer


You almost had it, but `sum` will concatenate the strings. You need to count instead:

qbert = dfrs["created_at"].resample("15T").count()
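If you then want the plain list of integers described in the question, here is a sketch putting the whole thing together, reusing the question's variable names (empty buckets come out as 0):

import pandas as pd

# x is the list of Twitter-format datetime strings from the question
df = pd.DataFrame(x, columns=["created_at"])
df["cti"] = pd.to_datetime(df["created_at"])        # parses to tz-aware timestamps
dfrs = df.set_index("cti")

qbert = dfrs["created_at"].resample("15T").count()  # tweets per 15-minute bucket
counts = qbert.tolist()                             # plain list of ints, earliest bucket first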
mozway
  • Ah thanks; the pandas docs seem to show using sum() for the output I now get with count(), but maybe I'm misunderstanding. – redacted code Aug 08 '21 at 09:08
  • `resample` acts similarly to grouping: it returns a Resampler object that expects a method to produce something useful. You can pass various functions, including `sum` and `count`. – mozway Aug 08 '21 at 09:13
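To illustrate that last comment, a small sketch reusing `dfrs` from the question:

r = dfrs["created_at"].resample("15T")  # a Resampler object; no aggregation has happened yet
print(r.count().head())                 # counts per 15-minute bucket
print(r.sum().head())                   # on string data, sum() concatenates, hence the output in the question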