
I have a pandas data frame like this:

timestamp           status  
2019-01-01 09:00:00 FAILED
2019-01-01 09:00:00 FAILED
2019-01-01 09:00:00 UNKNOWN
2019-01-01 09:00:00 PASSED
2019-01-01 09:00:00 PASSED
2019-01-01 09:01:00 PASSED
2019-01-01 09:01:00 FAILED 

How can I group the data per minute and count the occurrences of each status per minute to get this data frame:

timestamp           PASSED FAILED UNKNOWN   
2019-01-01 09:00:00 2      2      1
2019-01-01 09:01:00 1      1      0
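
For reference, a rough sketch to reproduce the sample frame (parsing the timestamps with pd.to_datetime is an assumption):

import pandas as pd

# sample data from the table above
df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2019-01-01 09:00:00'] * 5 + ['2019-01-01 09:01:00'] * 2),
    'status': ['FAILED', 'FAILED', 'UNKNOWN', 'PASSED', 'PASSED', 'PASSED', 'FAILED'],
})
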
charemma
    `pd.crosstab(df['timestamp'],df['status'])` – ansev Dec 04 '19 at 21:25
  • Are you ever going to have second values that you want grouped by minute? – user3483203 Dec 04 '19 at 21:26
  • Tried your suggestion and it seems to work, thanks a lot! In the original data, the timestamps will have multiple entries within a minute and thus will differ in the range of seconds. e.g. timestamp status 2019-01-01 09:00:01 FAILED 2019-01-01 09:00:02 FAILED 2019-01-01 09:00:10 UNKNOWN 2019-01-01 09:00:13 PASSED 2019-01-01 09:00:24 PASSED 2019-01-01 09:01:02 PASSED 2019-01-01 09:01:30 FAILED – charemma Dec 04 '19 at 21:35
  • If the answer is crosstab, the question is pivot – cs95 Dec 04 '19 at 22:03

3 Answers


Method 1:

pd.crosstab(df['timestamp'],df['status'])

status               FAILED  PASSED  UNKNOWN
timestamp                                   
2019-01-01 09:00:00       2       2        1
2019-01-01 09:01:00       1       1        0

If you want `timestamp` as a column:

pd.crosstab(df['timestamp'],df['status'],colnames=[None]).reset_index()

             timestamp  FAILED  PASSED  UNKNOWN
0  2019-01-01 09:00:00       2       2        1
1  2019-01-01 09:01:00       1       1        0

Method 2:

df.groupby(['timestamp','status']).size().unstack(fill_value=0)
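
If the real timestamps carry seconds (as mentioned in the comments), a sketch of the same idea, assuming the timestamp column has a datetime dtype, is to floor each timestamp to the minute before grouping:

# floor each timestamp to the minute, then count the statuses per minute
df.groupby([df['timestamp'].dt.floor('min'), 'status']).size().unstack(fill_value=0)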

Time comparison:

It seems that method 2 is the fastest.

%%timeit
new_df=pd.crosstab(df['timestamp'],df['status'])
21 ms ± 759 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
new_df=df.groupby(['timestamp','status']).size().unstack(fill_value=0)
4.65 ms ± 290 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df2 = (
    df
    .groupby(df['timestamp'].map(lambda x: x.replace(second=0)))['status']
    .value_counts()
    .unstack()
    .fillna(0)
    .astype(int)
    .reset_index()
)

8.5 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
ansev

In the event that the timestamps have seconds, you can first remove them to group on whole minutes.

df2 = (
    df
    .groupby(df['timestamp'].map(lambda x: x.replace(second=0)))['status']
    .value_counts()
    .unstack(fill_value=0)
    .reset_index()
)
>>> df2
status           timestamp  FAILED  PASSED  UNKNOWN
0      2019-01-01 09:00:00       2       2        1
1      2019-01-01 09:01:00       1       1        0

You may also wish to fill in every minute in the range. Same code as above, but don't reset the index at the end. Then:

df2 = df2.reindex(pd.date_range(df2.index[0], df2.index[-1], freq='1min'), fill_value=0)
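
Put together, that variant looks roughly like this (the same code as above, just without the final reset_index):

# group on whole minutes, keep the timestamp index, then fill in the missing minutes
df2 = (
    df
    .groupby(df['timestamp'].map(lambda x: x.replace(second=0)))['status']
    .value_counts()
    .unstack(fill_value=0)
)
df2 = df2.reindex(pd.date_range(df2.index[0], df2.index[-1], freq='1min'), fill_value=0)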

Timings

Timings will certainly vary based on the dataset (small vs. large, heterogeneous vs. homogeneous data, etc.). Given that the dataset is basically a log, one would expect a lot of data with high variation in the timestamps. To create more suitable test data, let's make the sample dataframe 100k times larger and then make the timestamps unique (one per minute).

df_ = pd.concat([df] * 100000)
df_['timestamp'] = pd.date_range(df_.timestamp.iat[0], periods=len(df_), freq='1min')

And here are the new timings:

%timeit pd.crosstab(df_['timestamp'],df_['status'])
# 4.27 s ± 150 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df_.groupby(['timestamp','status']).size().unstack(fill_value=0)
# 567 ms ± 34.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
(
    df_
    .groupby(['timestamp', 'status'])
    .size()
    .unstack(fill_value=0)
    .reset_index()
)
# 614 ms ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
(
    df_
    .groupby(df_['timestamp'].map(lambda x: x.replace(second=0)))['status']
    .value_counts()
    .unstack(fill_value=0)
    .reset_index()
)
# 147 ms ± 6.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Alexander
  • I think `unstack` has the option `fill_value`, which saves you the `fillna().astype()`. – Quang Hoang Dec 04 '19 at 21:55
  • And generally, `groupby().value_counts().unstack()` has better performance than other approaches. +1 – Quang Hoang Dec 04 '19 at 21:56
  • In general that is the same as grouping by both columns; see my second method @Quang Hoang – ansev Dec 04 '19 at 22:10
  • @ansev in my experience, that’s not true if you have a lot of combinations of the two columns. – Quang Hoang Dec 04 '19 at 22:12
  • @ansev How many rows does your sample dataframe contain? Just the 7 in the toy example? Also, I don't believe your method accounts for grouping on minutes where the timestamps contain seconds. – Alexander Dec 04 '19 at 22:13
  • Yes, it is true that for data frames of more than 400,000 rows, `value_counts` may be more optimal if this solution is improved; otherwise it will be even slower – ansev Dec 04 '19 at 22:21
  • @ansev Try your timings with more variation in your timestamp column on the larger sample dataframe, e.g. `df_ = pd.concat([df] * 100000)` and then `df_['timestamp'] = pd.date_range(df_.timestamp.iat[0], periods=len(df_), freq='1min')` – Alexander Dec 04 '19 at 22:31
  • As I said, I agree that for large sizes, and if this solution is optimized (without that it worsens with the size of the data frame), `value_counts` is better – ansev Dec 04 '19 at 22:33
  • What do you mean "worsens with the size of the data frame"? – Alexander Dec 04 '19 at 22:38

This will work:

df.groupby(['timestamp', 'status']).size().unstack(level=1)
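
Note that (timestamp, status) combinations which never occur come out as NaN rather than 0; passing fill_value to unstack (a small tweak on top of the line above, not part of the original answer) reproduces the zeros from the desired output:

# fill missing (timestamp, status) combinations with 0 instead of NaN
df.groupby(['timestamp', 'status']).size().unstack(level=1, fill_value=0)
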
Brandon