I am fairly new to this so please bear with me. I have a df where the index is in datetime format. My other columns are concentration and a count column that just consists of 1s.

Timestamp            Concentration  Count
2018-01-01 08:07:00      32.675305      1
2018-01-01 08:20:00      22.816844      1
2018-01-01 08:28:00      17.183438      1
2018-01-01 08:37:00      18.591789      1

I want to clean up the df by only including data where there are at least 3 concentration values recorded in the hour.

I tried resampling by hour and then taking the sum of the count column, which shows whether the number of data points per hour meets the threshold of 3. Then I can get rid of rows where the count is less than 3.

df2 = df.resample('H').sum()
df3 = df2[~(df2['Count'] < 3)]

From here, though, the concentrations have also been summed, which I don't want. Is there a way to get back to the data before I resampled, but without the purged rows?

Is there another way to do this that would work better?

sid
    Can you please provide some example data in your question? https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – Tom McLean Jun 25 '21 at 14:00
  • If you create a new column from your index, truncate it to the hour level, then groupby it, you'll be able to do a count. Selecting the values with a count less than 3, select the affected indicies and drop them. – ifly6 Jun 25 '21 at 14:07
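The approach described in the last comment can be sketched as follows (a minimal sketch; the column names and sample values are taken from the question's example data):

```python
import pandas as pd

# Sample data in the question's shape: datetime index, Concentration, Count.
df = pd.DataFrame(
    {"Concentration": [32.675305, 22.816844, 17.183438, 18.591789],
     "Count": [1, 1, 1, 1]},
    index=pd.to_datetime(["2018-01-01 08:07:00", "2018-01-01 08:20:00",
                          "2018-01-01 08:28:00", "2018-01-01 09:37:00"]),
)

df["hour"] = df.index.floor("h")               # truncate timestamps to the hour
counts = df.groupby("hour")["Count"].count()   # measurements per hour
sparse_hours = counts[counts < 3].index        # hours below the threshold
df = df[~df["hour"].isin(sparse_hours)].drop(columns="hour")
```

With the sample above, the lone 09:00 measurement is dropped and the three 08:00 rows survive, with their original (unsummed) concentrations intact.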

2 Answers


You should be able to join onto your resampled dataframe (using floored-down timestamps as the key) to add a column specifying whether there were at least 3 measurements in the hour of that record. For example:

df = df.set_index('Timestamp')
df2 = df.resample('H').sum()
df['floor'] = df.index.floor('H')
df = df.join((df2['Count'] >= 3).rename('keep'), on='floor')

The result will have a True/False column named 'keep' which you can then use to keep or discard the records you need.
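Putting the whole recipe together, including the final filtering step (a minimal sketch; the sample data and the `floor`/`keep` column names are assumptions matching the question and answer above):

```python
import pandas as pd

# Sample data in the question's shape.
df = pd.DataFrame(
    {"Concentration": [32.675305, 22.816844, 17.183438, 18.591789],
     "Count": [1, 1, 1, 1]},
    index=pd.to_datetime(["2018-01-01 08:07:00", "2018-01-01 08:20:00",
                          "2018-01-01 08:28:00", "2018-01-01 09:37:00"]),
)

df2 = df.resample("h").sum()           # per-hour totals (Count = rows per hour)
df["floor"] = df.index.floor("h")      # join key: the hour of each row
df = df.join((df2["Count"] >= 3).rename("keep"), on="floor")
df = df[df["keep"]].drop(columns=["floor", "keep"])  # keep qualifying rows only
```

Only the three 08:00 rows remain, with their original concentrations untouched.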

Simon Bowly

You can combine several operations into one step (I had to add extra rows to illustrate):

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(
"""Timestamp    Concentration   Count
2018-01-01 08:07:00   32.675305   1
2018-01-01 08:20:00   22.816844   1
2018-01-01 08:28:00   17.183438   1
2018-01-01 09:37:00   18.591789   1
2018-01-01 09:37:00   18.591789   1"""), sep=r"\s\s+", parse_dates=["Timestamp"],
    engine="python")



>>> df[df.groupby(df.Timestamp.dt.floor('h'))['Count'].transform('count') > 2]
            Timestamp  Concentration  Count
0 2018-01-01 08:07:00      32.675305      1
1 2018-01-01 08:20:00      22.816844      1
2 2018-01-01 08:28:00      17.183438      1

As already mentioned in the earlier answer, the groupby should be performed on the Timestamp column floored to the hour. Then use transform to broadcast each group's hourly count back onto its rows. Finally, keep only the rows where that count is greater than 2.
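For comparison, the same filtering can be done with `groupby(...).filter`, which keeps whole groups satisfying a predicate (a hedged alternative, not the answer's own method; `filter` is typically slower than `transform` on large frames):

```python
import pandas as pd
from io import StringIO

# Same sample data as above (note the extra 09:37 rows).
df = pd.read_csv(StringIO(
"""Timestamp    Concentration   Count
2018-01-01 08:07:00   32.675305   1
2018-01-01 08:20:00   22.816844   1
2018-01-01 08:28:00   17.183438   1
2018-01-01 09:37:00   18.591789   1
2018-01-01 09:37:00   18.591789   1"""), sep=r"\s\s+", parse_dates=["Timestamp"],
    engine="python")

# Keep only the hours that have more than 2 measurements.
result = df.groupby(df.Timestamp.dt.floor("h")).filter(lambda g: len(g) > 2)
```

As with `transform`, only the three 08:00 rows survive.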

crayxt
  • Nice. I like the use of `transform` here - IIUC it expands the result of agg/sum/count/whatever for groupby onto the original dataframe? – Simon Bowly Jun 26 '21 at 00:50
  • Yes, one value per group in groupby and preserves the shape of original dataframe. – crayxt Jun 26 '21 at 04:00
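The `transform` behavior discussed in these comments can be seen in a tiny demonstration (hypothetical toy data):

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "x": [1, 2, 3]})

# An aggregation collapses to one value per group...
print(df.groupby("g")["x"].count().tolist())              # [2, 1]
# ...while transform broadcasts each group's value back to
# every row, preserving the original dataframe's shape.
print(df.groupby("g")["x"].transform("count").tolist())   # [2, 2, 1]
```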