I am fairly new to this so please bear with me. I have a df where the index is in datetime format. My other columns are concentration and a count column that just consists of 1s.

Timestamp            Concentration  Count
2018-01-01 08:07:00      32.675305      1
2018-01-01 08:20:00      22.816844      1
2018-01-01 08:28:00      17.183438      1
2018-01-01 08:37:00      18.591789      1

I want to clean up the df by only including data where there are at least 3 concentration values recorded in the hour.

I tried resampling by hour and then taking the sum of the count column, which shows whether the number of data points per hour meets the threshold of 3. Then I can get rid of rows where the count is less than 3.

df2 = df.resample('H').sum()
df3 = df2[~(df2['Count'] < 3)]

From here, though, the concentrations have also been summed, which I don't want. Is there a way to get back to the data before I resampled, but without the purged rows?

Is there another way to do this that would work better?

sid
    Can you please provide some example data in your question? https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – Tom McLean Jun 25 '21 at 14:00
  • If you create a new column from your index, truncate it to the hour level, then groupby it, you'll be able to do a count. Selecting the values with a count less than 3, select the affected indicies and drop them. – ifly6 Jun 25 '21 at 14:07
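The approach described in the last comment can be sketched as follows (a minimal sketch; the column names and sample values are taken from the question's example data):

```python
import pandas as pd

# Sample data in the question's shape: datetime index, Concentration, Count.
df = pd.DataFrame(
    {"Concentration": [32.675305, 22.816844, 17.183438, 18.591789],
     "Count": [1, 1, 1, 1]},
    index=pd.to_datetime(["2018-01-01 08:07:00", "2018-01-01 08:20:00",
                          "2018-01-01 08:28:00", "2018-01-01 09:37:00"]),
)

df["hour"] = df.index.floor("h")               # truncate timestamps to the hour
counts = df.groupby("hour")["Count"].count()   # measurements per hour
sparse_hours = counts[counts < 3].index        # hours below the threshold
df = df[~df["hour"].isin(sparse_hours)].drop(columns="hour")
```

With the sample above, the lone 09:00 measurement is dropped and the three 08:00 rows survive, with their original (unsummed) concentrations intact.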

2 Answers


You should be able to join onto your resampled dataframe (using floored-down timestamps as the key) to add a column specifying whether there were at least 3 measurements in the hour of that record. For example:

df = df.set_index('Timestamp')
df2 = df.resample('H').sum()
df['floor'] = df.index.floor('H')
df = df.join((df2['Count'] >= 3).rename('keep'), on='floor')

The result will have a True/False column named 'keep' which you can then use to keep or discard the records you need.
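Putting the whole recipe together, including the final filtering step (a minimal sketch; the sample data and the `floor`/`keep` column names are assumptions matching the question and answer above):

```python
import pandas as pd

# Sample data in the question's shape.
df = pd.DataFrame(
    {"Concentration": [32.675305, 22.816844, 17.183438, 18.591789],
     "Count": [1, 1, 1, 1]},
    index=pd.to_datetime(["2018-01-01 08:07:00", "2018-01-01 08:20:00",
                          "2018-01-01 08:28:00", "2018-01-01 09:37:00"]),
)

df2 = df.resample("h").sum()           # per-hour totals (Count = rows per hour)
df["floor"] = df.index.floor("h")      # join key: the hour of each row
df = df.join((df2["Count"] >= 3).rename("keep"), on="floor")
df = df[df["keep"]].drop(columns=["floor", "keep"])  # keep qualifying rows only
```

Only the three 08:00 rows remain, with their original concentrations untouched.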

Simon Bowly

You can combine several operations into one step (I had to add extra rows to illustrate):

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(
"""Timestamp    Concentration   Count
2018-01-01 08:07:00   32.675305   1
2018-01-01 08:20:00   22.816844   1
2018-01-01 08:28:00   17.183438   1
2018-01-01 09:37:00   18.591789   1
2018-01-01 09:37:00   18.591789   1"""), sep=r"\s\s+", parse_dates=["Timestamp"],
    engine="python")



>>> df[df.groupby(df.Timestamp.dt.floor('h'))['Count'].transform('count') > 2]
            Timestamp  Concentration  Count
0 2018-01-01 08:07:00      32.675305      1
1 2018-01-01 08:20:00      22.816844      1
2 2018-01-01 08:28:00      17.183438      1

As already mentioned in the earlier answer, the groupby should be performed on the Timestamp column floored to the hour. Then use transform to broadcast each group's hourly count back onto its rows. Finally, keep only the rows where that count is greater than 2.
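For comparison, the same filtering can be done with `groupby(...).filter`, which keeps whole groups satisfying a predicate (a hedged alternative, not the answer's own method; `filter` is typically slower than `transform` on large frames):

```python
import pandas as pd
from io import StringIO

# Same sample data as above (note the extra 09:37 rows).
df = pd.read_csv(StringIO(
"""Timestamp    Concentration   Count
2018-01-01 08:07:00   32.675305   1
2018-01-01 08:20:00   22.816844   1
2018-01-01 08:28:00   17.183438   1
2018-01-01 09:37:00   18.591789   1
2018-01-01 09:37:00   18.591789   1"""), sep=r"\s\s+", parse_dates=["Timestamp"],
    engine="python")

# Keep only the hours that have more than 2 measurements.
result = df.groupby(df.Timestamp.dt.floor("h")).filter(lambda g: len(g) > 2)
```

As with `transform`, only the three 08:00 rows survive.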

crayxt
  • Nice. I like the use of `transform` here - IIUC it expands the result of agg/sum/count/whatever for groupby onto the original dataframe? – Simon Bowly Jun 26 '21 at 00:50
  • Yes, one value per group in groupby and preserves the shape of original dataframe. – crayxt Jun 26 '21 at 04:00
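The `transform` behavior discussed in these comments can be seen in a tiny demonstration (hypothetical toy data):

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "x": [1, 2, 3]})

# An aggregation collapses to one value per group...
print(df.groupby("g")["x"].count().tolist())              # [2, 1]
# ...while transform broadcasts each group's value back to
# every row, preserving the original dataframe's shape.
print(df.groupby("g")["x"].transform("count").tolist())   # [2, 2, 1]
```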