
I have a data series indexed by an irregular DateTime column, with one numeric value column and three columns that each indicate whether a safety mechanism is activated to block the numeric value. Example:

DateTime        Safe1  Safe2  Safe3  Measurement
1/8/2013 6:06   N      Y      N
1/8/2013 6:23   N      Y      N
1/8/2013 6:40   N      N      N      28
1/8/2013 6:57   N      N      N      31

I need to resample the data with Pandas into clean half-hour intervals, taking the mean of whatever values exist in each interval. Of course, this drops the three safety string columns.

However, I would like to include a column that indicates Y if any of the safety mechanisms was activated at any point during the half-hour interval.

How do I get a string column in the resampled data showing Y whenever a Y was present among the three safety mechanism columns in the raw data, including intervals with no Measurement values?

Desired output based upon the above:

DateTime       Safe1  Measurement
1/8/2013 6:00  Y
1/8/2013 6:30  N      29.5
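For reference, a minimal sketch reproducing the sample frame (NaN standing in for the blocked, empty Measurement cells; all values taken from the table above):

```python
import numpy as np
import pandas as pd

# Reconstruction of the question's sample data; NaN marks the blocked cells.
df = pd.DataFrame(
    {
        "Safe1": ["N", "N", "N", "N"],
        "Safe2": ["Y", "Y", "N", "N"],
        "Safe3": ["N", "N", "N", "N"],
        "Measurement": [np.nan, np.nan, 28.0, 31.0],
    },
    index=pd.to_datetime(
        ["1/8/2013 6:06", "1/8/2013 6:23", "1/8/2013 6:40", "1/8/2013 6:57"]
    ),
)
print(df)
```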
– Joe Ferndz

3 Answers

I don't think it's possible to do what you want with the resample function alone, as it doesn't offer much customisation. Instead, we can use a TimeGrouper with a groupby operation.

First, let's create the data:

import pandas as pd

index = ['1/8/2013 6:06', '1/8/2013 6:23', '1/8/2013 6:40', '1/8/2013 6:57']

data = {'Safe1' : ['N', 'N', 'N', 'N'], 
        'Safe2': ['Y', 'Y', 'N', 'N'], 
        'Safe3': ['N', 'N', 'N', 'N'], 
        'Measurement': [0,0,28,31]}

df = pd.DataFrame(index=index, data=data)
df.index = pd.to_datetime(df.index)
df

output :

                     Measurement Safe1 Safe2 Safe3
2013-01-08 06:06:00            0     N     Y     N
2013-01-08 06:23:00            0     N     Y     N
2013-01-08 06:40:00           28     N     N     N
2013-01-08 06:57:00           31     N     N     N

Then let's add a helper column, called Safe, that will be a concatenation of all the Safex columns. If there's at least one Y in the Safe column, we'll know that the safety mechanism was activated.

df['Safe'] = df['Safe1'] + df['Safe2'] + df['Safe3']
print(df)

output :

                     Measurement Safe1 Safe2 Safe3 Safe
2013-01-08 06:06:00            0     N     Y     N  NYN
2013-01-08 06:23:00            0     N     Y     N  NYN
2013-01-08 06:40:00           28     N     N     N  NNN
2013-01-08 06:57:00           31     N     N     N  NNN

Finally, we define a custom function that returns Y if there's at least one Y in the strings passed to it.

That custom function is applied to the Safe column, after we have grouped by 30-minute intervals:

def func(x):
    x = ''.join(x.values)
    return 'Y' if 'Y' in x else 'N'

# pd.Grouper replaces the now-removed pd.TimeGrouper
df.groupby(pd.Grouper(freq='30Min')).agg({'Measurement': 'mean', 'Safe': func})

output :

                    Safe  Measurement
2013-01-08 06:00:00    Y          0.0
2013-01-08 06:30:00    N         29.5
– knightofni
  • pd.TimeGrouper was renamed pd.Grouper... and in my case the column agg-column names had to exist beforehand (compare https://stackoverflow.com/questions/60229375/solution-for-specificationerror-nested-renamer-is-not-supported-while-agg-alo). Other than that it still works like a charm :) – sltzgs Sep 16 '20 at 08:21
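Following up on that comment, here is a sketch of the same aggregation for newer pandas versions, using `resample` with a plain dict (which sidesteps both the removed `pd.TimeGrouper` and the nested-renamer error), assuming the concatenated `Safe` helper column already exists:

```python
import pandas as pd

# Same frame as above, with the Safe helper column already built.
df = pd.DataFrame(
    {
        "Measurement": [0, 0, 28, 31],
        "Safe": ["NYN", "NYN", "NNN", "NNN"],
    },
    index=pd.to_datetime(
        ["2013-01-08 06:06", "2013-01-08 06:23",
         "2013-01-08 06:40", "2013-01-08 06:57"]
    ),
)

def func(x):
    # 'Y' if any row in the interval contains a Y
    return "Y" if "Y" in "".join(x) else "N"

# Dict keys are existing column names, so no nested renaming is involved.
out = df.resample("30min").agg({"Measurement": "mean", "Safe": func})
print(out)
```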

Here's an answer using pandas built-in resample function.

First combine the 3 Safe values into a single column:

df['Safe'] = df.Safe1 + df.Safe2 + df.Safe3

Turn the 3-letter strings into a 0-1 variable:

df.Safe = df.Safe.apply(lambda x: 1 if 'Y' in x else 0)

Write a custom resampling function for the Safe column:

def f(x):
    return 'Y' if sum(x) > 0 else 'N'

Finally, resample:

df.resample('30T').Safe.agg(f).to_frame().join(df.resample('30T').Measurement.mean())

Output:

                    Safe  Measurement 
2013-01-08 06:00:00  Y        0.0
2013-01-08 06:30:00  N        29.5
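As a side note, the string concatenation can be skipped entirely with a vectorized boolean reduction - a sketch on the same four sample rows:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Safe1": ["N", "N", "N", "N"],
        "Safe2": ["Y", "Y", "N", "N"],
        "Safe3": ["N", "N", "N", "N"],
        "Measurement": [0, 0, 28, 31],
    },
    index=pd.to_datetime(
        ["2013-01-08 06:06", "2013-01-08 06:23",
         "2013-01-08 06:40", "2013-01-08 06:57"]
    ),
)

# True when any safety column reads 'Y' on that row
any_safe = df[["Safe1", "Safe2", "Safe3"]].eq("Y").any(axis=1)

out = df.resample("30min").Measurement.mean().to_frame()
# max() of booleans per interval answers "was there any Y?"; map back to Y/N
out["Safe"] = any_safe.resample("30min").max().map({True: "Y", False: "N"})
print(out)
```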
– Alicia

I manually resample the dates (easy if it's just rounding)...

Here is an example

from datetime import datetime, timedelta
from random import randint, seed
from tabulate import tabulate
import pandas as pd

def df_to_md(df):
    print(tabulate(df, tablefmt="pipe",headers="keys"))

seed(42)

people=['tom','dick','harry']
avg_score=[90,50,10]
date_times=[n for n in pd.date_range(datetime.now()-timedelta(days=2),datetime.now(),freq='5 min').values]
scale=1+int(len(date_times)/len(people))
score =[randint(i,100)*i/10000 for i in avg_score*scale]

df=pd.DataFrame.from_records(list(zip(date_times,people*scale,score)),columns=['When','Who','Status'])
# Create 3 records tom should score 90%, dick 50% and poor harry only 10% 
# Tom should score well
df_to_md(df[df.Who=='tom'].head())

The table is in Markdown format - just to ease my copy and paste....

|    | When                       | Who   |   Status |
|---:|:---------------------------|:------|---------:|
|  0 | 2019-06-18 14:07:17.457124 | tom   |    0.9   |
|  3 | 2019-06-18 14:22:17.457124 | tom   |    0.846 |
|  6 | 2019-06-18 14:37:17.457124 | tom   |    0.828 |
|  9 | 2019-06-18 14:52:17.457124 | tom   |    0.9   |
| 12 | 2019-06-18 15:07:17.457124 | tom   |    0.819 |

Harry scores badly

df_to_md(df[df.Who=='harry'].head())
|    | When                       | Who   |   Status |
|---:|:---------------------------|:------|---------:|
|  2 | 2019-06-18 14:17:17.457124 | harry |    0.013 |
|  5 | 2019-06-18 14:32:17.457124 | harry |    0.038 |
|  8 | 2019-06-18 14:47:17.457124 | harry |    0.023 |
| 11 | 2019-06-18 15:02:17.457124 | harry |    0.079 |
| 14 | 2019-06-18 15:17:17.457124 | harry |    0.064 |

Let's get the average per hour, per person.

def round_to_hour(t):
    # Rounds to nearest hour by adding a timedelta hour if minute >= 30
    return (t.replace(second=0, microsecond=0, minute=0, hour=t.hour)
               +timedelta(hours=t.minute//30))
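To illustrate the cutoff, minutes below 30 round down and 30 or more round up (repeating the helper so the snippet stands alone):

```python
from datetime import datetime, timedelta

def round_to_hour(t):
    # Same helper as above: truncate to the hour, then add one hour when minute >= 30
    return t.replace(second=0, microsecond=0, minute=0) + timedelta(hours=t.minute // 30)

print(round_to_hour(datetime(2019, 6, 18, 14, 7)))   # 2019-06-18 14:00:00
print(round_to_hour(datetime(2019, 6, 18, 14, 37)))  # 2019-06-18 15:00:00
```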

And generate a new column using this method.

df['WhenRounded']=df.When.apply(lambda x: round_to_hour(x))
df_to_md(df[df.Who=='tom'].head())

This should be Tom's data, showing both the original and the rounded timestamps.

|    | When                       | Who   |   Status | WhenRounded         |
|---:|:---------------------------|:------|---------:|:--------------------|
|  0 | 2019-06-18 14:07:17.457124 | tom   |    0.9   | 2019-06-18 14:00:00 |
|  3 | 2019-06-18 14:22:17.457124 | tom   |    0.846 | 2019-06-18 14:00:00 |
|  6 | 2019-06-18 14:37:17.457124 | tom   |    0.828 | 2019-06-18 15:00:00 |
|  9 | 2019-06-18 14:52:17.457124 | tom   |    0.9   | 2019-06-18 15:00:00 |
| 12 | 2019-06-18 15:07:17.457124 | tom   |    0.819 | 2019-06-18 15:00:00 |

We can now resample by grouping and applying an aggregation function.

Group by the rounded date and the person (a Datetime and a Str). We want the mean value in this case, but other aggregations are also available.


df_resampled=df.groupby(by=['WhenRounded','Who'], axis=0).agg({'Status':'mean'}).reset_index()
# Output in Markdown format
df_to_md(df_resampled[df_resampled.Who=='tom'].head())
|    | WhenRounded         | Who   |   Status |
|---:|:--------------------|:------|---------:|
|  2 | 2019-06-18 14:00:00 | tom   |  0.873   |
|  5 | 2019-06-18 15:00:00 | tom   |  0.83925 |
|  8 | 2019-06-18 16:00:00 | tom   |  0.86175 |
| 11 | 2019-06-18 17:00:00 | tom   |  0.84375 |
| 14 | 2019-06-18 18:00:00 | tom   |  0.8505  |

Let's check the mean for tom @ 14:00

print("Check tom 14:00 .86850  ... {:6.5f}".format((.900+.846+.828+.900)/4))
Check tom 14:00 .86850  ... 0.86850

Hope this assists

– Tim Seed