1

I'm trying to see if there's a way to use Pandas to do the following calculation:

I have a table which has Event numbers followed by their respective start and end datetimes.

Event Number    Start            End
     1      6/1/2020 13:00  6/1/2020 13:30
     2      6/1/2020 17:45  6/1/2020 19:15
     3      6/4/2020 8:00   6/4/2020 9:10
     4      6/7/2020 11:00  6/7/2020 11:50

And separately I have another table which has the raw time indexed data for each sensor

   Datetime    Sensor 1     Sensor 2    Sensor 3
6/1/2020 0:00     80           4           0
6/1/2020 0:01     80           5           0
6/1/2020 0:02     69           9           1
6/1/2020 0:03     72           8           0
6/1/2020 0:04     60           9           0
6/1/2020 0:05     76           3           0
6/1/2020 0:06     62           8           1
6/1/2020 0:07     80           8           0
6/1/2020 0:08     64           7           1

Is there a way that I can run down that original table and crop this one based on the START and END timestamps?

What I'm trying the final table to look like is something that includes the

  • Open
  • Close
  • Min
  • Max
  • Mean

of all sensors between those dates.

I've tried without much luck to use a for loop but it takes very long (+3M lines of raw data).

Is there any way to do this with pandas?

Thanks to all in advance

EDIT 1: Added what I had in mind as an output for clarification:

Event   Type    Sensor 1    Sensor 2    Sensor 3
  1     Open       60          5           1
  1     Close      69          8           0
  1     Max        78          8           1
  1     Min        59          4           0
  1     Mean       69          8           0.333
  2     Open       77          4           0
  2     Close      73          6           1
  2     Max        77          6           1
  2     Min        68          4           0
  2     Mean       74          6           0.667 
  3     Open       63          4           1
  3     Close      71          7           1
  3     Max        70          8           1
  3     Min        63          3           0
  3     Mean       65          4           1 
adrian
  • 148
  • 8

2 Answers2

0

You can use pd.between to get the sensor values you want and then create a MultiIndexed Dataframe to improve the display of data obtained. You will find helpful info here: how-to-check-where-datetime-is-in-between-two-datetimes-in-pandas and constructing-3d-pandas-dataframe. You can try something like this:

import pandas as pd
import numpy as np
import io
from statistics import mean

s_e = """Event Number    start            end
     1      6/1/2020 0:00  6/1/2020 0:02
     2      6/1/2020 0:05  6/1/2020 0:08"""
s_s = """   Datetime    Sensor 1     Sensor 2    Sensor 3
6/1/2020 0:00     80           4           0
6/1/2020 0:01     80           5           0
6/1/2020 0:02     69           9           1
6/1/2020 0:03     72           8           0
6/1/2020 0:04     60           9           0
6/1/2020 0:05     76           3           0
6/1/2020 0:06     62           8           1
6/1/2020 0:07     80           8           0
6/1/2020 0:08     64           7           1"""
events = pd.read_csv(io.StringIO(s_e), sep='\s\s+', parse_dates=[1,2], engine='python')
sensors = pd.read_csv(io.StringIO(s_s), sep='\s\s+', parse_dates=[0], engine='python')

#We create a dataframe with all values of ['Values','Open','Close','Min' ,'Max', 'Mean'] of each sensor
dfEveSen=pd.DataFrame()
pd.set_option('display.max_columns', None)
for sensor in sensors.columns[1:]:
    #we get the sensor values between start and end of each event 
    dfEveSen[sensor+'values']=[list(sensors[sensor][sensors.Datetime.between(start, end)].agg(list)) for start, end in zip(events['start'], events['end'])] 
    dfEveSen['first'+sensor]=dfEveSen[sensor+'values'].apply(lambda x: x[0])
    dfEveSen['last'+sensor]=dfEveSen[sensor+'values'].apply(lambda x: x[len(x)-1])
    dfEveSen['min'+sensor]=dfEveSen[sensor+'values'].apply(lambda x: min(x))
    dfEveSen['max'+sensor]=dfEveSen[sensor+'values'].apply(lambda x: max(x))
    dfEveSen['mean'+sensor]=dfEveSen[sensor+'values'].apply(lambda x: mean(x))



#We get the data of dfEveSen to create a MultiIndex dataframe
dataa=[dfEveSen[colum] for colum in dfEveSen.columns]
dataa=np.array(dataa)

#we define the the second indexed row ['Values','Open','Close','Min' ,'Max', 'Mean']
A = np.array( ['Values','Open','Close','Min' ,'Max', 'Mean']*3)

#we define the the first indexed row: ['Sensor 1', 'Sensor 2', 'Sensor 3']
B=np.repeat(sensors.columns[1:],6)

#We create the MultiIndex dataframe called sensorevent
sensorevent = pd.DataFrame(data=dataa.T, columns=pd.MultiIndex.from_tuples(zip(B,A)))
sensorevent.index.name = 'event'
sensorevent.index +=1
print(sensorevent)
#if you want to erase the column of values try this:
#sensorevent = sensorevent.drop('Values', axis=1, level=1)
#print(sensorevent)

Output:

                        Sensor 1                                  Sensor 2                              Sensor 3
                 Values Open Close Min Max     Mean        Values Open Close Min Max Mean        Values Open Close Min Max      Mean
event
1          [80, 80, 69]   80    69  69  80  76.3333     [4, 5, 9]    4     9   4   9    6     [0, 0, 1]    0     1   0   1  0.333333
2      [76, 62, 80, 64]   76    64  62  80     70.5  [3, 8, 8, 7]    3     7   3   8  6.5  [0, 1, 0, 1]    0     1   0   1       0.5
MrNobody33
  • 6,413
  • 7
  • 19
-1

First we create an IntervalIndex from Start and End in the events dataframe (df_e). Using get_indexer we get the event numbers from df_e and assign them as new column to the sensor dataframe (df_s). The important thing is here that get_indexer returns -1 for missing values so we have to add a corresponding row for missing event to the end of df_e so that iloc[-1] returns this row and not the last row of the original data. Then we simply groupby the event number.

idx = pd.IntervalIndex.from_arrays(df_e.Start, df_e.End, 'both')
df_s.assign(event=df_e.append(pd.Series(dtype='Int64'), ignore_index=True).iloc[idx.get_indexer(df_s.Datetime), 0].values).groupby('event')[['Sensor 1', 'Sensor 2', 'Sensor 3']].agg(['first', 'last', 'min', 'max', 'mean'])

Example:

import pandas as pd
import io

s_e = """Event Number    Start            End
     1      6/1/2020 0:00  6/1/2020 0:02
     2      6/1/2020 0:05  6/1/2020 0:08"""
s_s = """   Datetime    Sensor 1     Sensor 2    Sensor 3
6/1/2020 0:00     80           4           0
6/1/2020 0:01     80           5           0
6/1/2020 0:02     69           9           1
6/1/2020 0:03     72           8           0
6/1/2020 0:04     60           9           0
6/1/2020 0:05     76           3           0
6/1/2020 0:06     62           8           1
6/1/2020 0:07     80           8           0
6/1/2020 0:08     64           7           1"""
df_e = pd.read_csv(io.StringIO(s_e), sep='\s\s+', parse_dates=[1,2], engine='python')
df_s = pd.read_csv(io.StringIO(s_s), sep='\s\s+', parse_dates=[0], engine='python')

idx = pd.IntervalIndex.from_arrays(df_e.Start, df_e.End, 'both')
df_s.assign(event=df_e.append(pd.Series(dtype='Int64'),ignore_index=True).iloc[idx.get_indexer(df_s.Datetime),0].values).groupby('event')[['Sensor 1', 'Sensor 2', 'Sensor 3']].agg(['first', 'last', 'min', 'max', 'mean'])

Result:

      Sensor 1                         Sensor 2                   Sensor 3                       
         first last min max       mean    first last min max mean    first last min max      mean
event                                                                                            
1           80   69  69  80  76.333333        4    9   4   9  6.0        0    1   0   1  0.333333
2           76   64  62  80  70.500000        3    7   3   8  6.5        0    1   0   1  0.500000


This solution scales well for large datasets. For 100K rows sensor data and 5K events it takes 296 ms whereas the other answer with pd.between needs 16.6 s.
Stef
  • 28,728
  • 2
  • 24
  • 52
  • Do you know maybe why this throws `ValueError: no results` when is executed from Spyder or Jupyter Notebook and when is executed from cmd it works? – MrNobody33 Jun 15 '20 at 10:59
  • @MrNobody33: No. I don't get any errors running the example in Spyder (python 3.8.2, pandas 1.0.4) . If you downvoted the answer maybe you could give some explanation as to why this answer is not useful? – Stef Jun 15 '20 at 11:56