Consider a dataframe with a column of timestamps and a second column of measured values.

    import pandas as pd

    data = {'Time':  [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
            'Value': [13,54,68,94,31,68,46,46,31,53,54,85,75,42,69]}

    df = pd.DataFrame(data, columns=['Time', 'Value'])

We want to filter the dataframe to keep only the values at specific timings.

    start = [2, 9, 14]
    end = [5, 12, 15]

In this case, there are 3 timeframes we want to keep: from 2s to 5s, from 9s to 12s, and from 14s to 15s. I created a column that marks the boundaries of these timeframes.

    df.loc[df["Time"].isin(start), "Observation"] = 'Start'
    df.loc[df["Time"].isin(end), "Observation"] = 'End'

To filter the rows, I was thinking of filling the cells between Start and End and then removing the empty rows. This is where I'm stuck.

I tried a forward fill:

    df = df.ffill()  # same as fillna(method='ffill'), which is deprecated in recent pandas

The issue with this approach is that the fill should only propagate "Start" (to populate the inside of the observation timeframe); it should not keep filling after "End".

My first idea was to create another list of timestamps, each one being the end of a timeframe plus 1:

    import pandas as pd

    data = {'Time':  [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
            'Value': [13,54,68,94,31,68,46,46,31,53,54,85,75,42,69]}

    df = pd.DataFrame(data, columns=['Time', 'Value'])

    start = [2, 9, 14]
    end = [5, 12, 15]
    out = [x+1 for x in end]

    df.loc[df["Time"].isin(start), "Observation"] = 'Start'
    df.loc[df["Time"].isin(end), "Observation"] = 'End'
    df.loc[df["Time"].isin(out), "Observation"] = 'Out'

    df = df.ffill()  # same as fillna(method='ffill'), which is deprecated in recent pandas

The issue with this approach is that, in my real data, the timestamps are not seconds at regular intervals. They are milliseconds at irregular intervals, so adding 1 to create the "Out" tag is not reliable. It also feels like I'm overcomplicating something that should be simple: keeping only the observations between each start timestamp and its end timestamp (both included).

Using a filter (see "filter/select rows of pandas dataframe by timestamp column") could be an option. However, the number of timeframes of interest varies from one session to the next. I tried to write a for loop over the list of start timestamps and the list of end timestamps to build such a filter dynamically, but I couldn't get it working.

If anyone knows of a function that does exactly what I need, or has any tips, that would be great.

Thank you.

Alphynn

1 Answer

How about creating a function that zips your start and end lists and checks whether an element falls within any of the resulting pairs:

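    def catch_df(start, end, element):
        # Pair each start with its end and label the first window containing element.
        for i, (lo, hi) in enumerate(zip(start, end)):
            if lo <= element <= hi:  # both bounds included
                return "df{}".format(i)
        return None  # element is outside every window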

Then apply that function to the values stored in the Time column of df:

    df['Result'] = df['Time'].apply(lambda x: catch_df(start, end, x))

As a result, you get the following dataframe, which can then easily be filtered on the None values:

        Time  Value  Observation  Result
    0      1     13          NaN    None
    1      2     54        Start     df0
    2      3     68          NaN     df0
    3      4     94          NaN     df0
    4      5     31          End     df0
    5      6     68          NaN    None
    6      7     46          NaN    None
    7      8     46          NaN    None
    8      9     31        Start     df1
    9     10     53          NaN     df1
    10    11     54          NaN     df1
    11    12     85          End     df1
    12    13     75          NaN    None
    13    14     42        Start     df2
    14    15     69          End     df2
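
To make the last step concrete, here is a minimal sketch of the filtering itself, rebuilt from the question's sample data. It first drops the rows labelled None, then shows a possible alternative that builds a boolean mask in a loop over the start/end pairs, which is close to what the question was attempting. The names `kept_by_label`, `mask` and `kept` are only illustrative, and the loop variant is a suggestion of mine rather than something I benchmarked against the `apply` version.

```python
import pandas as pd

df = pd.DataFrame({
    "Time":  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
    "Value": [13, 54, 68, 94, 31, 68, 46, 46, 31, 53, 54, 85, 75, 42, 69],
})
start = [2, 9, 14]
end = [5, 12, 15]


def catch_df(start, end, element):
    # Label the first (start, end) window that contains element, bounds included.
    for i, (lo, hi) in enumerate(zip(start, end)):
        if lo <= element <= hi:
            return "df{}".format(i)
    return None


# Option 1: label every row, then keep only the rows that received a label.
df["Result"] = df["Time"].apply(lambda x: catch_df(start, end, x))
kept_by_label = df[df["Result"].notna()]

# Option 2 (illustrative alternative): OR together one inclusive window per pair.
mask = pd.Series(False, index=df.index)
for lo, hi in zip(start, end):
    mask |= df["Time"].between(lo, hi)  # between() includes both bounds by default
kept = df[mask]

print(kept[["Time", "Value"]])
```

Neither option relies on the timestamps being evenly spaced, so the same comparisons should carry over to millisecond timestamps at irregular intervals, as long as `start` and `end` use the same units as the Time column.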
Oskar_U