Suppose I have a log file parsed and placed into a pandas.DataFrame.

I'd like to create a new boolean column that is True only if the current line contains the EXPRESSION_1 string and the next line contains the EXPRESSION_2 string.

I can do it for just a single expression, as shown in Example 1 below:

Example 1:

import pandas as pd


EXPRESSION_1 = 'Starts streaming the stream rtspsrc'
EXPRESSION_2 = 'initializing gst pipeline'
df = pd.DataFrame(
    {
        'message': [
            'Some log text',
            'Some log text',
            'Starts streaming the stream rtspsrc',
            'Some log text',
            'Some log text',
            'Starts streaming the stream rtspsrc',
            'initializing gst pipeline',
            'Some log text',
        ]

    }
)
# Mark every row whose message contains EXPRESSION_1 (plain substring match, no regex).
df['process_started'] = df['message'].str.contains(EXPRESSION_1, regex=False)
df

Output of Example 1:

    message                                 process_started
0   Some log text                           False
1   Some log text                           False
2   Starts streaming the stream rtspsrc     True
3   Some log text                           False
4   Some log text                           False
5   Starts streaming the stream rtspsrc     True
6   initializing gst pipeline               False
7   Some log text                           False

Desired Output:

    message                                 process_started
0   Some log text                           False
1   Some log text                           False
2   Starts streaming the stream rtspsrc     False # <= Note the False here
3   Some log text                           False
4   Some log text                           False
5   Starts streaming the stream rtspsrc     True
6   initializing gst pipeline               False
7   Some log text                           False

Thanks in advance for any suggestions.

Michael

2 Answers

You can use shift for this. shift(-1) moves the message column up by one row, so each row can be compared against the message that follows it:

import pandas as pd

EXPRESSION_1 = 'Starts streaming the stream rtspsrc'
EXPRESSION_2 = 'initializing gst pipeline'
df = pd.DataFrame(
    {
        'message': [
            'Some log text',
            'Some log text',
            'Starts streaming the stream rtspsrc',
            'Some log text',
            'Some log text',
            'Starts streaming the stream rtspsrc',
            'initializing gst pipeline',
            'Some log text',
        ]

    }
)
# True only where this row equals EXPRESSION_1 and the following row equals EXPRESSION_2.
# shift(-1) leaves NaN in the last row, which compares as not equal, so it ends up False.
df['process_started'] = (df['message'] == EXPRESSION_1) & (df['message'].shift(-1) == EXPRESSION_2)

Output:

    message                                 process_started
0   Some log text                           False
1   Some log text                           False
2   Starts streaming the stream rtspsrc     False
3   Some log text                           False
4   Some log text                           False
5   Starts streaming the stream rtspsrc     True
6   initializing gst pipeline               False
7   Some log text                           False
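
To see what shift(-1) does in isolation, here is a small sketch on a made-up three-element Series (the values are illustrative and not taken from the question):

```python
import pandas as pd

# Hypothetical toy data, only to show the effect of shift(-1).
s = pd.Series(['a', 'b', 'c'])

# Each position now holds the value from the next row;
# the last position has no successor and becomes NaN.
shifted = s.shift(-1)
print(shifted.tolist())
```

Because the last row has nothing after it, any comparison against the shifted column is False there.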
Aditya
  • Nice idea, though it does not fit my setup, as the `message` column in my `DataFrame` won't be just `EXPRESSION_1` or `EXPRESSION_2`. Thanks though – Michael Apr 28 '21 at 14:27

I found a solution based on another answer:

import pandas as pd


EXPRESSION_1 = 'Starts streaming the stream rtspsrc'
EXPRESSION_2 = 'initializing gst pipeline'
df = pd.DataFrame(
    {
        'message': [
            'Some log text',
            'Some log text',
            'Starts streaming the stream rtspsrc',
            'Some log text',
            'Some log text',
            'Starts streaming the stream rtspsrc',
            'initializing gst pipeline',
            'Some log text',
        ]

    }
)
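# For each row, check that it contains EXPRESSION_1 and that the next row
# (if there is one) contains EXPRESSION_2. Assumes the default 0..n-1 index.
df['process_started'] = df.apply(
    lambda row: (
        EXPRESSION_1 in row['message']
        and row.name + 1 < df.shape[0]  # guard: the last row has no successor
        and EXPRESSION_2 in df.loc[row.name + 1, 'message']
    ),
    axis=1,
)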
df

Output:

    message                                 process_started
0   Some log text                           False
1   Some log text                           False
2   Starts streaming the stream rtspsrc     False
3   Some log text                           False
4   Some log text                           False
5   Starts streaming the stream rtspsrc     True
6   initializing gst pipeline               False
7   Some log text                           False
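
The row-wise apply above can probably be avoided by combining substring matching with the shift idea from the other answer. The following is a minimal sketch, assuming the messages only contain the expressions somewhere inside a longer line; the timestamps and "(cam N)" suffixes in the sample data are made up for illustration:

```python
import pandas as pd

EXPRESSION_1 = 'Starts streaming the stream rtspsrc'
EXPRESSION_2 = 'initializing gst pipeline'

# Hypothetical log lines where the expressions are only substrings.
df = pd.DataFrame({'message': [
    'Some log text',
    '12:00:01 Starts streaming the stream rtspsrc (cam 1)',
    'Some log text',
    '12:00:05 Starts streaming the stream rtspsrc (cam 2)',
    '12:00:06 initializing gst pipeline',
    'Some log text',
]})

# Does this row contain EXPRESSION_1? (plain substring match, no regex)
has_first = df['message'].str.contains(EXPRESSION_1, regex=False)

# Does the NEXT row contain EXPRESSION_2? Compute the match per row, then
# shift it up by one; the last row has no successor, so fill it with False.
next_has_second = (
    df['message'].str.contains(EXPRESSION_2, regex=False).shift(-1, fill_value=False)
)

df['process_started'] = has_first & next_has_second
print(df)
```

This version does not depend on the index labels, since shift works by position, and it stays vectorized on larger logs.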
Michael