Apache Beam- Filter the Lines and select only those lines having specific keywords and store these lines in a Pandas DataFrame

Question

I have a text file having 10000 log lines like

2022-12-27T00:00:00+00:00 VM_DEV02 sshd[25690]: pam_unix(sshd:session): session closed for user USER7

Main tasks are:

Filter only those lines having-['unauthorized','error','kernel error','OS error','rejected','warning',"error"] these words .
Split the lines into different parts and store this data in a Dataframe using Apache Beam

I have tried this way by writing and calling Functions but it is not working as expected.

    import apache_beam as beam
    from apache_beam.io import ReadFromText
    from apache_beam.pvalue import AsList
    from apache_beam.transforms import Map, Filter
    import pandas as pd

    def extract_fields(line):
        timestamp = line.split(" ")[0]
        print(timestamp)
        hostname = line.split(" ")[1]
        print(hostname)
        process_name = line.split(" ")[2].split("[")[0]
        print(process_name)
        pid = line.split(" ")[2].split("[")[1].split("]")[0]
        print(pid)
        text = " ".join(line.split(" ")[3:])
        print(text)

        return [timestamp, hostname, process_name, pid, text]

    with beam.Pipeline() as pipeline:
        log_lines = pipeline | 'Read log line' >> beam.io.ReadFromText("/Analytics/venv/Jup/CAPE_Apache_Beam/Sample_text_file")
        print(log_lines)
    
        fields = log_lines | 'Extract fields' >> Map(extract_fields)
        print(fields)
 
        df = (fields | 'Write to dataframe' >> Map(lambda fields: pd.DataFrame(fields, columns=['timestamp', 'hostname', 'process_name', 'pid', 'text'])))
        print(df)

Please don't post pictures of code or other text. Instead, copy the original text to your question — James Z, Feb 20 '23 at 15:41
The thing was code was kind of huge and was throwing some indentation errors, anyways now, I have added the code , can anyone please suggest and help me here please. — samh125679, Feb 21 '23 at 07:30
Does this [stack link](https://stackoverflow.com/questions/65144420) help you? — Prajna Rai T, Feb 21 '23 at 11:49
Can you provide the input sample which has the keywords which have to be filtered? — Prajna Rai T, Feb 22 '23 at 09:28

Apache Beam- Filter the Lines and select only those lines having specific keywords and store these lines in a Pandas DataFrame

0 Answers0