
I'm in desperate need of help. I have data that I would like to split into columns using regex in Python; it must use regular expressions.

Jan  9 01:04:49 syntax sshd(pam_unix)[21354]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=120-123-141-4.hinet-ip.hinet.com  

Jul 10 04:17:11 syntax su(pam_unix)[95367]: session opened for user abc by (uid=0)

May  1 14:06:19 syntax su(pam_unix)[95367]: session closed for user abc

Oct 23 18:08:26 syntax logrotate: ALERT exited abnormally with [1]

Jun 14 21:42:52 syntax su(pam_unix)[95367]: session opened for user cbx by (uid=0)

Expected output: (image showing the log lines split into columns)

The data is actually from a URL. I loaded it into a pandas DataFrame and tried using re.split, but it gives me an error:

*ValueError: 1 columns passed, passed data had 24 columns*

I hope I can get the output I need.
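
For context, that ValueError is raised when a DataFrame is built from rows that have more fields than the column labels passed in. A minimal sketch (hypothetical, reconstructing the likely cause) that reproduces the same class of error:

import pandas as pd
import re

line = "Jul 10 04:17:11 syntax su(pam_unix)[95367]: session opened for user abc by (uid=0)"
parts = re.split(r'\s+', line)  # one log line becomes many fields

# Passing a multi-field row with a single column label raises, e.g.:
# ValueError: 1 columns passed, passed data had 24 columns
df = pd.DataFrame([parts], columns=['value'])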

Jae

5 Answers


You could create a regex with named groups, like so:

r'(?P<Timestamp>\w{3}\s+\d{1,2}\s\d{1,2}:\d{2}:\d{2})\s(?P<A1>\w+)\s(?P<A2>[\S]+)\:\s(?P<A3>.*)'

If the above regex doesn't work, you can create your own and test it on regex101.com

You can see how I did it using the examples you provided here.

Then use str.extract, which turns the named groups into column names.

The code would look like this:

import pandas as pd

df = pd.DataFrame(data=["Jan  9 01:04:49 syntax sshd(pam_unix)[21354]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=120-123-141-4.hinet-ip.hinet.com",
"Jul 10 04:17:11 syntax su(pam_unix)[95367]: session opened for user abc by (uid=0)",
"May  1 14:06:19 syntax su(pam_unix)[95367]: session closed for user abc"], columns=["value"])
print(df)

On console,

                                               value
0  Jan  9 01:04:49 syntax sshd(pam_unix)[21354]: ...
1  Jul 10 04:17:11 syntax su(pam_unix)[95367]: se...
2  May  1 14:06:19 syntax su(pam_unix)[95367]: se...

Add this for splitting the value column into the columns that you require,

pattern = r'(?P<Timestamp>\w{3}\s+\d{1,2}\s\d{1,2}:\d{2}:\d{2})\s(?P<A1>\w+)\s(?P<A2>[\S]+)\:\s(?P<A3>.*)'

df1 = df['value'].str.extract(pattern, expand=True)
print(df1)

On console,

     Timestamp      A1                     A2                                                 A3
0  Jan  9 01:04:49  syntax  sshd(pam_unix)[21354]  authentication failure; logname= uid=0 euid=0 ...
1  Jul 10 04:17:11  syntax    su(pam_unix)[95367]             session opened for user abc by (uid=0)
2  May  1 14:06:19  syntax    su(pam_unix)[95367]                        session closed for user abc

Hope this helps, Cheers!

Mister Spurious
  • 271
  • 2
  • 11
  • Hi, thank you for your help; however, I'm still getting an error: `TypeError: Cannot use .str.extract with values of inferred dtype 'bytes'.` Do you happen to know the reason? I did build my original df from a URL in UTF-8 form with a single column, as mentioned previously – Jae May 10 '20 at 08:36
  • Yes, and I ended up with an error after converting: `TypeError: string indices must be integers` – Jae May 10 '20 at 09:27
  • `df['value'] = df['value'].str.decode("utf-8")` should do the job from what I've read (see the sketch below). Are you doing the same? – Mister Spurious May 10 '20 at 09:35
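
For anyone hitting the bytes-dtype errors discussed in these comments: a minimal sketch of the decode step, assuming a single column named value that holds UTF-8 bytes (the data here is illustrative):

import pandas as pd

# Hypothetical bytes column, standing in for data read from a URL.
df = pd.DataFrame({'value': [b'Oct 23 18:08:26 syntax logrotate: ALERT exited abnormally with [1]']})

# .str.extract / .str.split refuse a bytes column, so decode it to str first.
df['value'] = df['value'].str.decode('utf-8')
print(df['value'].str.extract(r'(?P<Timestamp>\w{3}\s+\d{1,2}\s\d{1,2}:\d{2}:\d{2})'))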

Use regex as follows

Data

df = pd.DataFrame({'Text': ['Jan  9 01:04:49 syntax sshd(pam_unix)[21354]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=120-123-141-4.hinet-ip.hinet.com',
                            'Jul 10 04:17:11 syntax su(pam_unix)[95367]: session opened for user abc by (uid=0)',
                            'May  1 14:06:19 syntax su(pam_unix)[95367]: session closed for user abc']})

regex = r'([A-Za-z]+\s+\d+\s+\d+:\d+:\d+)\s+|(?<=\])[:\s+]+|(?<=[x])\s+'

df2 = df.Text.str.split(regex, n=3, expand=True)

df2.rename(columns={0: 'DROP1', 1: 'Timestamp', 2: 'A1', 3: 'DROP', 4: 'A2', 5: 'DROP2', 6: 'A3'}, inplace=True)  # rename columns

df2.drop(columns=['DROP1', 'DROP', 'DROP2'], inplace=True)  # drop unwanted columns

Basically, the pattern is an alternation (|) of three split rules:

(?<=\])[:\s+]+ splits on the ": " that comes after ], or

(?<=[x])\s+ splits on the space that comes after x (the end of the hostname syntax), or

([A-Za-z]+\s+\d+\s+\d+:\d+:\d+)\s+ captures and splits out the timestamp.
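
As a quick illustration (not from the original answer) of why the rename/drop step is needed: str.split keeps each captured group as its own column, inserts None for splits made by alternatives that have no capture group, and yields an empty string before the first match.

import pandas as pd

df = pd.DataFrame({'Text': ['Jul 10 04:17:11 syntax su(pam_unix)[95367]: session opened for user abc by (uid=0)']})
regex = r'([A-Za-z]+\s+\d+\s+\d+:\d+:\d+)\s+|(?<=\])[:\s+]+|(?<=[x])\s+'

# Column 0 is the empty prefix before the timestamp match (DROP1); the
# captured timestamp becomes its own column; the lookbehind-only
# alternatives contribute None for the unused group (DROP, DROP2).
print(df.Text.str.split(regex, n=3, expand=True))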

Outcome

(image: the resulting dataframe with columns Timestamp, A1, A2, A3)

wwnde
  • Thanks for helping; however, I have an error showing `TypeError: Cannot use .str.split with values of inferred dtype 'bytes'.` Does this have to do with my df being `df = pd.DataFrame(url)`? – Jae May 10 '20 at 08:22
  • What version of pandas are you running? – wwnde May 10 '20 at 08:37
  • I'm currently using the '0.25.1' version – Jae May 10 '20 at 08:40
  • Can you please convert the bytes to a string? https://stackoverflow.com/questions/606191/convert-bytes-to-a-string You could also try https://stackoverflow.com/questions/40389764/how-to-translate-bytes-objects-into-literal-strings-in-pandas-dataframe-pytho – wwnde May 10 '20 at 08:44
  • After converting to str my data became [['J'], ['a'], ['n'], [' '], [' '], ['9'], [' '], ['0'], ['1'], [':'], ['0'], ['4'], [':'], ['4'], ['9'], [' '], ['s'], ['y'], ['n'], ['t'], ['a'], ['x '], [' '], ['s'], ['s'],.... – Jae May 10 '20 at 08:50
  • I would apply the conversion to the individual column – wwnde May 10 '20 at 08:52
  • Yes, after you have come up with the df, convert the column to a string – wwnde May 10 '20 at 09:01
  • I didn't convert it to string before creating the df; the results I got were `Empty DataFrame Columns: [raw] Index: []` – Jae May 10 '20 at 09:12
  • I just requested you try after extracting the df and you are telling me otherwise. Is that fair? – wwnde May 10 '20 at 09:14
  • I'm sorry if I have somewhat made you pissed; I'm currently a novice at what I'm doing, so I'm having a hard time understanding pandas as a whole, and I'm sorry if I overstep my boundaries. What I'm trying to say is that I converted it to string before it becomes a df, so I removed the string conversion and did a check, but what came up was an empty df. I honestly do not know why and am hoping you might have some idea, but if I have somehow made you feel it's unfair, I'm sorry – Jae May 10 '20 at 09:25
  • After you come up with the df, say your column is called x, do the following: `df['x'] = df['x'].str.decode("utf-8")` – wwnde May 10 '20 at 09:28
  • Thank you, I managed to finally get it. Sorry for the trouble – Jae May 10 '20 at 09:40

The regex below can split each statement; the required columns will be in the capture groups.

(.*:\d\d)\s(.*?)\s(.*?:)\s(.*)

Check the below link for reference:

regexr.com/549bm

For example, the second record will be split into

Jul 10 04:17:11

syntax

su(pam_unix)[95367]:

session opened for user abc by (uid=0)
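
A minimal sketch of how this pattern could be applied with pandas (the DataFrame and column names here are illustrative, not part of the original answer):

import pandas as pd

df = pd.DataFrame({'value': ['Jul 10 04:17:11 syntax su(pam_unix)[95367]: session opened for user abc by (uid=0)']})

# The four capture groups become four columns; note the third group keeps
# the trailing colon, as shown above.
parts = df['value'].str.extract(r'(.*:\d\d)\s(.*?)\s(.*?:)\s(.*)')
parts.columns = ['Timestamp', 'A1', 'A2', 'A3']
print(parts)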

Jinil CS

Just to start off, "it must use regular expression" doesn't make much sense without a reason; for your purposes, a few plain splits would be faster to figure out and probably similar in robustness. That being said...

If you want to use regex to parse these syslog-like messages, you simply need to figure out patterns for at least three of the four fields, and then stick them together with (named) groups.

We hope to end up with something like:

re_log = rf'(?P<date>{re_date}) (?P<device>{re_device}) (?P<source>{re_source}): (?P<message>{re_message})'

Note the spaces between the groups, and the colon.

Since the message is unlikely to follow any usable pattern, that will have to be our wildcard:

re_message = r'.*'

Likewise, the device is hopefully a valid device id or hostname (e.g. no spaces, just alphanumerics and dashes):

re_device = r'[\w-]+'

We could use datetime or time or some parsing library to get a formal parse for the date, but we don't really care, so let's match your format roughly. We don't know whether your log format uses leading zeroes or leaves them out, so we allow either:

re_date = r'\w{3} \d{1,2} \d{1,2}:\d{2}:\d{2}'

The source is a bit unstructured, but as long as it doesn't contain a space we can just match everything that isn't one, since the colon in the re_log expression delimits it:

re_source = r'[^ ]+'

In the end, trying it out gives us something we can apply to your messages:

>>> import re
>>> eg = "Oct 23 18:08:26 syntax logrotate: ALERT exited abnormally with [1]"
>>> m = re.match(re_log, eg)
>>> m.groupdict()
{'date': 'Oct 23 18:08:26',
 'device': 'syntax',
 'source': 'logrotate',
 'message': 'ALERT exited abnormally with [1]'}
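
To tie this back to pandas, the assembled pattern drops straight into str.extract. A minimal sketch, assuming the raw column is named value (the component patterns must be defined before re_log, since re_log is an f-string):

import pandas as pd

# Component patterns from above; define them before assembling re_log.
re_message = r'.*'
re_device = r'[\w-]+'
re_date = r'\w{3} \d{1,2} \d{1,2}:\d{2}:\d{2}'
re_source = r'[^ ]+'
re_log = rf'(?P<date>{re_date}) (?P<device>{re_device}) (?P<source>{re_source}): (?P<message>{re_message})'

df = pd.DataFrame({'value': [
    'Oct 23 18:08:26 syntax logrotate: ALERT exited abnormally with [1]',
    'Jul 10 04:17:11 syntax su(pam_unix)[95367]: session opened for user abc by (uid=0)',
]})

# str.extract turns the named groups into columns: date, device, source, message.
print(df['value'].str.extract(re_log))

One caveat: re_date uses literal single spaces, so lines like 'Jan  9' with a doubled space would need \s+ in place of the literal spaces.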
Cireo

Solution

You need to use the following regex pattern along with pandas.Series.str.findall() to get what you need quickly and easily.
I also made a convenience function, process_logdata(), so you can use it directly. The convenience function is available at the bottom of this answer.

df = process_logdata(log_file_name='logfile.txt')
print(df)

(image: the parsed dataframe with columns Timestamp, A1, A2, A3)

Logic:

Here is the logic of the convenience function, process_logdata().

# regex pattern
pattern = r'\s*(\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+(\S+?:)\s+(.*)'

# read log file
df = pd.read_csv('logfile.txt', header=None).rename(columns={0: 'logline'})
# process data
ds = df.logline.str.strip().str.findall(pattern)
a = np.array([list(e) for e in ds]).reshape(ds.size, -1)
# finalize processed data as a dataframe
df = pd.DataFrame(a, columns=['Timestamp', 'A1', 'A2', 'A3'])
print(df)

Example

Here we use dummy data (provided as a string). First we load it into a pandas dataframe and then process it.

import numpy as np
import pandas as pd
import re
from io import StringIO 

s = """
Jan  9 01:04:49 syntax sshd(pam_unix)[21354]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=120-123-141-4.hinet-ip.hinet.com  

Jul 10 04:17:11 syntax su(pam_unix)[95367]: session opened for user abc by (uid=0)

May  1 14:06:19 syntax su(pam_unix)[95367]: session closed for user abc

Oct 23 18:08:26 syntax logrotate: ALERT exited abnormally with [1]

Jun 14 21:42:52 syntax su(pam_unix)[95367]: session opened for user cbx by (uid=0)
"""
s = re.sub(r'\n\s*\n', '\n', s).strip()
#print(s)

df = pd.read_csv(StringIO(s), header=None).rename(columns={0: 'logline'})
pattern = r'\s*(\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+(\S+?:)\s+(.*)'
ds = df.logline.str.strip().str.findall(pattern)
a = np.array([list(e) for e in ds]).reshape(ds.size,-1)
df = pd.DataFrame(a, columns=['Timestamp', 'A1', 'A2', 'A3'])
print(df)

Output:

(image: the parsed dataframe with columns Timestamp, A1, A2, A3)

Convenience Function

import numpy as np
import pandas as pd
import re

def process_logdata(log_file_name):
    """Returns a dataframe created from the log file.
    """
    ## Define regex pattern
    pattern = r'\s*(\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+(\S+?:)\s+(.*)'

    ## Read log file
    df = (pd
          .read_csv(log_file_name, header=None)
          .rename(columns={0: 'logline'})
    )
    ## Process data
    ds = df['logline'].str.strip().str.findall(pattern)
    a = np.array([list(e) for e in ds]).reshape(ds.size,-1)
    ## Finalize processed data as a dataframe 
    df = pd.DataFrame(a, columns=['Timestamp', 'A1', 'A2', 'A3'])
    return df
CypherX