
I have a column called Description in my DataFrame. It contains text like the following:

Description

Summary: SD1: Low free LOG space in database saptempdb: 2.99% Date: 01/01/2017 Severity: Major Reso
Summary: SD1: Low free DATA space in database 10:101:101:1 2.99% Date: 01/01/2017 Severity: Major Res
Summary: SAP SolMan Sys=SM1_SNG01AMMSOL04,MO=AGEEPM40,Alert=Columnstore Unloads,Desc= ,Cat=Exception

How can I extract the server names or IPs from the above descriptions? I have around 10,000 rows.

I have written the code below to split each sentence on spaces. Now I need to filter out the server names or IPs.

    df['sentsplit'] = df["Description"].str.split(" ")
    print(df)
BPK

1 Answer


The general case of what you're asking is "How do I parse this input?". The real question is then: what knowledge of your input can you exploit? Do all the lines follow one or a few forms? Can you place any restrictions on where the hostname or IP address will be on each line?

Given your input, here's a regex I might apply. Quick and dirty -- not elegant -- but if it's only for 10,000 lines, and a one-off job, who cares? It's functional:

    database (\d+:\d+:\d+:\d+)|database (\w+)|Sys=([^, ]+),

This regex assumes that the IP address will always come after the word database and a single space, OR that the hostname will come after the word database and a single space, OR that the hostname will be preceded by Sys= and followed by a comma (the hostname itself containing neither commas nor spaces).

Obviously, test for your purposes, and fine tune as appropriate. In the Python API:

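    import re

    host_or_ip_re = re.compile(r'database (\d+:\d+:\d+:\d+)|database (\w+)|Sys=([^, ]+),')
    # log is any iterable of lines, e.g. df["Description"]
    for line in log:
        m = host_or_ip_re.search(line)
        if m:
            # one group per alternative; the ones that did not match are None
            print(m.groups())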
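
Since groups() returns one slot per alternative, with None for the alternatives that did not match, it is probably handier to collapse it to a single value. Here is a minimal sketch of that, run against the three sample lines from the question. The helper name extract_host is my own invention, and the pattern is the same quick-and-dirty one as above, so the same assumptions about the input apply.

```python
import re

# Same pattern as above: IP after "database ", hostname after "database ",
# or hostname between "Sys=" and a comma.
HOST_OR_IP_RE = re.compile(
    r'database (\d+:\d+:\d+:\d+)|database (\w+)|Sys=([^, ]+),'
)


def extract_host(description):
    """Return the captured hostname or IP, or None if nothing matched.

    extract_host is a hypothetical helper name, not part of any library.
    """
    m = HOST_OR_IP_RE.search(description)
    if m is None:
        return None
    # Only one alternative can match, so exactly one group is not None.
    return next(g for g in m.groups() if g is not None)


# The sample rows from the question, truncated exactly as posted.
samples = [
    "Summary: SD1: Low free LOG space in database saptempdb: 2.99% Date: 01/01/2017 Severity: Major Reso",
    "Summary: SD1: Low free DATA space in database 10:101:101:1 2.99% Date: 01/01/2017 Severity: Major Res",
    "Summary: SAP SolMan Sys=SM1_SNG01AMMSOL04,MO=AGEEPM40,Alert=Columnstore Unloads,Desc= ,Cat=Exception",
]

hosts = [extract_host(s) for s in samples]
print(hosts)
```

With a DataFrame, something like df["Description"].apply(extract_host) should give you a new column of hosts, with None wherever the pattern did not match. Those None rows are the ones to inspect when fine-tuning the regex.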

The detail that always trips me up is the difference between match and search. Match only matches from the beginning of the string, whereas search scans the whole string and returns the first match it finds anywhere.
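
To make the match versus search point concrete, here is a small sketch using a shortened version of the first sample line and only the hostname alternative of the regex. The variable names are just for illustration.

```python
import re

line = "Summary: SD1: Low free LOG space in database saptempdb: 2.99%"
pattern = re.compile(r'database (\w+)')

# match() only tries at position 0; the line starts with "Summary", so it fails.
anchored = pattern.match(line)

# search() scans forward until the pattern fits somewhere in the line.
scanned = pattern.search(line)

print(anchored)          # None
print(scanned.group(1))  # saptempdb
```

For log-style lines like these, where the interesting part sits in the middle of the string, search is almost always the one you want.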

hunteke
  • Thanks for your reply. I agree with your point. First: in most of the rows, the IPs and server names don't come after the word 'database' or after 'Sys='. Second: I am just doing a POC for 10,000 records, but in a real-world scenario it could be several lakhs of records. – BPK Oct 11 '17 at 07:26