3

I am a beginner for Python programming. I am exploring Regex. I am trying to Extract a word(Database name) from the Description column. I am not able to give multiple Regex patterns.

please see the description and the code below.

Description

Summary: AD1: Low free DATA space in database AD1ADS: 10.00% Date: 06/28/2017 Severity: Warning Res
Summary: Database SV1V1CH has used log space: 90.00% Date: 02/06/2017 Severity: Warning ResourceId: s
Summary: SAP SolMan Sys=SM1Tempdb,MO=AGEEPM49,Alert=Database Host Status,Desc=A database hos
*** Clearing Event Received *** SNG01AMMSOL04_age SAP SolMan Sys=SM1_SNG01AMMSOL04,MO=AGEEQM46,Alert

Expected Ouput of DB Names Extracted

AD1ADS
SV1V1CH
SM1Tempdb
SNG01AMMSOL04

Code Tried

sentence = df['Description']
frame = pd.DataFrame({'logs': sentence})

import re
pattern = re.compile(r'[dD]atabase (\w+)|Sys=(\w+)')

for _, line in frame.iterrows():
    name = pattern.findall(line['logs'])
    if name:
        line['names'] = name[0]
    else:
        line['names'] = 'Miscellaneous'

Could anyone please tell me, what mistake I am doing it here.

Output which I am getting Now

(u'AD1ADS', u'')
(u'SV1V1CH', u'')
(u'', u'CM1_CHE01AMMSOL04')
Miscellaneous
Pablo C
  • 4,661
  • 2
  • 8
  • 24
BPK
  • 65
  • 1
  • 10

2 Answers2

4

You can use str.extract with fillna:

p = r'[dD]atabase (\w+)|Sys=(\w+)'
s = df['logs'].str.extract(p, expand=True)
print (s)
         0                  1
0   AD1ADS                NaN
1  SV1V1CH                NaN
2      NaN          SM1Tempdb
3      NaN  SM1_SNG01AMMSOL04

df['db'] = s[0].fillna(s[1]).fillna('Miscellaneous')
#alternatively 
#df['db'] = s[0].combine_first(s[1]).fillna('Miscellaneous')
print (df)
                                                logs                 db
0  Summary: AD1: Low free DATA space in database ...             AD1ADS
1  Summary: Database SV1V1CH has used log space: ...            SV1V1CH
2  Summary: SAP SolMan Sys=SM1Tempdb,MO=AGEEPM49,...          SM1Tempdb
3  *** Clearing Event Received *** SNG01AMMSOL04_...  SM1_SNG01AMMSOL04

And if want extract all possible values use extractall and then join them if necessary:

p = r'[dD]atabase (\w+)|Sys=(\w+)'
s = df['logs'].str.extractall(p)
print (s)
               0                  1
  match                            
0 0       AD1ADS                NaN
1 0      SV1V1CH                NaN
2 0          NaN          SM1Tempdb
  1         Host                NaN
  2          hos                NaN
3 0          NaN  SM1_SNG01AMMSOL04

df['db'] = s[0].fillna(s[1]).groupby(level=0).apply(', '.join)
df['db'] = df['db'].fillna('Miscellaneous')
print (df)
                                                logs                    db
0  Summary: AD1: Low free DATA space in database ...                AD1ADS
1  Summary: Database SV1V1CH has used log space: ...               SV1V1CH
2  Summary: SAP SolMan Sys=SM1Tempdb,MO=AGEEPM49,...  SM1Tempdb, Host, hos
3  *** Clearing Event Received *** SNG01AMMSOL04_...     SM1_SNG01AMMSOL04
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • p = r'[dD]atabase (\w+)|Sys=(\w+)| SAP: (\w+)' s = frame['logs'].str.extract(p, expand=True) print (s) frame['DBNames'] = s[0].fillna(s[1]) frame['DBNames'] = s[0].fillna(s[2]) Can I modify as below, fo multiple condition – BPK Oct 30 '17 at 09:26
  • Thanks Ashish and @jezrael – BPK Oct 30 '17 at 09:29
  • @BPK - do you want for multiple condition new columns? For each pattaern one new column? – jezrael Oct 30 '17 at 09:33
  • Yes. As of now, I am able to see columns, 0 and 1 for two match conditions and combined it using df['db'] = s[0].fillna(s[1]).groupby(level=0).apply(', '.join) But, now adding another condition, say - p = r'[dD]atabase (\w+)|Sys=(\w+)|SAP: (\w+)' How can I combine 0,1 and 2. replacing the NAs – BPK Oct 30 '17 at 09:46
  • I get it, need `s.apply(lambda x: ','.join(x.dropna()), axis=1).groupby(level=0).apply(', '.join)` - it working with multiple columns nice – jezrael Oct 30 '17 at 09:50
1
p = r'[dD]atabase (\w+)|Sys=(\w+)|SAP: (\w+)'
s = df['logs'].str.extractall(p)
print (s)

df['DBNames'] = s.apply(lambda x: ','.join(x.dropna()),axis=1).groupby(level=0).apply(', '.join)
df['DBNames'] = df['DBNames'].fillna('Miscellaneous')
print df

This worked for me :)

BPK
  • 65
  • 1
  • 10