1

I have a pandas column containing the message field of windows event logs like below. How can I go through and drop all the non key value style pairs?

The message column contains similar data but possibly more key:value types than shown as this is only a single event id.

message
['subject':'none','security id':'s-1-5-12','account name':'myaccountname','account domain':'domain', 'logon id':'0x3e6',    ' process information':'none', 'new process id':'0x1a53', 'new process name':'c:\windows\system32\ipconfig.exe', 'token elevation type':'%%1932','creator process id':'0x1b33', 'process command line':'none',  '  token elevation type indicates the type of token that was assigned to the new process in accordance with user account control policy.',' type 1 is a full token with no privileges removed or groups disabled.  a full token is only used if user account control is disabled or if the user is the built-in administrator account or a service account.', ' type 2 is an elevated token with no privileges removed or groups disabled.  an elevated token is used when user account control is enabled and the user chooses to start the program using run as administrator.  an elevated token is also used when an application is configured to always require administrative privilege or to always require maximum privilege', ' and the user is a member of the administrators group.',' type 3 is a limited token with administrative privileges removed and administrative groups disabled.  the limited token is used when user account control is enabled', ' the application does not require administrative privilege', ' and the user does not choose to start the program using run as administrator.']
['subject':'none','security id':'s-1-5-13','account name':'myaccountname','account domain':'domain', 'logon id':'0x3e6',    ' process information':'none', 'new process id':'0x1a53', 'new process name':'c:\windows\system32\net.exe', 'token elevation type':'%%1932','creator process id':'0x1b33', 'process command line':'none',   '  token elevation type indicates the type of token that was assigned to the new process in accordance with user account control policy.',' type 1 is a full token with no privileges removed or groups disabled.  a full token is only used if user account control is disabled or if the user is the built-in administrator account or a service account.', ' type 2 is an elevated token with no privileges removed or groups disabled.  an elevated token is used when user account control is enabled and the user chooses to start the program using run as administrator.  an elevated token is also used when an application is configured to always require administrative privilege or to always require maximum privilege', ' and the user is a member of the administrators group.',' type 3 is a limited token with administrative privileges removed and administrative groups disabled.  the limited token is used when user account control is enabled', ' the application does not require administrative privilege', ' and the user does not choose to start the program using run as administrator.']

Expected output:

subject  security id   account name  logon id  process information  new processs id                  new process name  token elevation type  creator process id   process command line
   none     s-1-5-12  myaccountname     0x3e6                 none           0x1a53  c:\windows\system32\ipconfig.exe                %%1932              0x1b33                   none

If i could get the non key:value pairs out of my data, I know I can use this method.

pandas list of dictionary to separate columns

Community
  • 1
  • 1
johnnyb
  • 1,745
  • 3
  • 17
  • 47

1 Answers1

1

You can use yaml, which add None values if missing value in dict and then remove all None pairs:

print (df)
                                       message
0  {'a':'none','b':'2', '  token.', ' type 1'}

import yaml

print (df.message.apply(yaml.load))
0    {'  token.': None, ' type 1': None, 'b': '2', ...
Name: message, dtype: object

df.message = df.message.apply(lambda x: {k: v for k, v in yaml.load(x).items() if v})
print (df)
                   message
0  {'b': '2', 'a': 'none'}

With your data:

df = pd.DataFrame({'message':["{'subject':'none', 'security id':'s-1-5-12', 'account name':'myaccountname','account domain':'domain', 'logon id':'0x3e6', ' process information':'none', 'new process id':'0x1a53', 'new process name':'c:\windows\system32\ipconfig.exe', 'token elevation type':'%%1932', 'creator process id':'0x1b33','process command line':'none', '  token elevation type indicates the type of token that was assigned to the new process in accordance with user account control policy.', ' type 1 is a full token with no privileges removed or groups disabled.  a full token is only used if user account control is disabled or if the user is the built-in administrator account or a service account.', ' type 2 is an elevated token with no privileges removed or groups disabled.  an elevated token is used when user account control is enabled and the user chooses to start the program using run as administrator.  an elevated token is also used when an application is configured to always require administrative privilege or to always require maimum privilege', ' and the user is a member of the administrators group.',' type 3 is a limited token with administrative privileges removed and administrative groups disabled.  the limited token is used when user account control is enabled', ' the application does not require administrative privilege', ' and the user does not choose to start the program using run as administrator.'}"]})

import yaml
df.message = df.message.apply(lambda x: {k: v for k, v in yaml.load(x).items() if v})

df1 = pd.DataFrame(df.pop('message').values.tolist(), index=df.index)
print (df1)
   process information account domain   account name creator process id  \
0                 none         domain  myaccountname             0x1b33   

  logon id new process id                  new process name  \
0    0x3e6         0x1a53  c:\windows\system32\ipconfig.exe   

  process command line security id subject token elevation type  
0                 none    s-1-5-12    none               %%1932  

EDIT:

import yaml
df.message=df.message.str[0].apply(lambda x:{k:v for k,v in yaml.load('{'+x+'}').items() if v})

df1 = pd.DataFrame(df.pop('message').values.tolist(), index=df.index)
print (df1)
   process information account domain   account name creator process id  \
0                 none         domain  myaccountname             0x1b33   

  logon id new process id                  new process name  \
0    0x3e6         0x1a53  c:\windows\system32\ipconfig.exe   

  process command line security id subject token elevation type  
0                 none    s-1-5-12    none               %%1932  
Community
  • 1
  • 1
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252