I am working on a tool that extracts field-value pairs from text files. Until now I had been developing on a Windows machine. When I tested the tool on Linux, the number of fields it extracted increased.
The regex:
([^,;\n\v{}<>\t=:\[\]\"\']+?)[=:][ \t]*(?:\"((?:[^\"\\]|\\.)*)\"|([^\t =:\"\n\v\t\{\}\[\]<>](?!(?!,)\S+[:=])(?:[^\n\v\t\{\}\[\]=:<>](?!(?!,)\S+[:=]))*))
The sample file:
05/02/2011 03:47:12 PM
LogName=Security
SourceName= ##Source_Name##
EventCode=4624
EventType=##Event_Type##
Type=##Type##
ComputerName=##Computer_Name##
TaskCategory=##Task_Category##
OpCode=##OpCode##
RecordNumber=##Record_Number##
Keywords=##Keyword_Success##
Message=An account was successfully logged on.
January 05-11 03:47:12 PM
Subject:
Security ID: ##Domain##\SYSTEM
Account Name: ##Computer_Name##
Account Domain: ##Domain##
Logon ID: 0x##System_Logon_Id##
Jan 27 03:47:12 PM
Logon Information:
Logon Type: ##Logon_Type##
Restricted Admin Mode: ##Restricted_Admin_Mode##
Virtual Account: ##Virtual_Account##
Elevated Token: ##Elevated_Token##
Impersonation Level: ##Impersonation_Level##
New Logon:
Security ID: ##Domain##\##User_Name##
Account Name: ##User_Name##
Account Domain: ##Domain##
Logon ID: 0x##Logon_Id##
Linked Logon ID: ##Linked_Logon_Id##
Network Account Name: ##User_Name2##
Network Account Domain: ##Domain2##
Logon ##GUID##: ##Logon_Guid##
Process Information:
Process ID: 0x##Process_Id##
Process Name: ##Process_Name##
Network Information:
Workstation Name: ##Computer_Name##
Source Network Address: ##Network_Ip##
Source Port: ##Network_Port##
Detailed Authentication Information:
Logon Process: ##Logon_Process##
Authentication Package: ##Authentication_Package##
Transited Services: ##Transited_Services##
Package Name (NTLM only): ##Package_Name##
Key Length: ##Key_Length##
Output of re.findall() on Windows with Python 2.7.14:
[('time_field', '05/02/2011 03:47:12 PM'), ('time_field', 'January 05-11 03:47:12 PM'), ('time_field', 'Jan 27 03:47:12 PM'), ('LogName', 'Security'), ('SourceName', '##Source_Name##'), ('EventCode', '4624'), ('EventType', '##Event_Type##'), ('Type', '##Type##'), ('ComputerName', '##Computer_Name##'), ('TaskCategory', '##Task_Category##'), ('OpCode', '##OpCode##'), ('RecordNumber', '##Record_Number##'), ('Keywords', '##Keyword_Success##'), ('Message', 'An account was successfully logged on.'), ('Security ID', '##Domain##\\SYSTEM'), ('Account Name', '##Computer_Name##'), ('Account Domain', '##Domain##'), ('Logon ID', '0x##System_Logon_Id##'), ('Logon Type', '##Logon_Type##'), ('Restricted Admin Mode', '##Restricted_Admin_Mode##'), ('Virtual Account', '##Virtual_Account##'), ('Elevated Token', '##Elevated_Token##'), ('Impersonation Level', '##Impersonation_Level##'), ('Security ID', '##Domain##\\##User_Name##'), ('Account Name', '##User_Name##'), ('Account Domain', '##Domain##'), ('Logon ID', '0x##Logon_Id##'), ('Linked Logon ID', '##Linked_Logon_Id##'), ('Network Account Name', '##User_Name2##'), ('Network Account Domain', '##Domain2##'), ('Logon ##GUID##', '##Logon_Guid##'), ('Process ID', '0x##Process_Id##'), ('Process Name', '##Process_Name##'), ('Workstation Name', '##Computer_Name##'), ('Source Network Address', '##Network_Ip##'), ('Source Port', '##Network_Port##'), ('Logon Process', '##Logon_Process##'), ('Authentication Package', '##Authentication_Package##'), ('Transited Services', '##Transited_Services##'), ('Package Name (NTLM only)', '##Package_Name##'), ('Key Length', '##Key_Length##')]
Output on Linux with Python 2.7.6:
[('time_field', '05/02/2011 03:47:12 PM'), ('time_field', 'January 05-11 03:47:12 PM'), ('time_field', 'Jan 27 03:47:12 PM'), ('LogName', 'Security'), ('SourceName', '##Source_Name##'), ('EventCode', '4624'), ('EventType', '##Event_Type##'), ('Type', '##Type##'), ('ComputerName', '##Computer_Name##'), ('TaskCategory', '##Task_Category##'), ('OpCode', '##OpCode##'), ('RecordNumber', '##Record_Number##'), ('Keywords', '##Keyword_Success##'), ('Message', 'An account was successfully logged on.'), ('Subject', ''), ('Security ID', '##Domain##\\SYSTEM'), ('Account Name', '##Computer_Name##'), ('Account Domain', '##Domain##'), ('Logon ID', '0x##System_Logon_Id##'), ('Logon Information', ''), ('Logon Type', '##Logon_Type##'), ('Restricted Admin Mode', '##Restricted_Admin_Mode##'), ('Virtual Account', '##Virtual_Account##'), ('Elevated Token', '##Elevated_Token##'), ('Impersonation Level', '##Impersonation_Level##'), ('New Logon', ''), ('Security ID', '##Domain##\\##User_Name##'), ('Account Name', '##User_Name##'), ('Account Domain', '##Domain##'), ('Logon ID', '0x##Logon_Id##'), ('Linked Logon ID', '##Linked_Logon_Id##'), ('Network Account Name', '##User_Name2##'), ('Network Account Domain', '##Domain2##'), ('Logon ##GUID##', '##Logon_Guid##'), ('Process Information', ''), ('Process ID', '0x##Process_Id##'), ('Process Name', '##Process_Name##'), ('Network Information', ''), ('Workstation Name', '##Computer_Name##'), ('Source Network Address', '##Network_Ip##'), ('Source Port', '##Network_Port##'), ('Detailed Authentication Information', ''), ('Logon Process', '##Logon_Process##'), ('Authentication Package', '##Authentication_Package##'), ('Transited Services', '##Transited_Services##'), ('Package Name (NTLM only)', '##Package_Name##'), ('Key Length', '##Key_Length##')]
Extra fields that are generated on Linux but NOT on Windows:
('Subject', '')
('Logon Information', '')
('New Logon', '')
('Process Information', '')
('Network Information', '')
('Detailed Authentication Information', '')
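The extra pairs can also be found mechanically by diffing the two result lists. This is a minimal sketch: the helper name `extra_pairs` is hypothetical, and the two short lists are only excerpts of the outputs shown above, not the full results.

```python
def extra_pairs(reference, other):
    """Return the pairs in `other` that have no counterpart in `reference`.

    Duplicates are respected: each pair in `reference` cancels at most one
    identical pair in `other`, and the original order is preserved.
    """
    remaining = list(reference)
    extras = []
    for pair in other:
        if pair in remaining:
            remaining.remove(pair)
        else:
            extras.append(pair)
    return extras


# Excerpts of the Windows and Linux outputs from above.
windows_pairs = [
    ('Message', 'An account was successfully logged on.'),
    ('Security ID', '##Domain##\\SYSTEM'),
]
linux_pairs = [
    ('Message', 'An account was successfully logged on.'),
    ('Subject', ''),
    ('Security ID', '##Domain##\\SYSTEM'),
]

print(extra_pairs(windows_pairs, linux_pairs))
```

A list-based diff is used instead of a set difference because keys such as 'Security ID' legitimately occur more than once in the output.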
My questions are:
- Is it possible to get different output from the same regex on different machines?
- Or is the difference caused by the different Python versions on the two machines?
- What should I keep in mind if I want to support both platforms?
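To compare the two environments, it may help to record on each machine the interpreter version and the line endings actually present in the input. The sketch below is only an illustration: the helper name `describe_input` and the sample bytes are hypothetical, and whether line endings play any role here is an assumption that has not been confirmed.

```python
import platform


def describe_input(raw):
    """Summarise the interpreter and the line endings found in raw file bytes."""
    crlf = raw.count(b"\r\n")
    return {
        "python": platform.python_version(),
        "system": platform.system(),
        "crlf_lines": crlf,                       # lines ending in \r\n
        "lf_only_lines": raw.count(b"\n") - crlf,  # lines ending in bare \n
    }


# Hypothetical sample: one CRLF-terminated line and one LF-terminated line.
sample = b"Subject:\r\nLogon Type: 2\n"
print(describe_input(sample))
```

For a real file, the bytes would come from opening it in binary mode ("rb"), so that no newline translation happens before counting.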
Note: my question is not about whether the regex is right or wrong. Debugging it would take more time, and I know it is not optimized or neat. I only want to understand the reason for the difference and what I should keep in mind.
Update: https://regex101.com/r/rDkBxN/1 gives the same result as Windows.
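One thing that might be worth ruling out is the input itself rather than the regex engine. The sketch below runs the regex from above on the same two lines, once with LF and once with CRLF line endings. It rests on assumptions: the helper `extract_pairs`, the merging of the quoted and unquoted groups, and the `strip()` call are hypothetical stand-ins for the tool's post-processing, which is not shown in this question, and it is not confirmed that the real file contains CRLF line endings.

```python
import re

# The regex from the question, unchanged.
PATTERN = re.compile(
    r'''([^,;\n\v{}<>\t=:\[\]\"\']+?)[=:][ \t]*(?:\"((?:[^\"\\]|\\.)*)\"|([^\t =:\"\n\v\t\{\}\[\]<>](?!(?!,)\S+[:=])(?:[^\n\v\t\{\}\[\]=:<>](?!(?!,)\S+[:=]))*))'''
)


def extract_pairs(text):
    """Return (key, value) pairs; the strip() step is an assumed post-processing."""
    pairs = []
    for key, quoted, bare in PATTERN.findall(text):
        pairs.append((key.strip(), (quoted or bare).strip()))
    return pairs


lf_text = "Subject:\nSecurity ID: SYSTEM\n"
crlf_text = lf_text.replace("\n", "\r\n")  # same content, CRLF line endings

print(extract_pairs(lf_text))
print(extract_pairs(crlf_text))
```

In this sketch the value character classes exclude \n but not \r, so a header line such as "Subject:" followed by \r\n matches with a value of "\r", which becomes an empty string after stripping. With LF only, the header line does not match at all.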