Loop through file content and extract fields using regex in Python

Question

I want to loop through a file, extract specific numbers using regex, and store them in a list called event_id. I then want to compare each number to the keys of a dictionary, and if there is a match, the program should print Matched ID: 1102. I got this far.

But I am also trying to extract the date and time at which each event occurred and store it in a list. So the screen printout should look like this:

Matched ID: 1102
Date and time of event: 2019-08-27 17:16:28.543879

Matched ID: 4611
Date and time of event: 2019-08-27 12:14:08.573156

The file data I am extracting from looks like this:

<EventID Qualifiers="">1102</EventID>
<TimeCreated SystemTime="2019-08-27 17:16:28.543879"></TimeCreated>

<EventID Qualifiers="">4611</EventID>
<TimeCreated SystemTime="2019-08-27 17:16:28.543879"></TimeCreated>

This is my code:

import os
import re

evtxlogs = '/home/user/evtx_logs/'

event_id_regex = r'\W(\d*)\W/EventID\W'  
event_date_regex = r'(\d.*.\d*)\D\W\W\W.imeCreated>'

event_id = []  
event_date = []

eventdict = {'1102':{'count':0},'4611':{'count':0},'4624':{'count':0}}

for dirpath, dirnames, filenames in os.walk(evtxlogs): 

    for xml_file in filenames: 
        if xml_file.lower().endswith('.xml'): 
         
            with open(os.path.join(dirpath,xml_file), 'r') as f:
                data = f.read()
                event_id = re.findall(event_id_regex, data) 
                event_date = re.findall(event_date_regex, data) 
            
                for event_id in event_id:

                    if event_id in eventdict: 
                        print(f"Matched ID: {event_id}")
                        print(f"Date and time of event: {event_date}")

score 2 · Accepted Answer · answered Mar 26 '21 at 09:57

If you have an XML file, it would be better to parse it as XML to get the information you're after, since processing will likely be faster and the code more robust. See this answer for a simple demonstration of parsing XML with Python.
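As a rough illustration of the XML route, here is a minimal sketch using the standard library's xml.etree.ElementTree. It assumes each EventID/TimeCreated pair sits inside an `<Event>` element under a single root; the `<Events>` and `<Event>` wrappers and the `extract_events` helper are hypothetical, since the question only shows the two inner elements. Real exported event logs may also nest these elements more deeply or declare an XML namespace, in which case the lookups would need adjusting.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the <Events>/<Event> wrappers are assumed,
# only the two inner elements appear in the question.
SAMPLE = """<Events>
  <Event>
    <EventID Qualifiers="">1102</EventID>
    <TimeCreated SystemTime="2019-08-27 17:16:28.543879"></TimeCreated>
  </Event>
  <Event>
    <EventID Qualifiers="">4611</EventID>
    <TimeCreated SystemTime="2019-08-27 17:16:28.543879"></TimeCreated>
  </Event>
</Events>"""


def extract_events(xml_text):
    """Return a list of (event_id, system_time) tuples, one per <Event>."""
    root = ET.fromstring(xml_text)
    results = []
    for event in root.iter("Event"):
        event_id = event.findtext("EventID")      # text of the direct child
        time_created = event.find("TimeCreated")  # element carrying the attribute
        if event_id is None or time_created is None:
            continue  # skip events missing either field
        results.append((event_id, time_created.get("SystemTime")))
    return results


events = extract_events(SAMPLE)
for event_id, system_time in events:
    print(f"Matched ID: {event_id}")
    print(f"Date and time of event: {system_time}")
```

Because each ID is read from the same `<Event>` element as its timestamp, the two values stay paired without relying on line order.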

However, going only on the information provided, and assuming the two elements you're after are adjacent to each other, you could simply extend your pattern to also capture TimeCreated and extract both values using groups, like so:

import re

regex = r"^<EventID Qualifiers=\"\">(\d+)</EventID>$.^<TimeCreated SystemTime=\"(.+?)\""

test_str = ("<EventID Qualifiers=\"\">1102</EventID>\n"
    "<TimeCreated SystemTime=\"2019-08-27 17:16:28.543879\"></TimeCreated>\n\n"
    "<EventID Qualifiers=\"\">4611</EventID>\n"
    "<TimeCreated SystemTime=\"2019-08-27 17:16:28.543879\"></TimeCreated>")

matches = re.finditer(regex, test_str, re.MULTILINE | re.DOTALL)

for match in matches:
    # Groups are numbered from 1; group 0 is the whole match.
    for groupNum in range(1, len(match.groups()) + 1):
        print(f"Group {groupNum}: {match.group(groupNum)}")

Output:

Group 1: 1102
Group 2: 2019-08-27 17:16:28.543879
Group 1: 4611
Group 2: 2019-08-27 17:16:28.543879
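To connect this back to the question, the combined pattern could be dropped into the original directory walk roughly as follows. This is only a sketch: the helper names `match_events` and `scan_logs` are made up for illustration, and it assumes every EventID line is immediately followed by its TimeCreated line with no leading indentation, as in the sample data.

```python
import os
import re

# Same pattern as above, compiled once. Assumes the two elements sit on
# adjacent, unindented lines.
EVENT_REGEX = re.compile(
    r'^<EventID Qualifiers="">(\d+)</EventID>$.^<TimeCreated SystemTime="(.+?)"',
    re.MULTILINE | re.DOTALL,
)


def match_events(data, eventdict):
    """Return (event_id, event_date) pairs whose ID is a key in eventdict."""
    matched = []
    for m in EVENT_REGEX.finditer(data):
        event_id, event_date = m.group(1), m.group(2)
        if event_id in eventdict:
            eventdict[event_id]['count'] += 1  # keep the per-ID tally
            matched.append((event_id, event_date))
    return matched


def scan_logs(root_dir, eventdict):
    """Walk root_dir and print every matching event found in .xml files."""
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if name.lower().endswith('.xml'):
                with open(os.path.join(dirpath, name), 'r') as f:
                    for event_id, event_date in match_events(f.read(), eventdict):
                        print(f"Matched ID: {event_id}")
                        print(f"Date and time of event: {event_date}")


# Small in-memory demo using the sample data from the question.
sample = (
    '<EventID Qualifiers="">1102</EventID>\n'
    '<TimeCreated SystemTime="2019-08-27 17:16:28.543879"></TimeCreated>\n\n'
    '<EventID Qualifiers="">4611</EventID>\n'
    '<TimeCreated SystemTime="2019-08-27 17:16:28.543879"></TimeCreated>'
)
eventdict = {'1102': {'count': 0}, '4611': {'count': 0}, '4624': {'count': 0}}
results = match_events(sample, eventdict)
```

Capturing the ID and the timestamp in a single match keeps them paired, which avoids the problem in the original code where two independent findall calls produced unrelated lists. For real files, `scan_logs('/home/user/evtx_logs/', eventdict)` would apply the same matching to each .xml file.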
