
I need to extract fields from an XML document using Python's re module.

Here is an example of the string I'll be applying the regex to:

payload2 = '<CheckEventRequest><EventList count="1"><Event event="0x80" path="\\c2_emcvnx.ntaplion.prv\\CHECK$\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db" flag="0x2" protocol="0" server="C2_EMCVNX" share="demoshare1" clientIP="172.26.64.233" serverIP="172.26.64.225" timeStamp="0x536C47EF0003C836" userSid="S-1-5-21-665520413-3518186362-2792099713-500" ownerSid="S-1-5-21-665520413-3518186362-2792099713-500" fileSize="0x11000" desiredAccess="0x0" createDispo="0x0" ntStatus="0x0" relativePath="\\C2_EMCVNX\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db"/></EventList></CheckEventRequest>'

Some of the fields shown above, such as 'clientIP', may not always be present.

The regex I have come up with is:

import re

PAT3 = re.compile(r'.+(event="(?P<event_code>\S*?)"){1}[\S\s]+?(path="(?P<path>[\s\S]+?)"){0,1}[\S\s]+(clientIP="(?P<client_ip>[\S\s]+?)"){0,1}.*', re.UNICODE)

m1 = PAT3.search(payload2)
print(m1.groupdict())

Output:

{'event_code': '0x80', 'path': '\\c2_emcvnx.ntaplion.prv\\CHECK$\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db', 'client_ip': None}

The client_ip group comes back as None even though clientIP is present in the string. If I put {1} instead of {0,1} after (clientIP="(?P<client_ip>[\S\s]+?)"), it is captured. However, that breaks the case where clientIP is absent.

Any idea how I can make the regex work in both cases, whether a field is present or not?

thefourtheye
amrka

2 Answers


First, I have to give you the standard warning against parsing XML with regular expressions, but if you're dead set on that…

You probably don't want to be using [\S\s], as that'll match anything, including going past the quote. To prevent that, you made it non-greedy, but there's a better solution: just make it not match quotes: [^"]. Also note that you can replace {0,1} with ?.
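As a rough illustration of those two suggestions, here is a minimal sketch that uses [^"] for the attribute values and ? for the optional parts. It goes one step further than the advice above by wrapping the "skip ahead" part inside each optional group, so a missing attribute simply leaves its group as None. The names EVENT_RE and parse_event and the shortened sample strings are made up for this example, and it assumes the attributes appear in the order event, path, clientIP and that no attribute value contains a > character.

```python
import re

# Hypothetical pattern: each optional attribute sits in its own optional
# non-capturing group that lazily skips forward inside the same tag.
EVENT_RE = re.compile(
    r'<Event\b[^>]*?\bevent="(?P<event_code>[^"]*)"'
    r'(?:[^>]*?\bpath="(?P<path>[^"]*)")?'
    r'(?:[^>]*?\bclientIP="(?P<client_ip>[^"]*)")?'
)


def parse_event(payload):
    """Return the named groups as a dict, or None if there is no <Event> tag."""
    m = EVENT_RE.search(payload)
    return m.groupdict() if m else None


# Shortened sample payloads (assumed for illustration, not the full originals).
with_ip = r'<Event event="0x80" path="\\srv\share\Thumbs.db" flag="0x2" clientIP="172.26.64.233" serverIP="172.26.64.225"/>'
without_ip = r'<Event event="0x80" path="\\srv\share\Thumbs.db" flag="0x2" serverIP="172.26.64.225"/>'

print(parse_event(with_ip))
print(parse_event(without_ip))
```

Because the groups are tried in a fixed order, an attribute that appears earlier in the tag than the pattern expects would be missed. If the attribute order can vary, searching for each attribute separately is the safer route.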

icktoofay

My advice:

Stop trying to do it all in one big one-line regex.

It's very simple to break the work into one small pattern per field, which makes the code both more readable and easier to extend.

My version of your code:

payloads = [
    '<CheckEventRequest><EventList count="1"><Event event="0x80" path="\\c2_emcvnx.ntaplion.prv\\CHECK$\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db" flag="0x2" protocol="0" server="C2_EMCVNX" share="demoshare1" serverIP="172.26.64.225" timeStamp="0x536C47EF0003C836" userSid="S-1-5-21-665520413-3518186362-2792099713-500" ownerSid="S-1-5-21-665520413-3518186362-2792099713-500" fileSize="0x11000" desiredAccess="0x0" createDispo="0x0" ntStatus="0x0" relativePath="\\C2_EMCVNX\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db"/></EventList></CheckEventRequest>',
    '<CheckEventRequest><EventList count="1"><Event event="0x80" path="\\c2_emcvnx.ntaplion.prv\\CHECK$\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db" flag="0x2" protocol="0" server="C2_EMCVNX" share="demoshare1" clientIP="172.26.64.233" serverIP="172.26.64.225" timeStamp="0x536C47EF0003C836" userSid="S-1-5-21-665520413-3518186362-2792099713-500" ownerSid="S-1-5-21-665520413-3518186362-2792099713-500" fileSize="0x11000" desiredAccess="0x0" createDispo="0x0" ntStatus="0x0" relativePath="\\C2_EMCVNX\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db"/></EventList></CheckEventRequest>'
]


import re

# One small pattern per field, listed in the order the matches are printed.
# A missing attribute only affects its own entry.
ipv4 = r'clientIP="(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"'
pat3 = [
    ('path', r'path="(.*?)"'),
    ('client_ip', ipv4),
    ('event_code', r'event="(0[xX][0-9a-fA-F]+?)"'),
]


def scrape_xml(payload):
    """Search for each field separately; absent fields map to None."""
    return {
        name: re.search(regex, payload, flags=re.UNICODE)
        for name, regex in pat3
    }


for p in payloads:
    print()  # blank line between payloads
    for match in scrape_xml(p).values():
        if match is not None:
            print(match.group(0))

Output:

path="\c2_emcvnx.ntaplion.prv\CHECK$\demoshare1\Engineering\Benchmarking\Thumbs.db"
event="0x80"

path="\c2_emcvnx.ntaplion.prv\CHECK$\demoshare1\Engineering\Benchmarking\Thumbs.db"
clientIP="172.26.64.233"
event="0x80"
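If the set of fields keeps growing, a variation on the same idea is to collect every name="value" pair from the Event tag into a dict and look fields up by name, so absent attributes are handled by dict.get. This is only a sketch under some assumptions: ATTR_RE, event_attributes and the shortened sample string are names and data invented here, attribute values are assumed to be double-quoted, and no value is assumed to contain > or ".

```python
import re

# Hypothetical helper pattern: one name="value" pair.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def event_attributes(payload):
    """Return every name="value" pair inside the first <Event ...> tag."""
    # \b keeps <EventList ...> from being mistaken for <Event ...>.
    tag = re.search(r'<Event\b[^>]*>', payload)
    if tag is None:
        return {}
    return dict(ATTR_RE.findall(tag.group(0)))


# Shortened sample payload (assumed for illustration).
sample = r'<EventList count="1"><Event event="0x80" path="\\srv\share\Thumbs.db" serverIP="172.26.64.225"/></EventList>'

attrs = event_attributes(sample)
print(attrs.get('event'))
print(attrs.get('path'))
print(attrs.get('clientIP'))  # absent attributes come back as None
```

The trade-off is that nothing is validated: the per-field patterns above can insist on a hex event code or a well-formed IPv4 address, while this version accepts whatever is between the quotes.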

Vasili Syrakis
  • Works like a charm and is also very extensible. Thanks! – amrka May 12 '14 at 18:13
  • You are welcome. I learnt a lot of what I know by using the PyCharm IDE, trying to make every script into a class with clear reusable functions, and reading PEP8. I recommend the same, it will save you a lot of headache on bigger projects :) – Vasili Syrakis May 12 '14 at 22:05