Python regEx to find positions of xml data

Question

I want to extract the positions of the data in an XML document, using a Python regex or any other method. The data can be numbers, words, IP addresses, or even tags.

PUT /mg/co.xml HTTP/1.1
Host: 19.16.7.59
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:31.0) Gehko/20100101 Firefox/31.0

<?xml version="1.0" encoding="UTF-8"?>
<!-- THIS DATA SUBJECT TO DISCLAIMER(S) INCLUDED WITH THE PRODUCT OF ORIGIN. -->
<io:zzzz xmlns:io="http://kfj/ledm/iomgmt/2008/11/30" xmlns:dd="http://jkfhkj/dictionaries/1.0/" xmlns:dd3="http://jfja/dictionaries/2009/04/06" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://jcjhjk/ledm/iomgmt/2008/11/30 ../../schemas/gfjbj.xsd">
    <io:aaaa>
        <dd3:bbbb>hjgjg</dd3:bbbb>
    </io:aaaa>
    <io:ccccc>
        <io:dddd>
            <dd3:ffff>15.34.2.5</dd3:ffff>
        </io:dddd>
        <io:eeee>
            <dd3:gggg>67</dd3:gggg>
        </io:eeee>
        <io:iiii>
            <dd3:jjjj><script>jgfjkgkj</script></dd3:jjjj>
        </io:iiii>
    </io:cccc>
</io:zzzz>

Expected Output:

(the positions given below are approximate)

hjgjg [start off = 59, end off = 64]
15.34.2.5 [start off = 74, end off = 84]
67 [start off = 95, end off = 97]
<script>jgfjkgkj</script> [start off = 102, end off = 124]

Can anybody please help me sort this out?

Is there any reason for using regex? You would be better off with an XML parser. — Nilesh, Feb 02 '15 at 08:28
For the love of god, please, not another question about using regex to parse xml/html... — Nir Alfasi, Feb 02 '15 at 08:29
possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — , Feb 02 '15 at 08:46
Which XML parser makes it easy to find the positions of XML data? — ragu, Feb 02 '15 at 09:21
And why would you need the positions of the data, rather than the data itself? You can't replace data inside an XML file with other strings that are not exactly the same size: XML is a text format, so if you need to replace parts of the data, you have to rewrite the file. And if you need the positions to get the data itself, you had better just get the data, hadn't you? No, in fact, XML parsers won't spit out the positions, but it is not likely you need that. — jsbueno, Feb 02 '15 at 11:33
I require the positions for httpReq markers. Will an XML parser provide character positions? — ragu, Feb 02 '15 at 11:58
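
On whether a parser can report positions: one option that might fit is the standard library's `xml.parsers.expat`, whose parser objects expose a `CurrentByteIndex` attribute. The following is a minimal sketch, not a definitive solution. It assumes a well-formed UTF-8 body with the HTTP headers already stripped; `text_offsets` and the `body` sample are hypothetical names introduced only for illustration.

```python
import xml.parsers.expat


def text_offsets(xml_bytes):
    """Collect (text, start, end) byte offsets for non-blank character data."""
    found = []
    parser = xml.parsers.expat.ParserCreate()

    def on_text(data):
        # Skip whitespace-only chunks such as indentation between tags.
        if data.strip():
            # CurrentByteIndex is the byte offset of the event being reported.
            start = parser.CurrentByteIndex
            found.append((data, start, start + len(data.encode("utf-8"))))

    parser.CharacterDataHandler = on_text
    parser.Parse(xml_bytes, True)  # True marks this as the final chunk
    return found


# Hypothetical well-formed body, used only for illustration.
body = b"<root><a>hjgjg</a><b>15.34.2.5</b><c>67</c></root>"

for text, start, end in text_offsets(body):
    print(text, start, end)
```

The offsets are byte positions within the XML body, so the length of the request line and headers would have to be added to get positions within the whole request. Two caveats: expat can deliver one text node in several callbacks (for example around newlines or entities), and it raises an error on input that is not well-formed, which appears to include the sample above, where `<io:ccccc>` is closed by `</io:cccc>`.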

score 0 · Answer 1 · answered Feb 02 '15 at 11:27

You should not parse XML with Python's re module, as it may fail at any time. Regex is too weak to understand the specifics of XML. Still, if you don't have any other alternative, try this.

^(?=\s*<dd3:[^>]*>).*?>([^< ]+)<

See demo.

https://regex101.com/r/vD5iH9/40

import re
p = re.compile(r'^(?=\s*<dd3:[^>]*>).*?>([^< ]+)<', re.MULTILINE)
test_str = "PUT /mg/co.xml HTTP/1.1\nHost: 19.16.7.59\nUser-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:31.0) Gehko/20100101 Firefox/31.0\n\n<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!-- THIS DATA SUBJECT TO DISCLAIMER(S) INCLUDED WITH THE PRODUCT OF ORIGIN. -->\n<io:zzzz xmlns:io=\"http://kfj/ledm/iomgmt/2008/11/30\" xmlns:dd=\"http://jkfhkj/dictionaries/1.0/\" xmlns:dd3=\"http://jfja/dictionaries/2009/04/06\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://jcjhjk/ledm/iomgmt/2008/11/30 ../../schemas/gfjbj.xsd\">\n <io:aaaa>\n <dd3:bbbb>hjgjg</dd3:bbbb>\n </io:aaaa>\n <io:ccccc>\n <io:dddd>\n <dd3:ffff>15.34.2.5</dd3:ffff>\n </io:dddd>\n <io:eeee>\n <dd3:gggg>67</dd3:gggg>\n </io:eeee>\n <io:iiii>\n <dd3:jjjj><script>jgfjkgkj</script></dd3:jjjj>\n </io:iiii>\n </io:cccc>\n</io:zzzz>"

# findall returns the captured values (group 1 of every match)
print(p.findall(test_str))
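
Since the question asks for offsets rather than the values themselves, one possible follow-up is to iterate with `finditer` and read the span of the capture group from each match object. This is a minimal sketch using the same pattern; the shortened `sample` request and the `find_value_offsets` helper are illustrative assumptions, not part of the original answer.

```python
import re

# Same pattern as above: a line that starts with a <dd3:...> tag,
# capturing the first run of text that follows a '>'.
PATTERN = re.compile(r'^(?=\s*<dd3:[^>]*>).*?>([^< ]+)<', re.MULTILINE)


def find_value_offsets(text):
    """Return (value, start, end) tuples; end is exclusive, as in slicing."""
    return [(m.group(1), m.start(1), m.end(1)) for m in PATTERN.finditer(text)]


# Hypothetical, shortened request used only for illustration.
sample = (
    "PUT /mg/co.xml HTTP/1.1\n"
    "Host: 19.16.7.59\n"
    "\n"
    "<io:aaaa>\n"
    "    <dd3:bbbb>hjgjg</dd3:bbbb>\n"
    "    <dd3:gggg>67</dd3:gggg>\n"
    "</io:aaaa>\n"
)

for value, start, end in find_value_offsets(sample):
    # The offsets index into the full request text, headers included.
    print(value, start, end)
```

Because the pattern runs over the whole request, the offsets already include the header lines. Note that for the `<dd3:jjjj>` line in the question the capture group would hold only the inner text `jgfjkgkj`, not the surrounding `<script>` tags, because `[^< ]+` cannot match a `<`.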
