Regex python encoded xml

Question

I have an html encoded xml payload where I'd like to use regular expressions to extract a certain piece of that xml and write that to a file. I know it's generally not good practice to use regex for xml but I think this a special use case.

Anyway here is a sample encoded xml:

&lt;root&gt;
    &lt;parent&gt;
        &lt;test1&gt;
            &lt;another&gt;
                &lt;subelement&gt;
                    &lt;value&gt;hello&lt;/value&gt;
                &lt;/subelement&gt;
            &lt;/another&gt;
        &lt;/test1&gt;
    &lt;/parent&gt;
&lt;/root&gt;

I ultimately want my result to be:

<test1>
    <another>
        <subelement>
            <value>hello</value>
        </subelement>
    </another>
</test1>

Here is my implementation in python to extract out all the text between the <test1> and </test1> inclusively:

import html
import re

file_stream = open('/path/to/test.xmp', 'r')
raw_data = file_stream.read()
escaped_raw_data = html.unescape(raw_data)

result = re.search(r"<test1[\s\S]*?<\/test1>", escaped_raw_data)

However I get no matches for result, what am I doing wrong? How to accomplish my goal?

In your regex, Instead of `.`, use `[\s\S]` because `.` does not match newlines — Gurmanjot Singh, Feb 04 '18 at 03:18
@Gurman `result = re.search(r"", escaped_raw_data)` still I get a result of `None` — mosawi, Feb 04 '18 at 03:19
Yes. But, I am not proficient in Python. Was just trying to help you with the regex. Let python experts also see your question and attempt answers — Gurmanjot Singh, Feb 04 '18 at 03:26
Try `result = re.search(r".*<\/test1>", escaped_raw_data, re.DOTALL)`. — accdias, Feb 04 '18 at 03:43
You should _not_ use regular expressions to search through XML/HTML. Please consider using an XML parser (say, `BeautifulSoup`). https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — DYZ, Feb 04 '18 at 03:43

score 1 · Accepted Answer · answered Feb 04 '18 at 03:54

This works for me:

import html
import re

raw_data = '''
&lt;root&gt;
    &lt;parent&gt;
        &lt;test1&gt;
            &lt;another&gt;
                &lt;subelement&gt;
                    &lt;value&gt;hello&lt;/value&gt;
                &lt;/subelement&gt;
            &lt;/another&gt;
        &lt;/test1&gt;
    &lt;/parent&gt;
&lt;/root&gt;
'''

escaped_raw_data = html.unescape(raw_data)

result = re.search(r'(<test1>.*</test1>)', escaped_raw_data, re.DOTALL)

if result:
    print(result.group(0))

Regex python encoded xml

1 Answers1