Multiple matches in non-greedy XML (Python regex)

Question

I know this topic is asked a lot but I couldn't find an answer to my question:

In the attached image there are many different buffers, and I want to match only the buffers that have "Lut" in their names (notice there are 2 matches in the string in the image). The problem is that each match also includes the buffers that come before the one I want.

I'm pretty new to regex and still trying to learn as much as I can, so any explanation will be appreciated.

Thank you! :)

The string is attached for your convenience (if needed):

<?xml version="1.0" encoding="utf-8"?>
<pimp xmlns:dt="urn:schemas-microsoft-com:datatypes">
    <dllPath>C:\ReplayCode\Apps\Pimp</dllPath>
    <buffers>   
    <buffer name="InputMask">
            <width>5120</width>
            <height>3072</height>
            <data>UCHAR</data>
            <channels>1</channels>
            <type>IMG</type>
    </buffer>
    <buffer name="MaskErode">
            <width>5120</width>
            <height>3072</height>
            <data>UCHAR</data>
            <channels>1</channels>
            <type>IMG</type>
    </buffer>
    <buffer name="BlablaLutBla">
            <width>256</width>
            <height>256</height>
            <data>UCHAR</data>
            <channels>1</channels>
            <type>IMG</type>
    </buffer>
    <buffer name="MaskClose">
            <width>5120</width>
            <height>3072</height>
            <data>UCHAR</data>
            <channels>1</channels>
            <type>IMG</type>
    </buffer>
    <buffer name="InputVis">
            <width>5120</width>
            <height>3072</height>
            <data>UCHAR</data>
            <channels>3</channels>
            <type>IMG</type>
    </buffer>   
        <buffer name="AddMaskEdge">
            <width>5120</width>
            <height>3072</height>
            <data>UCHAR</data>
            <channels>1</channels>
            <type>IMG</type>
    </buffer>
    <buffer name="EdgeVis">
            <width>5120</width>
            <height>3072</height>
            <data>UCHAR</data>
            <channels>3</channels>
            <type>IMG</type>
    </buffer>       
        <buffer name="GrayEdge">
            <width>5120</width>
            <height>3072</height>
            <data>UCHAR</data>
            <channels>1</channels>
            <type>IMG</type>
    </buffer>
    <buffer name="EdgeMaskMulThreshold">
            <width>5120</width>
            <height>3072</height>
            <data>UCHAR</data>
            <channels>1</channels>
            <type>IMG</type>
    </buffer>
    <buffer name="MaskMulEdge">
            <width>5120</width>
            <height>3072</height>
            <data>UCHAR</data>
            <channels>1</channels>
            <type>IMG</type>
    </buffer>   
    </buffers>

The regex I tried is this:

<buffer name=".*?Lut.*?">.*?<\/buffer>

And I expected 2 matches:

<buffer name="BlablaLutBla">
            <width>256</width>
            <height>256</height>
            <data>UCHAR</data>
            <channels>1</channels>
            <type>IMG</type>
    </buffer>

and

<buffer name="2ndLutBlabla">
            <width>256</width>
            <height>256</height>
            <data>UCHAR</data>
            <channels>1</channels>
            <type>IMG</type>
    </buffer>

You need a [xml parser](https://docs.python.org/2/library/xml.etree.elementtree.html) — luoluo, Sep 29 '15 at 08:26
The famous [You can't parse \[X\]HTML with regex](http://stackoverflow.com/a/1732454/1099230) — luoluo, Sep 29 '15 at 08:30
ok so no regex for me... :) thanks! Sorry for asking this again. Was pretty sure it's a fairly simple task for regex — Omer, Sep 29 '15 at 08:34
Maybe you should paste the regex you've tried here too. The expected output as well. — luoluo, Sep 29 '15 at 08:38
@YOU Thanks that worked! Is there a way to use `[^..]*` on a string (not a single char)? `[^( — Omer, Sep 29 '15 at 08:51
@Omer, I advise you to avoid regex when parsing marked-up documents (getting values from HTML/XML included): you might have to fix it sooner or later, and without a deep understanding of how regex works you will find yourself in big trouble. Even `[^;]*$` might quickly cause catastrophic backtracking with larger documents. Use [`xml.etree.ElementTree`, see demo here](https://ideone.com/2k9Vhs). — Wiktor Stribiżew, Sep 29 '15 at 08:54
@stribizhev does `xml.etree.ElementTree` contain function like re.sub (in order to change the xml file)? That is my main goal in my project.. — Omer, Sep 29 '15 at 09:00
Replacing existing values is as easy as [`buffer[0].text = "234"`](https://ideone.com/hhVCoU) — Wiktor Stribiżew, Sep 29 '15 at 09:04
@Omer: Are you working with an XML file, or XML string? I will post an answer showing how to modify XML attributes and inner texts. — Wiktor Stribiżew, Sep 29 '15 at 09:12
@stribizhev I use a file, but it's not really a problem to load it into a string (which I do when using regex) — Omer, Sep 29 '15 at 09:36
@Omer: I posted an answer of mine, please have a look and feel free to drop a comment. — Wiktor Stribiżew, Sep 29 '15 at 09:51

score 1 · Accepted Answer · answered Sep 29 '15 at 08:44

You can use BeautifulSoup to parse your tag.

import re
from bs4 import BeautifulSoup

input_xml = ''' some xml '''
soup = BeautifulSoup(input_xml, "lxml-xml")
print(soup.find_all('buffer', attrs={"name": re.compile('Lut')}))

If you do not have this installed already:

pip install beautifulsoup4
pip install lxml

score 1 · Answer 2 · answered Sep 29 '15 at 09:27

Since you need to manipulate the data inside an XML document, use an XML parser. An answer above already shows how to instantiate the XML tree, but does not cover modifying its structure.

BTW, if you instantiate the XML from a string, use ET.fromstring

import xml.etree.ElementTree as ET
...
xml = "<<YOUR XML STRING>>" 
root = ET.fromstring(xml)

Else, when reading from a file:

tree = ET.parse('file.xml')
root = tree.getroot()

Then you can make replacements like the following (a regex is fine here if necessary, because at this point you are dealing with plain text values rather than markup):

for buffer in root.findall("buffers/buffer"): 
    if "Lut" in buffer.get("name"):
        buffer.find('width').text = "100"    # Set inner text of buffer child named 'width'
        buffer[1].text = "125"               # Set the 2nd child inner text
        buffer.set('type', 'MY_TYPE')        # Add an attribute to buffer

You can print the updated XML using .dump():

ET.dump(root)                                # Print updated XML (dump() writes to stdout itself and returns None)

Or write an updated DOM to the file (if you are working with a file):

tree.write('output.xml')

See IDEONE demo showing modifications on an XML string.
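
The fragments above can be put together into one runnable piece. This is only a minimal sketch, assuming Python 3 and a shortened, made-up sample in place of the full document; it uses `ET.tostring(..., encoding="unicode")` rather than `.dump()` so that the result ends up in a string.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample with the same shape as the XML in the question.
xml = """<pimp>
    <buffers>
        <buffer name="InputMask">
            <width>5120</width>
            <height>3072</height>
        </buffer>
        <buffer name="BlablaLutBla">
            <width>256</width>
            <height>256</height>
        </buffer>
    </buffers>
</pimp>"""

root = ET.fromstring(xml)

for buffer in root.findall("buffers/buffer"):
    # Default to "" so a buffer without a name attribute is simply skipped.
    if "Lut" in buffer.get("name", ""):
        buffer.find("width").text = "100"   # set text of the child named 'width'
        buffer[1].text = "125"              # set text of the 2nd child ('height' here)
        buffer.set("type", "MY_TYPE")       # add an attribute to the buffer element

# Serialize the modified tree back to a str (Python 3 only).
updated = ET.tostring(root, encoding="unicode")
print(updated)
```

Only the buffer whose name contains "Lut" is touched; the other buffer keeps its original values.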

Wiktor Stribiżew

Thank you! Great explanation! :) – Omer Sep 29 '15 at 09:58
I found it difficult to change values in the tree to strings containing `<` and `>`. For instance: Changing width to: `()` Is there a solution except using str.replace? – Omer Sep 29 '15 at 10:14
[This is how you can do it](https://ideone.com/i3AafU). If `(<ARG LutResolutionWidth>)` is what you need to be in XML, of course. – Wiktor Stribiżew Sep 29 '15 at 10:22
Actually I want to replace the `&lt;` / `&gt;` with `<` and `>` respectively – Omer Sep 29 '15 at 10:29
`&lt;` and `&gt;` are *XML entities* that represent `<` and `>` respectively. If you write `<` directly, the XML will become invalid. `>` can be used, but it is not best practice and will cause lots of trouble later. – Wiktor Stribiżew Sep 29 '15 at 10:33
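
To illustrate the escaping discussed in these comments, here is a small sketch, assuming Python 3 and a made-up one-element document. It suggests that the entities only exist in the serialized text: ElementTree writes them when saving and turns them back into `<` and `>` when parsing, so no `str.replace` should be needed on the values themselves.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-buffer document, just for the demonstration.
root = ET.fromstring("<buffer><width>256</width></buffer>")

# Assign the raw text; the value is taken from the comment above.
root.find("width").text = "(<ARG LutResolutionWidth>)"

# On serialization ElementTree escapes < and > so the XML stays well-formed.
serialized = ET.tostring(root, encoding="unicode")
print(serialized)

# Parsing the serialized text gives the original characters back.
round_trip = ET.fromstring(serialized).find("width").text
print(round_trip)
```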

score 0 · Answer 3 · answered Sep 29 '15 at 08:45

You might want to use XML parsing in Python instead; it is quite easy:

import xml.etree.ElementTree as ET
tree = ET.parse(xml)  # xml is a file name or file object here, not the XML string
for buffer in tree.findall("buffers/buffer"): 
    if "Lut" in buffer.get("name"):
        # do your stuff
        pass

Jiri

score 0 · Answer 4 · answered Sep 29 '15 at 09:05

<buffer name="[^"]*Lut[^"]*">.*?<\/buffer>

See Demo

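In your regex, `<buffer name=".*?Lut` matches from the first `<buffer` up to the first `Lut`, because `.` can also match `"` and `>` and so runs past the end of the name attribute into the following buffers. (Non-greedy matching does work as designed: it stops at the first `Lut`, whereas greedy matching would run on to the last `Lut`.) `[^"]*` cannot cross the closing quote, so the match has to start at a buffer whose own name contains `Lut`.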
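
The difference can be checked with a short sketch. It assumes Python 3, a shortened made-up sample (the buffer names are taken from the question), and the `re.DOTALL` flag so that `.` also matches the newlines inside each buffer.

```python
import re

# Shortened, hypothetical sample with the same shape as the question's XML.
xml = """<buffers>
    <buffer name="InputMask">
        <width>5120</width>
    </buffer>
    <buffer name="BlablaLutBla">
        <width>256</width>
    </buffer>
    <buffer name="MaskClose">
        <width>5120</width>
    </buffer>
    <buffer name="2ndLutBlabla">
        <width>256</width>
    </buffer>
</buffers>"""

# re.DOTALL lets "." match newlines, which both patterns need for multi-line buffers.
original = re.compile(r'<buffer name=".*?Lut.*?">.*?</buffer>', re.DOTALL)
fixed = re.compile(r'<buffer name="[^"]*Lut[^"]*">.*?</buffer>', re.DOTALL)

original_matches = original.findall(xml)
fixed_matches = fixed.findall(xml)

# Show which opening tag each match starts with.
for label, matches in (("original", original_matches), ("fixed", fixed_matches)):
    for m in matches:
        print(label, "starts with:", m.splitlines()[0])
```

With the original pattern each match starts at a buffer without "Lut" in its name and swallows it; with `[^"]*` each match starts at the "Lut" buffer itself. This is still only pattern matching on text, so the parser-based answers remain the more robust choice.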

Kerwin
