Regex multiple expression

Question

I've got the following structure:

<ins rev="REV-NEU" editindex="0">
    <insacc rev="c3ce7877-42bf-4c41-b3c0-fd225ccaf512">eins</insacc>
    <insacc rev="c3ce7877-42bf-4c41-b3c0-fd225ccaf512">zwei</insacc>
    <insacc rev="c3ce7877-42bf-4c41-b3c0-fd225ccaf512">drei</insacc>
<insacc rev="c3ce7877-42bf-4c41-b3c0-fd225ccaf512">vier</insacc>
</ins> 
<del rev="REV-NEU" editindex="1">eins</del> 
<insacc rev="c3ce7877-42bf-4c41-b3c0-fd225ccaf512">fünf</insacc>

With a regex I want to match the ins-tag with multiple insacc-tags (can be 1 or 20) inside.

I tried it with the following regex, but it only matches the last insacc:

<ins rev="[^<]+" editindex="[^<]+">(<(insacc|deldec) rev="[^<]+">([^<]+)</(insacc|deldec)>)+</ins>

Why not use an `XML parser`, like `xml.etree.ElementTree` from the standard library? – alecxe, Jul 31 '14 at 17:05
`Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.` — Gerrat, Jul 31 '14 at 17:18

score 4 · Accepted Answer · answered Jul 31 '14 at 17:08

You should use lxml for this.

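from lxml import etree

# xml_string holds the document; it must have a single root element
xml = etree.fromstring(xml_string)
# select every <ins> that has at least one <insacc> child
ins_tags = xml.xpath('//ins[./insacc]')
for ins_tag in ins_tags:
    # do work, e.g. collect the text of each <insacc> child
    print([acc.text for acc in ins_tag.findall('insacc')])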

Isn't it simple?
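
For readers without lxml installed, roughly the same selection can be sketched with the standard library's `xml.etree.ElementTree`, whose limited XPath subset supports the `[tag]` predicate. This is only a sketch under assumptions: the fragment from the question has several top-level elements, so it is wrapped in a made-up `<data>` root here, and the names `fragment` and `texts_per_ins` are illustrative.

```python
import xml.etree.ElementTree as ET

# The fragment from the question has several top-level elements, so it is
# wrapped in a made-up <data> root (an assumption) to make it well-formed.
fragment = '''
<ins rev="REV-NEU" editindex="0">
    <insacc rev="c3ce7877-42bf-4c41-b3c0-fd225ccaf512">eins</insacc>
    <insacc rev="c3ce7877-42bf-4c41-b3c0-fd225ccaf512">zwei</insacc>
</ins>
<del rev="REV-NEU" editindex="1">eins</del>
<insacc rev="c3ce7877-42bf-4c41-b3c0-fd225ccaf512">fünf</insacc>
'''
root = ET.fromstring('<data>' + fragment + '</data>')

# './/ins[insacc]' selects every <ins> that has at least one <insacc> child.
# The top-level <insacc> is not inside an <ins>, so it is not collected.
texts_per_ins = [
    [child.text for child in ins.findall('insacc')]
    for ins in root.findall('.//ins[insacc]')
]
print(texts_per_ins)  # [['eins', 'zwei']]
```

Each inner list holds the `insacc` texts of one `ins` element, so the number of children can vary freely from one element to the next.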

Shiplu Mokaddim

Muuuuuch cleaner than any regex I could think of – skamazin Jul 31 '14 at 17:10

score 0 · Answer 2 · edited May 23 '17 at 11:57

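By all means use lxml or Beautiful Soup (see this answer for why). A regular expression cannot really do what you want because the number of capturing groups is fixed by the pattern: a group inside a repetition such as `(...)+` is overwritten on every iteration, so only its last match is kept. That is why your pattern reports only the last insacc. Here's more information: an article on repeating groups in regexes and this SO answer providing an alternative.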
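
The fixed group count can be seen in a small sketch with Python's `re` module. The input below is an assumption made for illustration: the `ins` element from the question collapsed onto one line, with the `rev` values shortened to a placeholder. The two-step workaround (match the outer block, then search inside it) is one possible alternative, not necessarily the one the linked answer describes.

```python
import re

# Illustrative input: the <ins> element from the question on a single line,
# with shortened placeholder rev values.
text = (
    '<ins rev="REV-NEU" editindex="0">'
    '<insacc rev="a">eins</insacc>'
    '<insacc rev="a">zwei</insacc>'
    '<insacc rev="a">drei</insacc>'
    '</ins>'
)

# The repeated group (...)+ matches three times, but the match object
# stores only the final iteration of each group.
outer = re.compile(
    r'<ins rev="[^"]+" editindex="[^"]+">'
    r'(<insacc rev="[^"]+">([^<]+)</insacc>)+'
    r'</ins>'
)
match = outer.search(text)
last_only = match.group(2)

# Workaround: match the whole <ins> block first, then run a second
# pattern over the matched text to collect every inner value.
inner = re.compile(r'<insacc rev="[^"]+">([^<]+)</insacc>')
all_values = inner.findall(match.group(0))

print(last_only)   # drei
print(all_values)  # ['eins', 'zwei', 'drei']
```

Even this version breaks as soon as whitespace, attribute order, or nesting changes, which is the practical argument for a parser.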

answered Jul 31 '14 at 17:26

Kunal

score 0 · Answer 3 · answered Jul 31 '14 at 18:11

I defy you to reliably or easily do this with a regex:

# -*- coding: utf-8 -*-

import xml.etree.ElementTree as et

xml='''\
<data>
<ins rev="REV-NEU" editindex="0">
    <insacc rev="c3ce7877-42bf-4c41-b3c0-fd225ccaf512">eins</insacc>
    <insacc rev="c3ce7877-42bf-4c41-b3c0-fd225ccaf512">zwei</insacc>
    <insacc rev="c3ce7877-42bf-4c41-b3c0-fd225ccaf512">drei</insacc>
<insacc rev="c3ce7877-42bf-4c41-b3c0-fd225ccaf512">vier</insacc>
</ins> 
<del rev="REV-NEU" editindex="1">eins</del> 
<insacc rev="c3ce7877-42bf-4c41-b3c0-fd225ccaf512">fünf</insacc>
</data>'''      

# walk every element in document order (Python 3 print function)
for child in et.fromstring(xml).iter():
    print(child.tag, child.attrib, child.text)

Prints:

data {} 

ins {'rev': 'REV-NEU', 'editindex': '0'} 

insacc {'rev': 'c3ce7877-42bf-4c41-b3c0-fd225ccaf512'} eins
insacc {'rev': 'c3ce7877-42bf-4c41-b3c0-fd225ccaf512'} zwei
insacc {'rev': 'c3ce7877-42bf-4c41-b3c0-fd225ccaf512'} drei
insacc {'rev': 'c3ce7877-42bf-4c41-b3c0-fd225ccaf512'} vier
del {'rev': 'REV-NEU', 'editindex': '1'} eins
insacc {'rev': 'c3ce7877-42bf-4c41-b3c0-fd225ccaf512'} fünf

If you just want ./ins/insacc, use an XPath expression (ElementTree supports a limited subset):

for child in et.fromstring(xml).findall('./ins/insacc'):
    print(child.tag, child.attrib, child.text)

Prints:

insacc {'rev': 'c3ce7877-42bf-4c41-b3c0-fd225ccaf512'} eins
insacc {'rev': 'c3ce7877-42bf-4c41-b3c0-fd225ccaf512'} zwei
insacc {'rev': 'c3ce7877-42bf-4c41-b3c0-fd225ccaf512'} drei
insacc {'rev': 'c3ce7877-42bf-4c41-b3c0-fd225ccaf512'} vier

If you want all insacc even at the root:

for child in et.fromstring(xml).iter():
    if child.tag == 'insacc':
        print(child.tag, child.attrib, child.text)

Prints:

insacc {'rev': 'c3ce7877-42bf-4c41-b3c0-fd225ccaf512'} eins
insacc {'rev': 'c3ce7877-42bf-4c41-b3c0-fd225ccaf512'} zwei
insacc {'rev': 'c3ce7877-42bf-4c41-b3c0-fd225ccaf512'} drei
insacc {'rev': 'c3ce7877-42bf-4c41-b3c0-fd225ccaf512'} vier
insacc {'rev': 'c3ce7877-42bf-4c41-b3c0-fd225ccaf512'} fünf
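
The tag filter in the last loop can also be handed to `iter()` itself, which yields only elements with the given tag at any depth. A minimal sketch, assuming a shortened stand-in for the document above (the `rev` values and the names `xml_text` and `all_insacc` are placeholders):

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the document above; the rev values are placeholders.
xml_text = '''<data>
<ins rev="REV-NEU" editindex="0">
    <insacc rev="r1">eins</insacc>
    <insacc rev="r1">zwei</insacc>
</ins>
<del rev="REV-NEU" editindex="1">eins</del>
<insacc rev="r1">drei</insacc>
</data>'''

root = ET.fromstring(xml_text)

# iter(tag) walks the whole tree in document order and yields only
# elements with that tag, whether nested in <ins> or at the root level.
all_insacc = [elem.text for elem in root.iter('insacc')]
print(all_insacc)  # ['eins', 'zwei', 'drei']
```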
