Regex include line breaks

Question

I have the following XML file:

<p style="1">
A
</p>
<div xml:lang="unknown">
<p style="3">
B
C
</div>
<div xml:lang="English">
<p style="5">
D
</p>
<p style="1">
Picture number 3?
</p>

I just want to get the text between <div xml:lang="unknown"> and </div>, so I tried this code:

import os, re

html = open("2.xml", "r")
text = html.read()
lon = re.compile(r'<div xml:lang="unknown">\n(.+)\n</div>', re.MULTILINE)
lon = lon.search(text).group(1)
print lon

But it doesn't seem to work.

Parsing XML with regex is the wrong approach to take. Use a parser, and the pain is much less! — Sobrique, Oct 16 '15 at 14:13
http://stackoverflow.com/questions/1912434/how-do-i-parse-xml-in-python — Sobrique, Oct 16 '15 at 14:16
You can split the text at the <div>, creating a list of <div>s to iterate over, and apply your regex to each list item. — reticentroot, Oct 16 '15 at 15:25

score 3 · Accepted Answer · edited May 23 '17 at 12:03

1) Don't parse XML with regex. It just doesn't work. Use an XML parser.

2) If you do use regex for this, you don't want re.MULTILINE, which controls how ^ and $ work in a multiple-line string. You want re.DOTALL, which controls whether . matches \n or not.

3) You probably also want your pattern to return the shortest possible match, using the non-greedy +? operator.

lon = re.compile(r'<div xml:lang="unknown">\n(.+?)\n</div>', re.DOTALL)
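
To see the difference between the two flags, here is a small sketch that runs the same pattern against the sample from the question. The string `text` is an assumption standing in for the contents of 2.xml, and `print()` is used so the snippet also runs on Python 3.

```python
import re

# Stand-in for the contents of 2.xml from the question (assumption:
# the real script reads this from disk).
text = """<p style="1">
A
</p>
<div xml:lang="unknown">
<p style="3">
B
C
</div>
<div xml:lang="English">
<p style="5">
D
</p>
<p style="1">
Picture number 3?
</p>
"""

pattern = r'<div xml:lang="unknown">\n(.+?)\n</div>'

# re.MULTILINE only changes how ^ and $ behave; "." still stops at "\n",
# so the group cannot span the three lines inside the div.
multiline_match = re.search(pattern, text, re.MULTILINE)

# re.DOTALL lets "." match "\n", so the group can span several lines.
dotall_match = re.search(pattern, text, re.DOTALL)

print(multiline_match)          # None
print(dotall_match.group(1))    # the three lines inside the div
```

In this sample there is only one </div>, so a greedy .+ would capture the same text. The non-greedy .+? starts to matter once another </div> appears later in the file, because a greedy match would run on to the last one.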

score 0 · Answer 2 · answered Oct 16 '15 at 14:37

You can extract a block like this by scanning line by line: when you enter the block, set a flag to True; when you leave it, set the flag to False and break out of the loop.

def get_infobox(self):
    """returns Infobox wikitext from text blob
    learned from https://github.com/siznax/wptools/blob/master/wp_infobox.py
    """
    if self._rawtext:
        text = self._rawtext
    else:
        text = self.get_rawtext()
    output = []
    region = False
    braces = 0
    lines = text.split("\n")
    if len(lines) < 3:
        raise RuntimeError("too few lines!")

    for line in lines:
        match = re.search(r'(?im){{[^{]*box$', line)
        braces += len(re.findall(r'{{', line))
        braces -= len(re.findall(r'}}', line))
        if match:
            region = True
        if region:
            output.append(line.lstrip())
            if braces <= 0:
                region = False
                break
    self._infobox = "\n".join(output)
    assert self._infobox
    return self._infobox
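
Applied to the input from the question, the same flag idea might look like the sketch below. The helper name `extract_block` and its parameters are hypothetical, and it assumes that the opening and closing tags each sit on their own line and that divs are not nested.

```python
def extract_block(text,
                  start_marker='<div xml:lang="unknown">',
                  end_marker='</div>'):
    """Hypothetical helper: return the lines between two marker lines."""
    output = []
    inside = False  # flag: are we currently inside the block?
    for line in text.split("\n"):
        stripped = line.strip()
        if not inside:
            if stripped == start_marker:
                inside = True   # entered the block
            continue
        if stripped == end_marker:
            break               # left the block, stop scanning
        output.append(line)
    return "\n".join(output)


# Sample data copied from the question.
sample = """<p style="1">
A
</p>
<div xml:lang="unknown">
<p style="3">
B
C
</div>
<div xml:lang="English">
<p style="5">
D
</p>
<p style="1">
Picture number 3?
</p>
"""

print(extract_block(sample))
```

If the start marker never appears, the function returns an empty string rather than raising, which may or may not be what you want.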

score 0 · Answer 3 · answered Oct 16 '15 at 15:31

You can try splitting on the div and matching against each list item. This also works well for regexes on large data.

import re

html = """<p style="1">
A
</p>
<div xml:lang="unknown">
<p style="3">
B
C
</div>
<div xml:lang="English">
<p style="5">
D
</p>
<p style="1">
Picture number 3?
</p>
"""

# Split on the opening tag, then search each chunk separately.
for div in html.split('<div'):
    m = re.search(r'xml:lang="unknown">.+(<p[^<]+)', div, re.DOTALL)
    if m:
        print(m.group(1))
