RegEx for XML parsing in BeautifulSoup

Question

I need to parse a file, specifically an XBRL file, with BeautifulSoup and the XML parser. However, the output differs depending on whether I use the lxml parser or the XML parser, and the regex that works with the lxml parser returns nothing with the XML parser. The output of the script is included below.

The reason I need to use the XML parser is that it maintains capital letters, and I use a regex because the tag names vary throughout the file and contain the ":" character.

soup = BeautifulSoup(xbrl, 'xml')
soup.find_all(re.compile('ifrs-full'))
output: []

# But if I use the lxml parser and the same regex, I get:

soup = BeautifulSoup(xbrl, 'lxml')
soup.find_all(re.compile('ifrs-full'))
output: 
[<ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity contextref="Duration_Actual_PerdidasFiscales_1" decimals="-3" unitref="CLP">-4088611000</ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity>,
<ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity contextref="Duration_Actual_UnusedTaxLossesMember" decimals="-3" unitref="CLP">-4088611000</ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity>,
 <ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity contextref="TrimestreAcumuladoActual" decimals="-3" unitref="CLP">-4088611000</ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity>]

How do I solve this problem?

Using `xml` as the parser, can you try the following and let me know the output? `for i in soup.find_all(): if 'ifrs-full' in str(i) and i.attrs!={}: print(i) ` — Jack Fleeting, May 20 '19 at 02:25

score 0 · Accepted Answer · edited Jun 20 '20 at 09:12

Regular expressions may not be the best tool for this task. However, if we have to use them, we can use capturing groups and collect the desired data step by step:

<(.+?):([a-z]+)\s(contextref)(=")(.+?)"\s(decimals)(=")(.+?)"\s(unitref)(=")(.+?)">(.+?)<\/(.+?):([a-z]+)>

If the trailing comma should also be matched, we can simply modify it to:

<(.+?):([a-z]+)\s(contextref)(=")(.+?)"\s(decimals)(=")(.+?)"\s(unitref)(=")(.+?)">(.+?)<\/(.+?):([a-z]+)>,?
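One caveat: `[a-z]+` matches only lowercase tag names, as in the lxml output shown in the question. Markup that preserves capitals would need `[A-Za-z]+` or a case-insensitive flag. The following is a minimal sketch of a variant of the pattern above, not the exact expression from this answer. It assumes named groups (the group names are illustrative) and uses a hypothetical mixed-case sample element, since the question does not show the case-preserved output:

```python
import re

# Named groups make the captures self-describing; re.IGNORECASE lets the same
# pattern match both lowercased (lxml) and case-preserved markup.
FACT_RE = re.compile(
    r'<(?P<prefix>[\w.-]+):(?P<name>\w+)\s+'
    r'contextref="(?P<contextref>[^"]*)"\s+'
    r'decimals="(?P<decimals>[^"]*)"\s+'
    r'unitref="(?P<unitref>[^"]*)">'
    r'(?P<value>[^<]*)'
    r'</(?P=prefix):(?P=name)>',
    re.IGNORECASE,
)

# Hypothetical sample: an assumed mixed-case variant of an element from the question.
sample = (
    '<ifrs-full:DeferredTaxRelatingToItemsChargedOrCreditedDirectlyToEquity '
    'contextRef="TrimestreAcumuladoActual" decimals="-3" unitRef="CLP">'
    '-4088611000'
    '</ifrs-full:DeferredTaxRelatingToItemsChargedOrCreditedDirectlyToEquity>'
)

# Collect every match as a dict keyed by group name.
facts = [m.groupdict() for m in FACT_RE.finditer(sample)]
for fact in facts:
    print(fact["prefix"], fact["name"], fact["contextref"], fact["value"])
```

Like the original expression, this sketch still assumes the three attributes appear in a fixed order, so it would miss elements whose attributes are ordered differently.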

RegEx

If this expression isn't what is needed, it can be modified or changed at regex101.com.

RegEx Circuit

jex.im also helps to visualize the expressions.

Test

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"<(.+?):([a-z]+)\s(contextref)(=\")(.+?)\"\s(decimals)(=\")(.+?)\"\s(unitref)(=\")(.+?)\">(.+?)<\/(.+?):([a-z]+)>"

test_str = ("<ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity contextref=\"Duration_Actual_PerdidasFiscales_1\" decimals=\"-3\" unitref=\"CLP\">-4088611000</ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity>,\n"
    "<ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity contextref=\"Duration_Actual_UnusedTaxLossesMember\" decimals=\"-3\" unitref=\"CLP\">-4088611000</ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity>,\n"
    " <ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity contextref=\"TrimestreAcumuladoActual\" decimals=\"-3\" unitref=\"CLP\">-4088611000</ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity>")

matches = re.finditer(regex, test_str, re.MULTILINE)

for match_num, match in enumerate(matches, start=1):
    print("Match {} was found at {}-{}: {}".format(match_num, match.start(), match.end(), match.group()))

    # Groups are 1-indexed; group 0 is the whole match.
    for group_num in range(1, len(match.groups()) + 1):
        print("Group {} found at {}-{}: {}".format(group_num, match.start(group_num), match.end(group_num), match.group(group_num)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

Demo

This snippet just shows how the capturing groups work:

const regex = /<(.+?):([a-z]+)\s(contextref)(=\")(.+?)\"\s(decimals)(=\")(.+?)\"\s(unitref)(=\")(.+?)\">(.+?)<\/(.+?):([a-z]+)>/gm;
const str = `<ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity contextref="Duration_Actual_PerdidasFiscales_1" decimals="-3" unitref="CLP">-4088611000</ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity>,
<ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity contextref="Duration_Actual_UnusedTaxLossesMember" decimals="-3" unitref="CLP">-4088611000</ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity>,
 <ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity contextref="TrimestreAcumuladoActual" decimals="-3" unitref="CLP">-4088611000</ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity>`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
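The opening remark of this answer, that regex may not be the best tool here, can be illustrated with a parser-based sketch. It assumes the document declares the `ifrs-full` prefix with a namespace URI. The URI, the `DeferredTax` element name, and the sample document below are hypothetical placeholders, not taken from the original file. Python's standard `xml.etree.ElementTree` preserves tag case and exposes each qualified tag as `{namespace-uri}LocalName`, so elements can be filtered by namespace rather than by pattern:

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI and sample document, for illustration only.
IFRS_NS = "http://example.com/ifrs-full"

xbrl = """<xbrl xmlns:ifrs-full="http://example.com/ifrs-full">
  <ifrs-full:DeferredTax contextRef="TrimestreAcumuladoActual" decimals="-3" unitRef="CLP">-4088611000</ifrs-full:DeferredTax>
  <other>ignored</other>
</xbrl>"""

root = ET.fromstring(xbrl)

# ElementTree stores qualified tags as "{uri}LocalName", with case preserved.
ns_tag_prefix = "{" + IFRS_NS + "}"
facts = [
    (el.tag[len(ns_tag_prefix):], el.attrib, el.text)
    for el in root.iter()
    if el.tag.startswith(ns_tag_prefix)
]

for name, attrs, value in facts:
    print(name, attrs.get("contextRef"), value)
```

Filtering on the namespace URI rather than the literal prefix means the sketch would still work if a file bound the same namespace to a different prefix.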
