Prettify xml produced by Beautifulsoup with regex

Question

I'm trying to get valid, pretty-printed XML so that I can pass it on to requests.

However, the XML "prettified" by BeautifulSoup looks like this:

...
 <typ>
  TYPE_1
 </typ>
 <rte>
  AL38941XXXXX
 </rte>
 <sts>
  ADDED
 </sts>
...

A handy way of dealing with such messy output is described here:

import re
text_re = re.compile(r'>\n\s+([^<>\s].*?)\n\s+</', re.DOTALL)
prettyXml = text_re.sub(r'>\g<1></', uglyXml)

which gives:

 <typ>TYPE_1</typ>
 <rte>AL38941XXXXX</rte>
 <sts>ADDED</sts>

However, the regex simply skips empty values, which causes problems when some of the values in the parsed string are empty.

Example:

 <typ>TYPE_1</typ>
 <rte>AL38941XXXXX</rte>
 <sts>ADDED</sts>
 <ref>
 </ref>

Then requests tries to run the query with a parameter of ' ' for the empty tag, which leads to an incorrect query result.

I'm not really fluent in regex, so I tried >\n\s+</ in another regex, failed, and hacked it like this:

text_re = re.compile(r'>\n\s+([^<>\s].*?)\n\s+</', re.DOTALL)
prettyXml = text_re.sub(r'>\g<1></', uglyXml).replace('>\n ', '><').replace('>\n  ', '><')

And all the "pretty" markup is sadly gone... It kind of works, but how should this be done properly?

Do you really need it prettified... if you're just sending it to another system - it shouldn't matter - XML is XML... Could you elaborate on *Then requests tries to run query with parameter of ' ' in empty tag, what leads to incorrect query result.* - I don't see why `requests` should care about such things - it's just going to transmit data as it sees it - not try and parse it... How is `requests` being used there? — Jon Clements, Aug 25 '18 at 12:59
Agreed, XML is XML, but it would be great to have it human-readable for debugging, wouldn't it? As for `requests`: some details were added to give more (possibly exhaustive) information on the problem. — im_infamous, Aug 25 '18 at 13:11
Sure... but generally prettify it when you want debugging info. not necessarily for transporting...? — Jon Clements, Aug 25 '18 at 13:13
You've picked a *really* bad example of a "handy way of dealing with this", out of a thread that contains quite a lot of good examples, no less. Since you ask how to do this properly: You should start with never using regex on XML. Not for value extraction and not for pretty printing. Use an XML parser. Many parsers come with pretty-printing support built in (lxml definitely does) so you don't even have to roll your own approach in the first place. — Tomalak, Aug 25 '18 at 13:24
Absolutely, but the hack persists and has to be replaced. There is nothing about `requests` here, so should the corresponding tag be deleted? — im_infamous, Aug 25 '18 at 13:27
@Tomalak btw, how do I use `lxml` pretty-printing without being forced to save the data to a file first? Every example that I could find starts with `f = open('doc.xml', 'w')`, which may be fine, but what if I don't want to write data on each request just to facilitate possible debugging? — im_infamous, Aug 25 '18 at 13:34
Take a few minutes to read the answers in the thread you linked yourself. It's all in there. For lxml, for minidom, too, and even for the built-in xml module. — Tomalak, Aug 25 '18 at 13:44
But I agree with Jon. Don't modify the XML just for the sake of it. It takes processing time, and as long as everything works it takes processing time to do something that nobody will ever see. — Tomalak, Aug 25 '18 at 13:56

score 0 · Answer 1 · answered Aug 26 '18 at 22:07

As noted in the comments, don't bother prettifying the output.

If you want to prettify for debugging purposes and need to rely on BeautifulSoup for that, plus an extra step for 'fixing' the text nodes, you can try this regex:

(<([^\/>]+)>)\s+(?:([\s\S]*?)\s+)??(<\/\2>)

Replace by: $1$3$4
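
For reference, here is a minimal sketch of how the pattern and replacement above might be applied from Python. The names `TAG_TEXT_RE` and `collapse_text_nodes` and the sample fragment are only illustrative; the pattern itself is unchanged. It assumes a flat fragment of sibling elements like the one in the question. If the fragment is wrapped in a parent element, the lazy group can match from the parent's opening tag to its closing tag and leave the children untouched.

```python
import re

# The pattern from the answer, unchanged. Group 1 is the opening tag,
# group 3 the optional text content, group 4 the matching closing tag.
TAG_TEXT_RE = re.compile(r'(<([^\/>]+)>)\s+(?:([\s\S]*?)\s+)??(<\/\2>)')


def collapse_text_nodes(ugly_xml):
    """Hypothetical helper: pull text content back onto the tag's line."""
    # Python writes the replacement $1$3$4 as \g<1>\g<3>\g<4>.
    # Since Python 3.5 an unmatched group (group 3 for an empty element)
    # is substituted with an empty string.
    return TAG_TEXT_RE.sub(r'\g<1>\g<3>\g<4>', ugly_xml)


# Illustrative input in the style of BeautifulSoup's prettify() output.
ugly_xml = " <typ>\n  TYPE_1\n </typ>\n <ref>\n </ref>\n"
print(collapse_text_nodes(ugly_xml))
```

With this input the empty element should come out as `<ref></ref>` rather than being skipped, which is the case the regex in the question missed.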

However, bear in mind that regular expressions may not be the right tool for this. The proof is that the previous regex will fail with CDATA content like this:

 <sts>
   <![CDATA[
    </sts>
   ]]>
 </sts>

Sure, we could fine-tune the regex to handle CDATA sections, but even so, it would probably be prone to some other problem. So it would be better to use an XML parser, or even better, an XML beautifier that does not change the text nodes. I think there were a couple of recommendations in the SO question you linked.
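
As one possible parser-based route, here is a rough sketch that uses only the standard library's `xml.etree.ElementTree` and works on strings, so nothing is written to a file. It assumes Python 3.9 or later (for `ET.indent`) and assumes that whitespace around text is not significant in your data, because it strips it. The helper name `pretty_xml` and the `<doc>` wrapper element are made up for the example.

```python
import xml.etree.ElementTree as ET


def pretty_xml(xml_text, space=" "):
    """Hypothetical helper: re-indent XML without padding text nodes."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Drop the padding a previous pretty-printer added around text.
        # Assumption: leading and trailing whitespace carries no meaning.
        if elem.text is not None:
            elem.text = elem.text.strip() or None
        if elem.tail is not None:
            elem.tail = elem.tail.strip() or None
    # ET.indent (Python 3.9+) only adds whitespace between elements;
    # the text of leaf elements is left as it is.
    ET.indent(root, space=space)
    return ET.tostring(root, encoding="unicode")


# Illustrative input; <doc> is an assumed root element.
ugly_xml = "<doc>\n <typ>\n  TYPE_1\n </typ>\n <ref>\n </ref>\n</doc>"
print(pretty_xml(ugly_xml))
```

In this sketch an empty element is serialized in its short form (`<ref />`), so no stray whitespace ends up as its value.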
