I'm having trouble with textwrap.dedent from the Python standard library. I tried to apply the solution described in "How to remove extra indentation of Python triple quoted multi-line strings?" to an XML-like string, but it's not working as expected.

help(textwrap.dedent)
Help on function dedent in module textwrap:

dedent(text)
    Remove any common leading whitespace from every line in `text`.
    
    This can be used to make triple-quoted strings line up with the left
    edge of the display, while still presenting them in the source code
    in indented form.
    
    Note that tabs and spaces are both treated as whitespace, but they
    are not equal: the lines "  hello" and "\thello" are
    considered to have no common leading whitespace.  (This behaviour is
    new in Python 2.5; older versions of this module incorrectly
    expanded tabs before searching for common leading whitespace.)

I'm on Python 3.6.9 / Ubuntu 18.04.

Here's an example:

template_string = """\
    <?xml version="1.0" encoding="utf-8" ?>
      <doc>
        <param1>hello</param1>
        <param1>world</param1>
      </doc>
"""

then:

>>> textwrap.dedent(template_string)
'<?xml version="1.0" encoding="utf-8" ?>\n  <doc>\n    <param1>hello</param1>\n    <param1>world</param1>\n  </doc>\n'

I still have extra spaces.

The expected output is as follows (I know I also have to apply .replace('\n','') afterwards):

'<?xml version="1.0" encoding="utf-8" ?><doc><param1>hello</param1><param1>world</param1></doc>'

The idea is to end up with a single-line XML string so that I can write it to a Postgres database.

swiss_knight

1 Answer

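textwrap.dedent() does not remove all leading whitespace; it only removes the whitespace that is common to every line, so that the string lines up with the left edge of the display. Any indentation beyond that common prefix is kept.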
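To make that concrete, here is a minimal sketch using a shortened version of the template from the question (the variable names `dedented` and `lines` are only illustrative). It shows that the nested lines keep their indentation relative to the first line.

```python
import textwrap

template_string = """\
    <?xml version="1.0" encoding="utf-8" ?>
      <doc>
        <param1>hello</param1>
      </doc>
"""

dedented = textwrap.dedent(template_string)

# Only the 4 spaces shared by every line are removed;
# the extra 2 and 4 spaces of nesting are left in place.
lines = dedented.splitlines()
for line in lines:
    print(repr(line))
```

So a follow-up .replace('\n', '') would still leave those relative indents inside the single-line string.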

You can use str.partition() to separate the XML declaration from the XML body and then apply a regular expression substitution to the body.

import re

template_string = """\
    <?xml version="1.0" encoding="utf-8" ?>
     <doc>
       <param1>hello</param1>
       <param1>world</param1>
     </doc>"""

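# Split at the first ">" (the end of the XML declaration) and strip each part
first, sep, rest = [s.strip() for s in template_string.partition(">")]
xml_decl = first + sep
rest = re.sub(r"\s", "", rest)  # Remove all whitespace (raw string for the regex)
doc = xml_decl + rest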

print(doc)

Result:

<?xml version="1.0" encoding="utf-8" ?><doc><param1>hello</param1><param1>world</param1></doc>
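Note that the substitution above removes every whitespace character, including spaces inside tags and text. A narrower variant, sketched here as an assumption rather than a robust solution, only drops whitespace that sits between a closing ">" and the next "<". The `name` attribute and the "hello world" text are hypothetical additions, there just to show that inner spaces survive. Since this is plain regex and not an XML parser, it could still misfire on content such as CDATA sections or comments.

```python
import re

template_string = """\
    <?xml version="1.0" encoding="utf-8" ?>
      <doc xmlns="http://test.com" name="two words">
        <param1>hello world</param1>
      </doc>"""

# Remove whitespace only between one tag's '>' and the next tag's '<';
# spaces inside tags, attribute values and text content are untouched.
compact = re.sub(r">\s+<", "><", template_string.strip())

print(compact)
```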

With lxml, it is possible to create a single-line XML by parsing the original XML with a special option that removes whitespace. Below is an example that includes a namespace.

from lxml import etree
 
template_string = """\
    <?xml version="1.0" encoding="utf-8" ?>
      <doc xmlns="http://test.com">
        <param1>hello</param1>
        <param1>world</param1>
      </doc>"""
 
first, sep, rest = [s.strip() for s in template_string.partition(">")]
xml_decl = first + sep
 
# Set up a parser that removes whitespace
parser = etree.XMLParser(remove_blank_text=True)
 
# Parse the XML
root = etree.fromstring(rest, parser)
 
# Create the final single-line output
doc = xml_decl + etree.tostring(root).decode()

print(doc)

Result:

<?xml version="1.0" encoding="utf-8" ?><doc xmlns="http://test.com"><param1>hello</param1><param1>world</param1></doc>
mzjn
  • Thanks. The problem I see with that is that if the XML tags use a namespace and some values include spaces, the spaces inside them get collapsed as well. – swiss_knight Mar 12 '21 at 20:00
  • Yes, there is always yet another pesky complication... However, a real namespace URI ("stuff with spaces") that contains spaces seems extremely unlikely (if not outright invalid). But the collapsing of `kml xmlns` into `kmlxmlns` is a problem. – mzjn Mar 12 '21 at 20:16
  • Added another solution. Perhaps you can use it. – mzjn Mar 12 '21 at 21:31