2

I'm trying to use Python and BeautifulSoup 4 (bs4) to convert Inkscape SVGs into an XML-like format for some proprietary software. I can't seem to get bs4 to correctly parse a minimal example. I need the parser to respect self-closing tags, handle unicode, and not add html stuff. I thought specifying the 'lxml' parser with selfClosingTags should do it, but nope! check it out.

#!/usr/bin/python
from __future__ import print_function
from bs4 import BeautifulSoup

print('\nbs4 mangled XML:')
print(BeautifulSoup('<x><c name="b1"><d value="a"/></c></x>',
    features = "lxml", 
    selfClosingTags = ('d')).prettify())

print('''\nExpected output:
<x>
 <c name="b1">
  <d value="a"/>
 </c>
</x>''')

This prints

bs4 mangled XML:
/usr/local/lib/python2.7/dist-packages/beautifulsoup4-4.4.1-py2.7.egg/bs4/__init__.py:112: UserWarning: BS4 does not respect the selfClosingTags argument to the BeautifulSoup constructor. The tree builder is responsible for understanding self-closing tags.
<html>
 <body>
  <x>
   <c name="b1">
    <d value="a">
    </d>
   </c>
  </x>
 </body>
</html>

Expected output:
<x>
 <c name="b1">
  <d value="a"/>
 </c>
</x>

I've reviewed related StackOverflow questions, and I am not finding the solution.

This question addresses the html boilerplate, but only for parsing subsections of html, not for parsing xml.

This question pertains to getting beautifulsoup 4 to respect self-closing tags, and has no accepted answers.

This question seems to indicate that passing the selfClosingTags argument should help, but as you can see this now generates a warning BS4 does not respect the selfClosingTags argument, and self-closing tags are mangled.

This question suggests that using "xml" (not "lxml") will cause empty tags to be automatically self-closing. This might work for my purposes, but applying the "xml" parser to my actual data fails because the files contain unicode, which the "xml" parser does not support.

Is "xml" different from "lxml", and is it in the standard that "xml" cannot support unicode, and "lxml" cannot contain self-closing tags? Perhaps I'm simply trying to do something that is forbidden?

Community
  • 1
  • 1
MRule
  • 529
  • 1
  • 6
  • 18

1 Answers1

4

If you want the result to be output as xml then parse it as that. Your xml data can contain unicode, however, you need to declare the encoding:

#!/usr/bin/env python
# -*- encoding: utf8 -*-

The SelfClosingTags is no longer recognized. Instead, Beautiful Soup considers any empty tag to be an empty-element tag. If you add a child to an empty-element tag, it stops being an empty-element tag.

Changing your function to look like this should work (In addition to the encoding):

print(BeautifulSoup('<x><c name="b1"><d value="a®"/></c></x>',
    features = "xml").prettify())

Result:

<?xml version="1.0" encoding="utf-8"?>
<x>
 <c name="b1">
  <d value="aÂŽ"/>
 </c>
</x>
l'L'l
  • 44,951
  • 10
  • 95
  • 146
  • Your encoding declaration declares the encoding of your source file. That is not relevant to the question. – roeland Mar 22 '16 at 02:45
  • @roeland: It's relevant, since without it any unicode characters within the XML will cause a syntax error. `Non-ASCII character: but no encoding declared`; http://python.org/dev/peps/pep-0263/ – l'L'l Mar 22 '16 at 08:38
  • But the OP is loading the XML from SVG files. – roeland Mar 22 '16 at 12:03
  • Stackoverflow doesn't have a way to include files with questions, so I can't make an SVG example, that's why we have the pure Python example. – MRule Mar 22 '16 at 20:42
  • @MRule: [https://gist.github.com/](https://gist.github.com/) works well for more detailed stuff — so you're not really limited to just posting the minimal example here. Then you can just provide a link to it within your question... – l'L'l Mar 22 '16 at 20:48
  • @MRule I think if you have a small example input file, it is OK to post the contents of that file in the question. – roeland Mar 23 '16 at 02:49