
The situation is as follows.

XML file:

<tag1/>  
<tag2>some_data</tag2>
<tag1>some_another_data</tag1>

tag1 is sometimes self-closing and sometimes has data inside.

code:

from BeautifulSoup import BeautifulStoneSoup
s = '<tag1/><tag2>some_data</tag2><tag1>some_another_data</tag1>'
soup1 = BeautifulStoneSoup(s)
soup2 = BeautifulStoneSoup(s, selfClosingTags=["tag1"])
print soup1.prettify()
print
print soup2.prettify()

output:

<tag1>
 <tag2>
  some_data
 </tag2>
</tag1>
<tag1>
 some_another_data
</tag1>

<tag1 />
<tag2>
 some_data
</tag2>
<tag1 />
some_another_data

In the first case, tag1 swallows the following tag (unless it is another tag1), because self-closing tags are not supported by default. In the second case, a tag declared as self-closing cannot have children, so the content of the second tag1 ends up outside it.

I just want to get the same structure as the original XML document. Is that possible with BeautifulSoup? And if so, how can I make every tag potentially self-closing by default? There are a lot of XML files, and I don't want to hunt down all such cases manually.

Mikhail M.
  • I haven't used BeautifulSoup, but based on what you've said it looks like it isn't an XML parser at all, but an HTML parser. (You don't specify which tags are self-closing for XML; it's *always* valid to write an element with no children either way.) It looks like you're using the wrong tool. – Glenn Maynard Jan 30 '11 at 21:46

2 Answers

2

I wouldn't recommend BeautifulSoup (not even for HTML parsing). Use ElementTree from the standard library, or lxml if you need a more powerful XML library.
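As a rough sketch of the ElementTree route (written for Python 3, unlike the Python 2 code in the question): the fragment from the question has no single root element, so it is wrapped here in a `<root>` element, which is an assumption made only for this illustration. A real XML file should already have its own root.

```python
import xml.etree.ElementTree as ET

s = '<tag1/><tag2>some_data</tag2><tag1>some_another_data</tag1>'

# The fragment is not well-formed XML on its own (no single root),
# so wrap it in a hypothetical <root> element for this example.
root = ET.fromstring('<root>' + s + '</root>')

# Collect (tag, text) pairs; .text is None for an empty element
# such as the self-closing <tag1/>.
structure = [(child.tag, child.text) for child in root]

for tag, text in structure:
    print(tag, repr(text))
```

The three elements stay siblings, and the same tag name can be empty in one place and carry text in another, with no list of self-closing tags to configure.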

  • BeautifulSoup just seems more comfortable than ElementTree. And isn't lxml a wrapper around a C library? I need to use it on GAE, and I don't know if it will work there. But anyway, I'd like to make BeautifulSoup work if it is possible – Mikhail M. Jan 30 '11 at 19:20
  • BeautifulSoup specializes in parsing ill-formed HTML. If you have good XML, use a pure XML parser. There are options available on GAE: http://stackoverflow.com/questions/1032724 – Ned Batchelder Jan 30 '11 at 19:33
  • @Ned: How is a question about Java XML parsing on GAE supposed to help concerning Python? –  Jan 30 '11 at 19:47
  • Oops: I always forget that Java is available on GAE, and didn't notice that the question I linked to was about Java. Sorry! – Ned Batchelder Jan 30 '11 at 20:55
0

You can tell BeautifulSoup 4 ("bs4") to use a different parser (such as lxml) by specifying it in the constructor. I would avoid earlier versions entirely, and avoid using the default parser with bs4 (for example, it fails to handle omitted end tags correctly).

TextGeek