BeautifulSoup replace tag by raw xml without parsing/escaping

Question

Say I have xml like this (the real one is more complicated):

<a>
    <b>
        <c replace="alpha" />
    </b>
    <d>
        <c replace="beta"></c>
    </d>
</a>

I've parsed this with BeautifulSoup (lxml) since I can't use regex. Now I'm replacing the <c> with a string containing new valid xml that depends on the attribute. This isn't all that hard.

But I want to do it without parsing the new xml with BeautifulSoup. The reason is that I'm just going to prettify it right after. There are quite some tags being replaced by significant amounts of xml. It's not very performant parsing and then prettifying everything.

Is there something like a LiteralXmlPleaseDontParseThisTnx node? (I can't find it, they must have called it something else, and there are too many unrelated hits for 'raw html', 'unparsed html', 'literal hmtl'...).

Alternatively, is there a way to prettify the above xml and then insert the new xml in that as plain text (without making assumptions about the xml beyond being valid)?

Take a look at [lxml](http://lxml.de). Perhaps it may help you. — Mauro Baraldi, Feb 11 '16 at 01:02

b4hand · Answer 1 · 2016-02-11T01:31:00.557

BeautifulSoup is for parsing HTML. What you have is not HTML, but XML, so you probably shouldn't be using BeautifulSoup, but rather using lxml directly.

The lxml Element does have a replace method, but you must pass it an Element, not a string. It's unclear what you're trying to replace <c> with, but if you create your replacement value as an Element from the beginning you can do the replacement without parsing.

If instead you simply want to drop an arbitrary string in place of <c>, well, that's not a well-formed operation on an XML document, and there's no way the library could guarantee that what you pasted in is well-formed, and thus it would be impossible to serialize the given result. Most XML libraries are going to specifically forbid that operation since it would violate the underlying assumptions and guarantees that the XML library is trying to maintain.

I'm replacing `` with a string containing valid xml (edited the question to clarify). Thanks for the hint using `lxml` directly, I've not had trouble with `BeautifulSoup` but I'd better use the appropriate library. Too bad there may not be a solution to the performance thing... — Mark, Feb 11 '16 at 10:46

score 0 · Accepted Answer · answered Feb 11 '16 at 23:17

I found a way to create the same result, which happens to work for me but might not be generally applicable. It's in the 'alternatively' category of the question: do the replacement outside the parsed soup.

Escape string formatting curly brackets before parsing the main document:

escaped = sub(r'({|})', r'\1\1', input)
soup = BeautifulSoup(escaped, 'lxml')  # or lxml

Replace <c replace="alpha" /> by a substitution string (for all of them):

name = c_tag.attrs['replace']
ctag.replace_with(NavigableString('{' + name + ':s}'))

Store all replacements in a dictionary (perhaps already the case):
```
rep = {'alpha': '<lots /><of-xml />', 'beta': '<b>hi</b>'}
```
Make all the replacements using string formatting:
```
output = soup.prettify().format(**rep)
```

I'll admit my case is a little special, so maybe it doesn't help many others. But in my case each <c> could be replaced by xml that contained more <c>s. Each level would need to be parsed or pickled due to multi-proces communication. (Pickling is only 20-50% faster than parsing, and runs into the hard recursion limit). So having to do this just once instead of for each level saves me a lot of time (factor 3 in a case I tested), since regex replace and string substitutions are much faster than parsing.

BeautifulSoup replace tag by raw xml without parsing/escaping

2 Answers2