I am trying to replace every occurrence of an XML tag in a document (call it the target) with the contents of a tag in a different document (call it the source). The tag from the source could contain just text, or it could contain more XML.
Here is a simple example of what I am not able to get working:
test-source.htm:
<?xml version="1.0" encoding="utf-8"?>
<html>
<head>
</head>
<body>
<srctxt>text to be added</srctxt>
</body>
</html>
test-target.htm:
<?xml version="1.0" encoding="utf-8"?>
<html>
<head>
</head>
<body>
<replacethis src="test-source.htm"></replacethis>
<p>irrelevant, just here for filler</p>
<replacethis src="test-source.htm"></replacethis>
</body>
</html>
replace_example.py:
import os
import re
from bs4 import BeautifulSoup
# Just for testing
source_file = "test-source.htm"
target_file = "test-target.htm"
with open(source_file) as s:
source = BeautifulSoup(s, "lxml")
with open(target_file) as t:
target = BeautifulSoup(t, "lxml")
source_tag = source.srctxt
for tag in target():
for attribute in tag.attrs:
if re.search(source_file, str(tag[attribute])):
tag.replace_with(source_tag)
with open(target_file, "w") as w:
w.write(str(target))
This is my unfortunate test-target.htm
after running replace_example.py
<?xml version="1.0" encoding="utf-8"?><html>
<head>
</head>
<body>
<p>irrelevant, just here for filler</p>
<srctxt>text to be added</srctxt>
</body>
</html>
The first replacethis
tag is now gone and the second replacethis
tag has been replaced. This same problem happens with "insert" and "insert_before".
The output I want is:
<?xml version="1.0" encoding="utf-8"?><html>
<head>
</head>
<body>
<srctxt>text to be added</srctxt>
<p>irrelevant, just here for filler</p>
<srctxt>text to be added</srctxt>
</body>
</html>
Can someone please point me in the right direction?
Additional Complications: The example above is the simplest case where I could reproduce the problem I seem to be having with BeautifulSoup, but it does not convey the full detail of the problem I'm trying to solve. Actually, I have a list of targets and sources. The replacethis
tag needs to be replaced by the contents of a source only if the src
attribute contains a reference to a source in the list. So I could use the replace method, but it would require writing a lot more regex than if I could convince BeautifulSoup to work. If this problem is a BeautifulSoup bug, then maybe I'll just have to write the regex instead.