
I have a TEI document that uses entity references such as &stern_1;, which are defined in a separate DTD (Document Type Definition) file, Zeichen.dtd. That file contains:

<?xml version="1.0" encoding="UTF-8"?>
<!ENTITY stern_1 "&#10035;" >

I am using BeautifulSoup4 with lxml-xml as the parser.

Example:

from bs4 import BeautifulSoup

dtd_str = '<!DOCTYPE Zeichen SYSTEM "Zeichen.dtd">'
xml_str = "<p>Hello, &stern_1;!</p>"
soup = BeautifulSoup(dtd_str + xml_str, 'lxml-xml')
print(soup.find('p').get_text())

The code above prints this:

 Hello, !

instead of this:

 Hello, ✳!

I also tried an inline DTD, with the same result:

dtd_str = """
<!DOCTYPE html [
    <!ENTITY stern_1 "&#10035;">
]>
"""
xml_str = "<p>Hello, &stern_1;!</p>"

from bs4 import BeautifulSoup
soup = BeautifulSoup(dtd_str + xml_str, 'lxml-xml')
print(soup.find('p').get_text())

Output:

Hello, !

Any ideas?

Swen Vermeul
  • It seems you never put the doctype and p-tag strings together. You always just lookup the xml string, so I suppose the custom character is never loaded. – Borisu Nov 27 '18 at 11:50
  • yes, it should read `BeautifulSoup(dtd_str+xml_str, 'lxml-xml')`, but this doesn't change anything - the issue still persists – trybik Dec 03 '18 at 09:32
  • Thanks, I corrected that. – Swen Vermeul Dec 03 '18 at 13:28

1 Answer

I finally found a working solution to my own problem:

dtd_str = """
<!DOCTYPE html [
    <!ENTITY stern_1 "&#10035;">
]>
"""
xml_str = "<p>Hello, &stern_1;!</p>"
from lxml import etree
tree = etree.fromstring(dtd_str + xml_str)

from bs4 import BeautifulSoup
soup = BeautifulSoup(etree.tostring(tree, encoding='unicode'), "lxml-xml")
print(soup.find('p').get_text())

will print this:

Hello, ✳!

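which is exactly what I wanted. lxml resolves the entities declared in the DTD while parsing, so the string serialized with etree.tostring() and handed to BeautifulSoup already contains the literal character. BeautifulSoup, in turn, has a much nicer and more intuitive API for walking through the tree.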
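
The example above uses an inline DTD. For the external Zeichen.dtd from the question, the same approach should work if lxml is told to load the DTD. The following is only a minimal sketch under some assumptions: the helper name parse_with_external_dtd and the file name doc.xml are made up for illustration, and both files are written to a temporary directory so that the relative SYSTEM identifier can be resolved next to the document.

```python
import tempfile
from pathlib import Path

from lxml import etree


def parse_with_external_dtd(xml_text, dtd_text, dtd_name="Zeichen.dtd"):
    """Hypothetical helper: parse xml_text with entities from an external DTD."""
    with tempfile.TemporaryDirectory() as tmp:
        # Put the DTD and the document side by side so that the relative
        # SYSTEM identifier in the DOCTYPE points at an existing file.
        Path(tmp, dtd_name).write_text(dtd_text, encoding="utf-8")
        xml_path = Path(tmp, "doc.xml")  # assumed file name
        xml_path.write_text(xml_text, encoding="utf-8")

        # load_dtd=True makes lxml read the external DTD;
        # resolve_entities=True replaces entity references with their text.
        parser = etree.XMLParser(load_dtd=True, resolve_entities=True)
        return etree.parse(str(xml_path), parser).getroot()


dtd_text = '<!ENTITY stern_1 "&#10035;">\n'
xml_text = '<!DOCTYPE p SYSTEM "Zeichen.dtd"><p>Hello, &stern_1;!</p>'

root = parse_with_external_dtd(xml_text, dtd_text)
text = "".join(root.itertext())
print(text)
```

With real files on disk, the temporary directory is unnecessary: passing the same parser to etree.parse() on the TEI file should be enough, as long as Zeichen.dtd sits where the DOCTYPE declaration points. The resulting tree can then be serialized and passed to BeautifulSoup as shown above.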

Swen Vermeul