After parsing a UTF-16 file with lxml.etree.parse(), the serialized output contains the text as &#xxxx; character references. How can I convert it back?

The output file displays correctly in a web browser, but I also need the actual characters in the output file itself.

Example file:

<?xml version="1.0" encoding="UTF-16"?>
<?xml-stylesheet type="text/xsl" href="xxx.xsl"?>
<TEI.2>
<teiHeader></teiHeader>
<text>
<front></front>
<body>
<p rend="chapter">อธิกรณปจฺจยกถาวณฺณนา</p>

<p rend="bodytext" n="285"><hi rend="paranum">๒๘๕</hi><hi rend="dot">.</hi> <hi rend="bold">วิวาทาธิกรณมฺหา</hi>ติ ‘‘อธมฺมํ ‘ธโมฺม’ติ ทีเปตี’’ติอาทินยปฺปวตฺตา อฎฺฐารสเภทกรวตฺถุนิสฺสิตา วิวาทาธิกรณมฺหาฯ</p>
</body>
<back></back>
</text>
</TEI.2>

Code:

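# coding: utf8
import lxml.etree as ET

xml_filename = "example.xml"
# Parse the UTF-16 encoded file into an element tree.
dom = ET.parse(xml_filename)
# Serialize the tree; no encoding is given, so the default is used.
print ET.tostring(dom, pretty_print=True)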

Example output:

<?xml-stylesheet type="text/xsl" href="xxx.xsl"?><TEI.2>
<teiHeader/>
<text>
<front/>
<body>
<p rend="chapter">&#3607;&#3640;&#3585;&#3617;&#3634;&#3605;&#3636;&#3585;&#3634;&#3611;&#3607;&#3623;&#3603;&#3642;&#3603;&#3609;&#3634;</p>
</body>
<back/>
</text>
</TEI.2>
Bonn
  • Your code doesn't run; I don't see where you defined `xslt` and `newdom`. – danidee Sep 18 '16 at 10:55
  • Sorry, I have edited the question. – Bonn Sep 18 '16 at 10:58
  • I have now tried http://stackoverflow.com/a/12614706/3529093, but I get an error: UnicodeEncodeError: 'ascii' codec can't encode characters in position 125-141: ordinal not in range(128) – Bonn Sep 18 '16 at 11:14

1 Answer

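You need to specify the encoding when you call tostring. The default is ASCII, which is why every non-ASCII character is written as a numeric character reference: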

In [2]: !cat  "test.xml"
��<?xml version="1.0" encoding="UTF-16"?>
<?xml-stylesheet type="text/xsl" href="xxx.xsl"?>
<TEI.2>
    <teiHeader></teiHeader>
    <text>
        <front></front>
        <body>
            <p rend="chapter">-4#"2':2</p>

            <p rend="bodytext" n="285"><hi rend="paranum">RXU</hi><hi rend="dot">.</hi> <hi rend="bold">'4'224#!:+2</hi>4   -!:!M  B!:! 4 5@5  4-24":':2 -:2#*@ #':84*:*42 '4'224#!:+2/</p>
        </body>
        <back></back>
    </text>
</TEI.2>

In [3]: import lxml.etree as ET

In [4]: xml_filename = "test.xml"

In [5]: dom = ET.parse(xml_filename)

utf-16:

In [6]: print ET.tostring(dom, pretty_print=True, encoding="utf-16")
��<?xml version='1.0' encoding='utf-16'?>
<?xml-stylesheet type="text/xsl" href="xxx.xsl"?>
<TEI.2>
    <teiHeader/>
    <text>
        <front/>
        <body>
            <p rend="chapter">-4#"2':2</p>

            <p rend="bodytext" n="285"><hi rend="paranum">RXU</hi><hi rend="dot">.</hi> <hi rend="bold">'4'224#!:+2</hi>4   -!:!M  B!:! 4 5@5  4-24":':2 -:2#*@ #':84*:*42 '4'224#!:+2/</p>
        </body>
        <back/>
    </text>
</TEI.2>
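
The UTF-16 bytes above are not readable when printed to a terminal, but they are what should go into the output file. The following is a minimal sketch of that step, not the original session: it assumes Python 3 and uses the standard library's `xml.etree.ElementTree` in place of lxml so that it runs without extra dependencies (its `tostring` accepts the same `encoding` argument). The one-element document and the file name `out.xml` are hypothetical stand-ins for the TEI file.

```python
import tempfile
from pathlib import Path
import xml.etree.ElementTree as ET

# Hypothetical one-element document standing in for the TEI file.
root = ET.fromstring('<p rend="chapter">อธิกรณ</p>')

# Serialize to UTF-16; the result is bytes, not text.
utf16_bytes = ET.tostring(root, encoding="utf-16")

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "out.xml"  # hypothetical output file name
    # Write the bytes unchanged (binary mode), so no second encoding step happens.
    out_path.write_bytes(utf16_bytes)
    # Parse the file again to confirm that the text survived the round trip.
    reparsed = ET.parse(str(out_path)).getroot()

print(reparsed.text)
```

Writing the serialized bytes in binary mode keeps the file in the encoding that was requested from `tostring`, rather than whatever encoding the terminal or the default text mode would apply.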

utf-8:

In [7]: print ET.tostring(dom, pretty_print=True, encoding="utf-8")
<?xml-stylesheet type="text/xsl" href="xxx.xsl"?>
<TEI.2>
    <teiHeader/>
    <text>
        <front/>
        <body>
            <p rend="chapter">อธิกรณปจฺจยกถาวณฺณนา</p>

            <p rend="bodytext" n="285"><hi rend="paranum">๒๘๕</hi><hi rend="dot">.</hi> <hi rend="bold">วิวาทาธิกรณมฺหา</hi>ติ ‘‘อธมฺมํ ‘ธโมฺม’ติ ทีเปตี’’ติอาทินยปฺปวตฺตา อฎฺฐารสเภทกรวตฺถุนิสฺสิตา วิวาทาธิกรณมฺหาฯ</p>
        </body>
        <back/>
    </text>
</TEI.2>

ascii (default):

In [8]: print ET.tostring(dom, pretty_print=True, encoding="ascii")
<?xml-stylesheet type="text/xsl" href="xxx.xsl"?>
<TEI.2>
    <teiHeader/>
    <text>
        <front/>
        <body>
            <p rend="chapter">&#3629;&#3608;&#3636;&#3585;&#3619;&#3603;&#3611;&#3592;&#3642;&#3592;&#3618;&#3585;&#3606;&#3634;&#3623;&#3603;&#3642;&#3603;&#3609;&#3634;</p>

            <p rend="bodytext" n="285"><hi rend="paranum">&#3666;&#3672;&#3669;</hi><hi rend="dot">.</hi> <hi rend="bold">&#3623;&#3636;&#3623;&#3634;&#3607;&#3634;&#3608;&#3636;&#3585;&#3619;&#3603;&#3617;&#3642;&#3627;&#3634;</hi>&#3605;&#3636; &#8216;&#8216;&#3629;&#3608;&#3617;&#3642;&#3617;&#3661; &#8216;&#3608;&#3650;&#3617;&#3642;&#3617;&#8217;&#3605;&#3636; &#3607;&#3637;&#3648;&#3611;&#3605;&#3637;&#8217;&#8217;&#3605;&#3636;&#3629;&#3634;&#3607;&#3636;&#3609;&#3618;&#3611;&#3642;&#3611;&#3623;&#3605;&#3642;&#3605;&#3634; &#3629;&#3598;&#3642;&#3600;&#3634;&#3619;&#3626;&#3648;&#3616;&#3607;&#3585;&#3619;&#3623;&#3605;&#3642;&#3606;&#3640;&#3609;&#3636;&#3626;&#3642;&#3626;&#3636;&#3605;&#3634; &#3623;&#3636;&#3623;&#3634;&#3607;&#3634;&#3608;&#3636;&#3585;&#3619;&#3603;&#3617;&#3642;&#3627;&#3634;&#3631;</p>
        </body>
        <back/>
    </text>
</TEI.2>
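
The three outputs above can be compared side by side in a short sketch. This again assumes Python 3 and swaps in the standard library's `xml.etree.ElementTree` for lxml; the sample element is hypothetical. It also shows one way to convert output that has already been escaped: parse it again and serialize with an explicit encoding, since the parser resolves the `&#xxxx;` references back to characters.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document standing in for the TEI file.
root = ET.fromstring('<p rend="chapter">อธิกรณ</p>')

# No encoding given: ASCII output, non-ASCII characters become &#NNNN; references.
ascii_bytes = ET.tostring(root)

# Explicit UTF-8: the characters themselves are written, as UTF-8 bytes.
utf8_bytes = ET.tostring(root, encoding="utf-8")

# encoding="unicode" returns a str instead of bytes.
text = ET.tostring(root, encoding="unicode")

# Converting escaped output back: re-parse it, then serialize with an encoding
# that can represent the characters directly.
restored = ET.tostring(ET.fromstring(ascii_bytes), encoding="unicode")

print(ascii_bytes)
print(text)
print(restored == text)
```

Re-parsing is used here rather than a plain string replacement because an XML parser only resolves references where XML allows them, which keeps escaped markup characters such as `&lt;` intact in the re-serialized output.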
Padraic Cunningham