
I'm working on extracting ISO-8859-2 encoded text from an XML file. It works fine, except that some special characters are written as numeric character references (their HTML-style codes). The XML file:

<?xml version="1.0" encoding="iso-8859-2"?>
<!DOCTYPE TEI.2 SYSTEM "http://mek.oszk.hu/mekdtd/prose/TEI-MEK-prose.dtd">
<!-- ?xml-stylesheet type="text/xsl" href="http://mek.oszk.hu/mekdtd/xsl/boszorkany_txt.xsl"? -->
<TEI.2 id="MEK-00798">
    <text type="novel">
        <front>
            <titlePage>
                <docAuthor>Jókai Mór</docAuthor>
                <docTitle>
                    <titlePart>Az arany ember</titlePart>
                </docTitle>
            </titlePage>
        </front>
        <body>
            <div type="part">
                <head>
                    <title>A Szent Borbála</title>
                </head>
                <div type="chapter">
                    <head>
                        <title>I. A VASKAPU</title>
                    </head>
                    <p text-align="justify">A kitartó hetes vihar. &#150; Ez járhatlanná teszi a Dunát a Vaskapu
                        között.
                    </p>
                </div>
            </div>
        </body>
    </text>
</TEI.2>

A snippet of the code I use:

        import org.dom4j.Document;
        import org.dom4j.Node;
        import org.dom4j.io.SAXReader;

        SAXReader reader = new SAXReader();
        reader.setEncoding("ISO-8859-2");

        Document document = reader.read(file);
        // Select the paragraph inside the chapter-level div
        Node node = document.selectSingleNode("//*[@type='chapter']/p");
        String text = node.getStringValue();
        // Attempts that did not help:
        // String text = org.jsoup.parser.Parser.unescapeEntities(node.getStringValue(), true);
        // String text = org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4(node.getStringValue());

I also included in comments some libraries I tried, without any success.

What I want to see is: A kitartó hetes vihar. - Ez járhatlanná teszi a Dunát a Vaskapu között. What I see when I debug is: A kitartó hetes vihar . Ez járhatlanná teszi a Dunát a Vaskapu között.

aBnormaLz
Dániel Barta
  • The problem is not parsing; all XML parsers will correctly interpret numeric character references like `&#65;`. The problem is that 150 [is not a valid ISO 8859-2 codepoint](https://en.wikipedia.org/wiki/ISO/IEC_8859-2#Code_page_layout). It appears this XML document uses the ["windows-1250"](https://en.wikipedia.org/wiki/Windows-1250) encoding, despite what is declared in the XML prologue. – VGR May 17 '19 at 14:22
  • Thank you @VGR, this was my problem eventually. I did not realize that it uses two encodings. – Dániel Barta May 20 '19 at 10:33
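
Following up on the comments above, here is a minimal sketch of one possible workaround, not a confirmed fix. A numeric character reference such as `&#150;` is resolved by the parser to the Unicode code point U+0096, an invisible C1 control character, whatever encoding the prologue declares. If the author of the file meant the windows-1250 byte 0x96 (an en dash), the parsed string can be post-processed by reinterpreting each character in the range U+0080 to U+009F as a windows-1250 byte. The class name `C1Repair` and the method `repairC1` are hypothetical names used only for illustration, and the sketch assumes the JRE ships the `windows-1250` charset, which the Java specification does not guarantee.

```java
import java.nio.charset.Charset;

public class C1Repair {

    // Assumption: this charset is available in the running JRE.
    private static final Charset WINDOWS_1250 = Charset.forName("windows-1250");

    /**
     * Replaces each C1 control character (U+0080 to U+009F) with the character
     * that the same byte value denotes in windows-1250. Bytes that are
     * undefined in windows-1250 decode to U+FFFD; in that case the original
     * character is kept unchanged.
     */
    public static String repairC1(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0x80 && c <= 0x9F) {
                String decoded = new String(new byte[] {(byte) c}, WINDOWS_1250);
                out.append("\uFFFD".equals(decoded) ? String.valueOf(c) : decoded);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Simulates what node.getStringValue() returns for the sample paragraph.
        String parsed = "A kitart\u00F3 hetes vihar. \u0096 Ez j\u00E1rhatlann\u00E1 teszi a Dun\u00E1t";
        System.out.println(repairC1(parsed));
    }
}
```

With the code from the question, this would be applied as `String text = C1Repair.repairC1(node.getStringValue());`. Characters outside the C1 range, including the accented Hungarian letters, pass through untouched.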

0 Answers