
I have an XML file that contains non-standard characters (like a weird "quote").

I read the XML using UTF-8 / ISO-8859-1 / ASCII and unmarshalled it:

BufferedReader br = new BufferedReader(new InputStreamReader(
        conn.getInputStream(), "ISO-8859-1"));
String output;
StringBuilder sb = new StringBuilder();
while ((output = br.readLine()) != null) {
    // fetch the XML line by line (readLine drops the line terminators)
    sb.append(output);
}


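try {
    jc = JAXBContext.newInstance(ServiceResponse.class);
    Unmarshaller unmarshaller = jc.createUnmarshaller();
    ServiceResponse OWrsp = (ServiceResponse) unmarshaller
            .unmarshal(new InputSource(new StringReader(sb.toString())));
} catch (JAXBException e) {
    // error handling was not shown in the original snippet
    throw new IllegalStateException(e);
}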
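The charset passed to `InputStreamReader` decides how the response bytes become characters before JAXB ever sees them. Here is a minimal sketch of that effect, assuming (hypothetically) that the service sends UTF-8 bytes; `DecodeDemo` and the sample text are illustrative names only:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DecodeDemo {
    private DecodeDemo() {}

    // Decode the same bytes under a given charset, as an InputStreamReader would.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "10–11" with an en dash (U+2013); in UTF-8 the dash is the bytes E2 80 93.
        byte[] utf8 = "10\u201311".getBytes(StandardCharsets.UTF_8);

        String asUtf8 = decode(utf8, StandardCharsets.UTF_8);
        String asLatin1 = decode(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(asUtf8.length());   // 5: the dash is a single char
        System.out.println(asLatin1.length()); // 7: each dash byte became its own char
    }
}
```

If the declared charset does not match the bytes on the wire, the string is already damaged before it is unmarshalled.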

I have an Oracle function that takes ISO-8859-1 codes (numeric character references) and maps them to "literal" symbols, e.g. "&#x2019" => "right single quote".

Using JAXB with ISO-8859-1 displays the characters with the ISO conversion fine, i.e. all the weird single quotes are encoded as "&#x2019".

So suppose my string is: class of 10–11‐year‐olds (note the weird - between 11 and year).

jc = JAXBContext.newInstance(ScienceProductBuilderInfoType.class);
Marshaller m = jc.createMarshaller();
m.setProperty(Marshaller.JAXB_ENCODING, "ISO-8859-1");
// save a temp file
File file2 = new File("tmp.xml");

This saves the following in the file:

class of 10–11‐year‐olds. (This is what I want, so saving the file works!)

[Side note: I have read the file back using a Java FileReader, and it outputs the above string fine.]

The issue is that the String produced by the JAXB unmarshaller has weird output. For some reason I cannot get the string to represent –.

When I check:

1: the unmarshalled XML output:

class of 10?11?year?olds

2: the file output:

class of 10–11‐year‐olds
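One possible explanation for the `?` characters (an assumption, not something the output above confirms): when a Java string is encoded to a charset that lacks a character, the encoder substitutes `?`. The sketch below shows that behaviour with `String.getBytes`; `QuestionMarkDemo` is an illustrative name.

```java
import java.nio.charset.StandardCharsets;

final class QuestionMarkDemo {
    private QuestionMarkDemo() {}

    // Encode to ISO-8859-1 and decode again, mimicking a writer or console
    // that is limited to that charset.
    static String roundTripLatin1(String s) {
        // Characters that ISO-8859-1 cannot represent are replaced with '?'.
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+2013 (en dash) and U+2010 (hyphen) are not part of ISO-8859-1.
        String text = "class of 10\u201311\u2010year\u2010olds";
        System.out.println(roundTripLatin1(text)); // class of 10?11?year?olds
    }
}
```

If this is what happens, the unmarshalled string itself may be intact and only its printed form is lossy; checking `(int) s.charAt(i)` on the suspicious positions would tell the two cases apart.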

I even tried reading the saved XML file back and unmarshalling that (in hopes of getting the – in my string):

String sCurrentLine;
BufferedReader br = new BufferedReader(new FileReader("tmp.xml"));
StringBuilder sb = new StringBuilder();
while ((sCurrentLine = br.readLine()) != null) {
    sb.append(sCurrentLine);
}

ScienceProductBuilderInfoType rsp = (ScienceProductBuilderInfoType) unm
        .unmarshal(new InputSource(new StringReader(sb.toString())));

To no avail.

Any ideas on how to get the ISO-8859-1 encoded character in JAXB?

Nate
  • What software do you use to display/view the unmarshalled string representation? (the "10?11?year?olds" text) – Joni Aug 22 '13 at 15:10
  • Eclipse console. I cannot figure out WHY JAXB is converting the – – Nate Aug 22 '13 at 15:47
  • How do you output the string to the console, with System.out? JAXB decodes entity references because that's what an XML parser should do, though iirc it can be configured to not do it. – Joni Aug 22 '13 at 20:03
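The point about entity decoding can be checked without JAXB. This is a small sketch using the JDK's DOM parser as a stand-in (an assumption: JAXB sits on top of an XML parser that behaves the same way for character references); `CharRefDemo` and `rootText` are illustrative names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class CharRefDemo {
    private CharRefDemo() {}

    // Parse a tiny XML document and return the text of its root element.
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Parser configuration, SAX, and I/O errors are wrapped for brevity.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String text = rootText("<t>10&#x2013;11</t>");
        // The reference is gone; the string holds the en dash itself.
        System.out.println(text.equals("10\u201311")); // true
        System.out.println(text.contains("&#x"));      // false
    }
}
```

So a string that should still contain `&#x2013` after parsing has to be re-escaped afterwards.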

1 Answer

Solved, using this tidbit of code found on Stack Overflow:

final class HtmlEncoder {
  private HtmlEncoder() {}

  public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
      T out) throws java.io.IOException {
    for (int i = 0; i < sequence.length(); i++) {
      char ch = sequence.charAt(i);
      if (Character.UnicodeBlock.of(ch) == Character.UnicodeBlock.BASIC_LATIN) {
        out.append(ch);
      } else {
        int codepoint = Character.codePointAt(sequence, i);
        // handle supplementary range chars
        i += Character.charCount(codepoint) - 1;
        // emit entity
        out.append("&#x");
        out.append(Integer.toHexString(codepoint));
        out.append(";");
      }
    }
    return out;
  }
}

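// escapeNonLatin takes the output Appendable as its second argument and declares IOException
String escaped = HtmlEncoder.escapeNonLatin(MYSTRING, new StringBuilder()).toString();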
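To see what this kind of escaping yields for the string from the question, here is a compact, self-contained sketch of the same idea. `EscapeDemo` and `escapeNonAscii` are illustrative names, not part of the code above; the sketch treats Basic Latin as the code points below U+0080.

```java
final class EscapeDemo {
    private EscapeDemo() {}

    // Replace every code point outside Basic Latin (U+0000..U+007F)
    // with a hexadecimal numeric character reference.
    static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp);
            } else {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "class of 10\u201311\u2010year\u2010olds";
        System.out.println(escapeNonAscii(text));
        // class of 10&#x2013;11&#x2010;year&#x2010;olds
    }
}
```

Note that this escapes everything outside ASCII, so a character that ISO-8859-1 can represent, such as é (U+00E9), also comes out as a reference (`&#xe9;`).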

Nate