I have a file in ISO-8859-1 containing German umlauts, and I need to unmarshal it using JAXB. Before that, I need the content in UTF-8.

@Override
public List<Usage> convert(InputStream input) {
    try {
        InputStream inputWithNamespace = addNamespaceIfMissing(input);
        inputWithNamespace = convertFileToUtf(inputWithNamespace);
        ORDR order = xmlUnmarshaller.unmarshall(inputWithNamespace, ORDR.class);
        ...

I get the "file" as an InputStream. My idea was to read the file's content in UTF-8 and make another InputStream to use. This is what I've tried:

private InputStream convertFileToUtf(InputStream inputStream) throws IOException {
    byte[] bytesInIso = ByteStreams.toByteArray(inputStream);
    String stringIso = new String(bytesInIso);
    byte[] bytesInUtf = new String(bytesInIso, ISO_8859_1).getBytes(UTF_8);
    String stringUtf = new String(bytesInUtf);
    return new ByteArrayInputStream(bytesInUtf);
}

I keep those two Strings only to inspect the contents. Even just reading the ISO file gives question marks (?) where the umlauts should be, and converting that to UTF_8 gives strange characters like 1/2 and so on.

UPDATE

byte[] bytesInIso = ByteStreams.toByteArray(inputWithNamespace);
String contentInIso = new String(bytesInIso);

byte[] bytesInUtf = new String(bytesInIso, ISO_8859_1).getBytes(UTF_8);
String contentInUtf = new String(bytesInUtf);  

Inspecting contentInIso shows question marks instead of the umlauts, and contentInUtf has characters like "�" where the umlauts should be.

@Override
    public List<Usage> convert(InputStream input) {
        try {
            InputStream inputWithNamespace = addNamespaceIfMissing(input);

            byte[] bytesInIso = ByteStreams.toByteArray(inputWithNamespace);
            String contentInIso = new String(bytesInIso);

            byte[] bytesInUtf = new String(bytesInIso, ISO_8859_1).getBytes(UTF_8);
            String contentInUtf = new String(bytesInUtf);

            ORDR order = xmlUnmarshaller.unmarshall(inputWithNamespace, ORDR.class);

This convert method is called by another method, processUsageFile:

private void processUsageFile(File usageFile) {
        try (FileInputStream fileInputStream = new FileInputStream(usageFile)) {
            usageImporterService.importUsages(usageFile.getName(), fileInputStream, getUsageTypeValidated(usageFile.getName()));
            log.info("Usage file {} imported successfully. Moving to archive directory", usageFile.getName());

If I take the code under the UPDATE heading and put it immediately after the try, the first contentInIso has question marks but contentInUtf has the umlauts. Then, on entering convert, JAXB throws an exception saying that the file has a premature end of line.

  • JAXB can also unmarshal from a Reader, so you don't need to convert from ISO-8859-1 to UTF-8. Just construct a Reader to convert the ISO-8859-1 bytes to characters (e.g. `new InputStreamReader(inputStream, StandardCharsets.ISO_8859_1)`) and pass that Reader to `Unmarshaller.unmarshal(Reader)` instead of `Unmarshaller.unmarshal(InputStream)`. – Mark Rotteveel Jan 18 '21 at 16:34
  • I cannot do that because my unmarshall looks like this: `JAXBElement root = jaxbUnmarshaller.unmarshal(new StreamSource(input), tClass);` Using a Reader would mean changing code and tests that I don't want to touch. I found another way to do this, but ran into another problem. You can see the update. – Dolphy the Reaper Jan 18 '21 at 18:15

1 Answer

Regarding the behaviour you are getting,

String stringIso = new String(bytesInIso);

This step constructs a new String by decoding the byte array with the platform's default charset, because no charset is passed to the constructor.

Since the default is probably not ISO_8859_1, I think the String you are looking at gets garbled here.
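To illustrate the point, here is a minimal sketch that always names the charset when decoding. The class name `CharsetDemo` and the helpers `readAll` and `isoToUtf8` are made up for this example and are not part of your code; I am also assuming the input bytes really are ISO-8859-1.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {

    // Hypothetical helper: reads the whole stream into memory.
    // After this call the given stream is exhausted.
    public static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    // Hypothetical helper: decodes the bytes as ISO-8859-1 (assumed input
    // encoding), re-encodes them as UTF-8 and returns a fresh stream.
    public static InputStream isoToUtf8(InputStream isoInput) throws IOException {
        String text = new String(readAll(isoInput), StandardCharsets.ISO_8859_1);
        return new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        byte[] isoBytes = {(byte) 0xFC}; // the letter u-umlaut in ISO-8859-1

        // No charset given: decoded with the platform default, result varies by machine.
        String platformDependent = new String(isoBytes);

        // Charset given: decoded the same way on every machine.
        String explicit = new String(isoBytes, StandardCharsets.ISO_8859_1);

        System.out.println(explicit.equals("\u00fc"));          // explicit decode yields u-umlaut
        System.out.println(platformDependent.equals(explicit)); // depends on the default charset

        // To inspect UTF-8 bytes as text, decode them as UTF-8 as well.
        InputStream utf8 = isoToUtf8(new ByteArrayInputStream(isoBytes));
        System.out.println(new String(readAll(utf8), StandardCharsets.UTF_8).equals(explicit));
    }
}
```

Note that a stream that has been read to the end has no bytes left, so whatever consumes the data next should be given the new stream returned by the helper rather than the original one.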

Jems