
I am downloading an XML file from an FTP server, and I have to prepare it for my SAX parser. For this I need to delete the BOM and encode the file as UTF-8. But somehow it doesn't work with every file.

Here is my code for the relevant methods:

public static void copy(File src, File dest){

    try {
        byte[] data = Files.readAllBytes(src.toPath());

        writeAsUTF8(dest, skipBom(data));

    } catch (IOException e) {
        e.printStackTrace();
    }
}


private static void writeAsUTF8(File out, byte[] data){

    try {

        FileOutputStream outStream = new FileOutputStream(out);
        OutputStreamWriter outUTF = new OutputStreamWriter(outStream,"UTF8");

        outUTF.write(new String(data, "UTF8"));
        //outUTF.write(new String(data));
        outUTF.flush();
        outStream.close();
        outUTF.close();
    }
    catch(Exception ex){
        ex.printStackTrace();
    }
}

private static byte[] skipBom(byte[] data){

    int skipBytes = getBomSize(data);

    byte[] tmp = new byte[data.length - skipBytes];

    for(int x = 0; x < tmp.length; x++){
        tmp[x] = data[x + skipBytes];
    }

    return tmp;
}

Any ideas what I am doing wrong?

Adam Sam
  • Have you tried any of the ideas from [this question](http://stackoverflow.com/questions/1835430/byte-order-mark-screws-up-file-reading-in-java/)? – andyb Jan 27 '14 at 14:27

3 Answers


Simplify. Call the write method without stripping the BOM first:

    writeAsUTF8(dest, data);

and let it detect and skip the BOM itself, writing the raw bytes directly:

try {
    int bomLength = "\uFEFF".getBytes(StandardCharsets.UTF_8).length;
    if (!new String(data, 0, bomLength, StandardCharsets.UTF_8).equals("\uFEFF")) {
        bomLength = 0;
    }
    FileOutputStream outStream = new FileOutputStream(out);
    outStream.write(data, bomLength, data.length - bomLength);
    outStream.close();
}
catch (Exception ex) {
    ex.printStackTrace();
}

This checks whether the BOM (U+FEFF) is present. Simply reading everything as a String would be even simpler:

String xml = new String(data, StandardCharsets.UTF_8);
xml = xml.replaceFirst("^\uFEFF", "");

Using a Charset instead of a String encoding name means one exception less to catch: UnsupportedEncodingException (an IOException).
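
For illustration, a minimal sketch of the difference (the byte array is just a placeholder):

byte[] data = "example".getBytes(StandardCharsets.UTF_8);

// String-named charset: the checked UnsupportedEncodingException must be handled.
try {
    String a = new String(data, "UTF-8");
} catch (java.io.UnsupportedEncodingException e) {
    e.printStackTrace();
}

// Charset constant: no checked exception to catch.
String b = new String(data, StandardCharsets.UTF_8);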


Detecting the XML encoding:

String xml = new String(data, StandardCharsets.ISO_8859_1);
String encoding = xml.replaceFirst(
        "(?s)^.*<\\?xml.*encoding=([\"'])([\\w-]+)\\1.*\\?>.*$",
        "$2");

if (encoding.equals(xml)) {
    encoding = "UTF-8";
}
xml = new String(data, encoding);
xml = xml.replaceFirst("^\uFEFF", "");
Joop Eggen
  • The BOM is not the problem; deleting it always works. The main problem is the encoding: I am reading the file as bytes with .readAllBytes() and then trying to save it as UTF-8. The source file could have any encoding, but in the end it has to be UTF-8. – Adam Sam Jan 27 '14 at 15:08
  • Added using the encoding as declared in the XML. – Joop Eggen Jan 27 '14 at 16:34
  • this `"(?s)"^.*<\\?xml.*encoding=([\"'])(\w+)\\1.*\\?>.*$", "$2");` doesn't work – Adam Sam Jan 27 '14 at 16:41
  • I corrected the regex: extraneous `"`, missing backslash, forgotten `-` in encoding. – Joop Eggen Jan 27 '14 at 20:25

Why do you want to delete the BOM byte? You just need to read the file into a string using the encoding the file has, and then write the string to a file using UTF-8 encoding.
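
A minimal sketch of that approach, assuming the source file is UTF-8 with a BOM (substitute whatever encoding the file really has) and using placeholder file names:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Recode {
    public static void main(String[] args) throws Exception {
        Path src = Paths.get("input.xml");   // placeholder input path
        Path dest = Paths.get("output.xml"); // placeholder output path

        // Decode with the encoding the file actually has (UTF-8 assumed here).
        String xml = new String(Files.readAllBytes(src), StandardCharsets.UTF_8);

        // If the file began with a BOM, the decoder turns it into U+FEFF; drop it.
        if (xml.startsWith("\uFEFF")) {
            xml = xml.substring(1);
        }

        // Re-encode as UTF-8, this time without a BOM.
        Files.write(dest, xml.getBytes(StandardCharsets.UTF_8));
    }
}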

fatih
  • I would not, but then I get an exception while reading it with the SAX parser (the symbol at line 1 is not valid, or something like that). – Adam Sam Jan 27 '14 at 14:30
  • What do you feed to the SAX parser? When you feed an InputSource which contains a reader (knowing that the bytes have to be read as UTF-8), then everything should be fine; see the sketch after these comments. Or do I understand something wrong? – fatih Jan 27 '14 at 14:37
  • @fatih: No, that doesn't always work. If the first byte in your input stream is a BOM, then SAX will complain about an illegal byte and throw an exception. You need to get rid of the first byte in this case before handing the data on to SAX. – alexkelbo Jan 27 '14 at 14:43
  • I am feeding it with a File. – Adam Sam Jan 27 '14 at 14:43
  • I'd be very interested in the actual code and the file causing such an exception. – fatih Jan 27 '14 at 14:46
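
A minimal sketch of feeding SAX an InputSource built from a reader, as suggested above (the file name, the UTF-8 assumption, and the empty DefaultHandler are placeholders):

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxFromReader {
    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();

        // Wrap the file in a Reader so SAX receives decoded characters
        // rather than raw bytes.
        InputSource source = new InputSource(
                new InputStreamReader(new FileInputStream("file.xml"), StandardCharsets.UTF_8));

        // An empty handler just to make the example runnable.
        parser.parse(source, new DefaultHandler());
    }
}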

I can't figure out what's wrong with your code. I had the same problem some time ago and used the following code to solve it. First, the following function reads in a file, skipping the first character (the BOM). This of course only makes sense if you are sure that all of your files have a BOM.

public byte[] load (File inputFile, int lines) throws Exception {

    StringBuilder builder = new StringBuilder();

    try (BufferedReader reader
        = new BufferedReader(
            new InputStreamReader(
                new FileInputStream(inputFile), "UTF-8")))
    {
        // Discard the Byte Order Mark (the first decoded character)
        reader.read();

        String line = null;
        int lineCount = 0;

        while( lineCount <= lines && (line = reader.readLine()) != null ) {
            lineCount += 1;
            builder.append(line + "\n");
        }
    }

    return builder.toString().getBytes(StandardCharsets.UTF_8);
}

You can rewrite the above function to write the data back to another file in UTF-8; a sketch of that follows.
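
A minimal sketch of that write-back (the file names and the Integer.MAX_VALUE line limit are just placeholders):

// load() above already returns UTF-8 bytes with the BOM skipped,
// so they can be written to the destination file as-is.
byte[] utf8Bytes = load(new File("input.xml"), Integer.MAX_VALUE);
java.nio.file.Files.write(java.nio.file.Paths.get("output.xml"), utf8Bytes);

Apart from that, I have occasionally used the following method to convert a file on disk from ISO-8859-1 to UTF-8: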

public static void convertToUTF8 (Path p) throws Exception {
    Path docPath = p;
    Path docPathUTF8 = docPath;
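    // Note: docPathUTF8 points at the same file, so the conversion happens in
    // place; this works because the whole file is read into memory (and the
    // reader is closed) before anything is written.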

    InputStreamReader in = new InputStreamReader(new FileInputStream(docPath.toFile()), StandardCharsets.ISO_8859_1);

    CharBuffer cb = CharBuffer.allocate(100 * 1000 * 1000);
    int c = -1;

    while ( (c = in.read()) != -1 ) {
        cb.put((char) c);
    }
    in.close();

    OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(docPathUTF8.toFile()), StandardCharsets.UTF_8);

    char[] x = new char[cb.position()];
    System.arraycopy(cb.array(), 0, x, 0, x.length);

    out.write(x);
    out.flush();
    out.close();
}
alexkelbo