
I am trying to read a file into a string. I set the encoding to UTF-8, but it still fails: the output contains some weird characters.

Here is my function to read file:

private static String readFile(String path, boolean isRaw) throws UnsupportedEncodingException, FileNotFoundException {
    File fileDir = new File(path);
    try {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(
                        new FileInputStream(fileDir), "UTF-8"));

        String str;

        while ((str = in.readLine()) != null) {
            System.out.println(str);
        }

        in.close();
        return str;
    } catch (UnsupportedEncodingException e) {
        System.out.println(e.getMessage());
    } catch (IOException e) {
        System.out.println(e.getMessage());
    } catch (Exception e) {
        System.out.println(e.getMessage());
    }
    return null;
}

The first line of the output is: ��1

Here is my test file: https://www.dropbox.com/s/2linqmdoni77e5b/How.to.Get.Away.with.Murder.S01E01.720p.HDTV.X264-DIMENSION.srt?dl=0

Thanks in advance.

Tagir Valeev
Han Tran

2 Answers

3

This file is encoded in UTF-16LE and starts with a byte order mark (BOM), which identifies the encoding. Use the "UTF-16LE" charset (or StandardCharsets.UTF_16LE) and skip the first character of the file, for example by calling str.substring(1) on the first line.
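
A minimal sketch of that approach, assuming the file really is UTF-16LE. The class name Utf16LeReader and the method read are made up for illustration. Unlike a blind substring(1), it drops the first character only when it actually is the BOM (U+FEFF), so a file without a BOM is left intact.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the original answer.
final class Utf16LeReader {

    /** Reads a UTF-16LE file into a String, dropping a leading BOM (U+FEFF) if present. */
    static String read(Path path) throws IOException {
        StringBuilder content = new StringBuilder();
        try (BufferedReader in = Files.newBufferedReader(path, StandardCharsets.UTF_16LE)) {
            String line;
            boolean first = true;
            while ((line = in.readLine()) != null) {
                // The UTF-16LE decoder keeps the BOM as an ordinary character,
                // so it shows up at the start of the first line.
                if (first && line.startsWith("\uFEFF")) {
                    line = line.substring(1);
                }
                first = false;
                content.append(line).append('\n');
            }
        }
        return content.toString();
    }
}
```

Calling Utf16LeReader.read(Paths.get(path)) returns the whole file with lines separated by '\n'. The original line terminators are normalized, which may or may not matter for your use case.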

Tagir Valeev
  • Thanks, that solved my problem, but I didn't need to remove the first character of the file. I debugged it and the result looks fine. – Han Tran Dec 01 '15 at 06:50
1

It looks like your file starts with a byte order mark (BOM). If you don't need to handle the BOM, open the file in Notepad++ and convert it to UTF-8 without BOM.

To handle a file with a BOM in Java, take a look at BOMInputStream from Apache Commons IO.

Example:

// Requires Apache Commons IO: org.apache.commons.io.input.BOMInputStream and org.apache.commons.io.ByteOrderMark
private static String readFile(String path, boolean isRaw) {
    File fileDir = new File(path);

    // List every BOM that should be detected; BOMInputStream strips the one it finds.
    try (BOMInputStream bomIn = new BOMInputStream(new FileInputStream(fileDir), ByteOrderMark.UTF_16LE,
            ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_32LE, ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_8)) {

        // Decode with the charset the BOM announces; fall back to UTF-8 when there is no BOM.
        String charset = "UTF-8";
        if (bomIn.hasBOM()) {
            charset = bomIn.getBOMCharsetName();
            System.out.println("Input file has a BOM (" + charset + "); the BOM has been removed");
        }

        BufferedReader in = new BufferedReader(new InputStreamReader(bomIn, charset));

        // Collect the lines so the method returns the content instead of the final null from readLine().
        StringBuilder content = new StringBuilder();
        String str;

        while ((str = in.readLine()) != null) {
            System.out.println(str);
            content.append(str).append('\n');
        }

        return content.toString();
    } catch (IOException e) {
        // Also covers FileNotFoundException and UnsupportedEncodingException.
        System.out.println(e.getMessage());
    }
    return null;
}
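
If adding Commons IO is not an option, the same idea can be sketched with the JDK alone. This is an illustrative sketch, not the library's implementation: the class name BomSniffer and the method open are invented here, and it only recognizes the UTF-8, UTF-16LE and UTF-16BE marks (a UTF-32LE file would be misread as UTF-16LE). It peeks at the first bytes, picks the charset from the BOM, and pushes back whatever was not part of the BOM.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class BomSniffer {

    /** Returns a Reader whose charset is chosen from a leading BOM; the BOM bytes are consumed. */
    static Reader open(InputStream raw, Charset fallback) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 3);
        byte[] head = new byte[3];

        // Read up to three bytes; short files may deliver fewer.
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }

        Charset charset = fallback;
        int bomLength = 0;
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            charset = StandardCharsets.UTF_8;
            bomLength = 3;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            charset = StandardCharsets.UTF_16LE;
            bomLength = 2;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            charset = StandardCharsets.UTF_16BE;
            bomLength = 2;
        }

        // Put back every byte that was read but is not part of the BOM.
        if (n > bomLength) {
            in.unread(head, bomLength, n - bomLength);
        }
        return new InputStreamReader(in, charset);
    }
}
```

Wrapping the result in a BufferedReader, as in new BufferedReader(BomSniffer.open(new FileInputStream(fileDir), StandardCharsets.UTF_8)), gives the same readLine() loop as above. A file without a BOM is decoded with the fallback charset and loses no characters.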
Andreas Baaserud
  • Yes, the problem is that I need to use the "UTF-16LE" charset, as in @Tagir Valeev's answer. Thanks! – Han Tran Dec 01 '15 at 06:51
  • Yes, but don't just remove the first char of your file because it has a BOM. One day you will want to read a file without a BOM, and then you will end up removing a char you want to keep. Treat a BOM file as a BOM file; that is where BOMInputStream comes in handy. – Andreas Baaserud Dec 01 '15 at 06:56
  • I've tried BOMInputStream, but it doesn't seem to work: bomIn.hasBOM() always returns false, even with demo files. Is there any way to detect the BOM? – Han Tran Dec 01 '15 at 07:29
  • Okay, I resolved that with http://stackoverflow.com/questions/1835430/byte-order-mark-screws-up-file-reading-in-java/1835577#1835577 – Han Tran Dec 01 '15 at 07:57