I have a weird problem with files.

I want to modify the timing of an .srt file, but writing the new file gives unexpected results.

Here's the sample code I wrote:

import java.io.*;
import java.nio.charset.Charset;

public class ReaderWriter {
    public static void main(String[] args) throws IOException {
        InputStream inputStream = new FileInputStream("D:\\E\\Movies\\English\\1960's\\TheApartment1960.srt");
        Reader reader = new InputStreamReader(inputStream,
                Charset.forName("UTF-8"));
        OutputStream outputStream = new FileOutputStream("output.srt");
        Writer writer = new OutputStreamWriter(outputStream,
                Charset.forName("UTF-8"));

        int data = reader.read();
        while (data != -1) {
            char theChar = (char) data;
            writer.write(theChar);
            data = reader.read();
        }
        reader.close();
        writer.close();
    }
}

This is a screenshot of the original file: [image]

However, the resulting file looks like this: [image]

I have searched a lot for a solution, but in vain. Any help would be appreciated.

Mohammed Deifallah
  • How are you viewing the output? In Notepad++ or something like that? Could it be a font issue? – JGFMK Feb 13 '20 at 23:27
  • @JGFMK It's in IntelliJ IDEA. However, I opened it in notepad++ with the same result. – Mohammed Deifallah Feb 13 '20 at 23:29
  • Could it be the original wasn't in UTF-8? - maybe some other charset? https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html – JGFMK Feb 13 '20 at 23:32
  • https://forum.videohelp.com/threads/306125-What-is-encoding-of-an-srt-file – JGFMK Feb 13 '20 at 23:33
  • @JGFMK How can I choose the correct encoding? – Mohammed Deifallah Feb 13 '20 at 23:38
  • The reason I mentioned Notepad++ is that it's supposed to detect and show the encoding out of the box. That also gets mentioned in https://stackoverflow.com/questions/90838/how-can-i-detect-the-encoding-codepage-of-a-text-file, which includes a Java solution even though the question is tagged C#. – JGFMK Feb 14 '20 at 07:48
  • https://i.stack.imgur.com/vh5kc.png – JGFMK Feb 14 '20 at 10:38

1 Answer

First a few points:

  • There is nothing wrong with your Java code. If I use it to read an input file containing Arabic text encoded in UTF-8 it creates the output file encoded in UTF-8 with no problems.
  • I don't think there is a font issue. Since you can successfully display the content of the input file there is no reason you cannot also successfully display the content of a valid output file.
  • Those black diamonds with question marks in the output file are replacement characters (U+FFFD), which are "used to replace an incoming character whose value is unknown or unrepresentable in Unicode". This indicates that the input file you are reading is not UTF-8 encoded, even though the code explicitly states that it is. I can reproduce similar results to yours if the input file is UTF-16 encoded, but specified as UTF-8 in the code.
  • The reverse mismatch causes the same kind of damage: the input file truly is UTF-8 encoded, but the code specifies it as UTF-16. For example, here is a valid UTF-8 input file with some Arabic text, read by code that (incorrectly) stated Reader reader = new InputStreamReader(inputStream, Charset.forName("UTF-16"));:

    يونكود في النظم القائمة وفيما يخص التطبيقات الحاسوبية، الخطوط، تصميم النصوص والحوسبة متعددة اللغات.

    And here is the output file, containing the replacement characters because the input stream of the UTF-8 file was incorrectly processed as UTF-16:

    ���⃙臙訠���ꟙ蓙苘Ꟙꛙ藘ꤠ���諘께딠�����ꟙ蓘귘Ꟙ동裘꣙諘꧘谠����꫘뗙藙諙蔠���⃙裘ꟙ蓘귙裘돘꣘ꤠ���⃘ꟙ蓙蓘뫘Ꟙꨮ�
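
The mismatch described above can be reproduced without any files. The following is only a minimal sketch: the class name EncodingMismatchDemo and the helper roundTrip are made up for illustration and are not part of the question's code. It encodes a short Arabic string with one charset and decodes the resulting bytes with another, which is what an InputStreamReader effectively does when it is given the wrong charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {

    /** Encodes the text with one charset, then decodes the bytes with another. */
    public static String roundTrip(String text, Charset writtenAs, Charset readAs) {
        byte[] bytes = text.getBytes(writtenAs);
        // new String(bytes, charset) substitutes U+FFFD for malformed input,
        // which is also what InputStreamReader does by default.
        return new String(bytes, readAs);
    }

    public static void main(String[] args) {
        String arabic = "\u064A\u0648\u0646\u0643\u0648\u062F"; // "يونكود"

        // Matching charsets: the text survives unchanged.
        String same = roundTrip(arabic, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println("UTF-8 read as UTF-8 is intact: " + same.equals(arabic));

        // Mismatch: Java's UTF-16 encoder starts with the byte order mark FE FF,
        // and neither byte is valid in UTF-8, so replacement characters appear.
        String wrong = roundTrip(arabic, StandardCharsets.UTF_16, StandardCharsets.UTF_8);
        System.out.println("UTF-16 read as UTF-8 contains U+FFFD: "
                + (wrong.indexOf('\uFFFD') >= 0));
    }
}
```

Once U+FFFD has been substituted, the original bytes are gone, so the output file cannot be repaired afterwards; the fix has to happen where the input is decoded.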

Given all that, simply ensuring that the encoding of the input file is specified correctly in the InputStreamReader constructor should solve your problem. To verify this, create another input file, save it with UTF-8 character encoding, and run your code against it. If that works, you know the problem was that the encoding of the original input file was not UTF-8.
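
If you would rather test the existing file than create a new one, a strict decoder can tell you whether its bytes are well-formed UTF-8. This is a rough sketch under assumptions: the class name Utf8Check and the method isValidUtf8 are hypothetical names chosen for illustration. It uses CharsetDecoder with CodingErrorAction.REPORT, so malformed input raises an exception instead of being silently replaced.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf8Check {

    /** Returns true if the bytes decode as well-formed UTF-8, false otherwise. */
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of inserting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java Utf8Check <path-to-file>");
            return;
        }
        // Reads the whole file into memory, which is fine for subtitle-sized files.
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(args[0] + (isValidUtf8(bytes)
                ? " decodes cleanly as UTF-8"
                : " is not valid UTF-8"));
    }
}
```

A result of "not valid UTF-8" only rules UTF-8 out; it does not say which encoding the file actually uses. That still has to come from the file's source or from an editor that reports the encoding, such as Notepad++.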

skomisa