
I'm trying to read UTF-8 from a text file and do some tokenization, but I'm having issues with the encoding:

FileInputStream fis = null;
try {
    fis = new FileInputStream(fName);
} catch (FileNotFoundException ex) {
    //...
}

DataInputStream myInput = new DataInputStream(fis);
try {
    String thisLine;
    while ((thisLine = myInput.readLine()) != null) {
        StringTokenizer st = new StringTokenizer(thisLine, ";");
        while (st.hasMoreElements()) {
            // do something with st.nextToken();
        }
    }
} catch (Exception e) {
    //...
}

and DataInputStream doesn't have any parameters to set the encoding!

  • A rough guide to Java character encoding: http://illegalargumentexception.blogspot.com/2009/05/java-rough-guide-to-character-encoding.html – McDowell May 06 '09 at 19:41

6 Answers


Let me quote the Javadoc for this method.

DataInputStream.readLine()

Deprecated. This method does not properly convert bytes to characters. As of JDK 1.1, the preferred way to read lines of text is via the BufferedReader.readLine() method. Programs that use the DataInputStream class to read lines can be converted to use the BufferedReader class by replacing code of the form:

     DataInputStream d = new DataInputStream(in);

with:

     BufferedReader d
          = new BufferedReader(new InputStreamReader(in));

BTW: JDK 1.1 came out in Feb 1997, so this shouldn't be new to you.

Just think how much time everyone would have saved if you had read the Javadoc. ;)
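Putting the Javadoc's suggestion together with an explicit UTF-8 charset, the question's loop could look like this. A sketch only: the `Utf8Lines` class name and the `"data.txt"` path are placeholders for the question's own `fName`.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class Utf8Lines {

    // Reads every line from the Reader and splits each one on ';',
    // mirroring the loop in the question.
    static List<String> readTokens(Reader source) throws IOException {
        BufferedReader reader = new BufferedReader(source);
        List<String> tokens = new ArrayList<String>();
        String thisLine;
        while ((thisLine = reader.readLine()) != null) {
            StringTokenizer st = new StringTokenizer(thisLine, ";");
            while (st.hasMoreTokens()) {
                tokens.add(st.nextToken());
            }
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // The InputStreamReader is where the encoding gets specified.
        Reader in = new InputStreamReader(new FileInputStream("data.txt"), "UTF-8");
        for (String token : readTokens(in)) {
            System.out.println(token);
        }
    }
}
```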

Peter Lawrey

You can use InputStreamReader:

BufferedReader br = new BufferedReader(new InputStreamReader(source, charset));
while (br.readLine() != null) { ... }

You can also try Scanner, but I'm not sure it would work well.
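For what it's worth, Scanner does accept a charset name in its constructor. A minimal sketch, where the class name and `"data.txt"` path are placeholders:

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class ScannerDemo {
    public static void main(String[] args) throws FileNotFoundException {
        // Scanner takes a charset name directly alongside the file.
        Scanner sc = new Scanner(new File("data.txt"), "UTF-8");
        while (sc.hasNextLine()) {
            for (String token : sc.nextLine().split(";")) {
                // do something with token
                System.out.println(token);
            }
        }
        sc.close();
    }
}
```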

Roman

Why not use InputStreamReader and specify the encoding? You can then wrap it in a BufferedReader to provide the readLine() capability.

Brian Agnew

When you are reading text (not binary data) you should use a Reader (not an InputStream). You can then specify the encoding for the VM with -Dfile.encoding=utf-8, and the Reader will automatically use it, so you can easily switch encodings. You can wrap a FileReader in a BufferedReader to get readLine(). The readLine() method is only meaningful when reading text; otherwise the line endings are just bytes.
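A sketch of that approach, relying on the JVM default charset rather than hardcoding one; the class name and `"data.txt"` path are placeholders:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

public class DefaultCharsetDemo {

    // Collects every line from a Reader, whatever encoding it decodes with.
    static List<String> readAllLines(Reader source) throws IOException {
        BufferedReader br = new BufferedReader(source);
        List<String> lines = new ArrayList<String>();
        String line;
        while ((line = br.readLine()) != null) {
            lines.add(line);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // FileReader decodes with the JVM default charset, so launch with:
        //   java -Dfile.encoding=utf-8 DefaultCharsetDemo
        for (String line : readAllLines(new FileReader("data.txt"))) {
            System.out.println(line);
        }
    }
}
```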

Norbert Hartl
  • Changing the default encoding via the command line (-Dfile.encoding=...) is OK for small utilities, but can have unwanted side-effects for interactions with the system - affecting System.out, for example. – McDowell May 06 '09 at 19:38
  • To me it sounded like a little utility, so you gain a lot of flexibility by letting Java do the magic. You are right that it is not a good idea to switch the encoding in a bigger application, but having hardcoded encodings throughout your code is not much better. And not specifying file.encoding, which means it is taken from the system, doesn't save you from side effects either – Norbert Hartl May 06 '09 at 20:06

One very simple way:

File myFile = ...

String contents = Files.toString(myFile, Charsets.UTF_8);
for (String token : contents.split(";")) {
    // do something with token
}

Where Files and Charsets are from Guava. Or if you need to handle the file line by line, start with this instead:

List<String> lines = Files.readLines(myFile, Charsets.UTF_8);

Also note that split() is simpler to use here than StringTokenizer.

Know and use the libraries, as I've become fond of saying. (Of course, reading the whole file at once may not suit all situations.)

Edit (2013): Switched my recommendation from Apache Commons IO to Guava, which is an overall cleaner and more actively maintained library.

Jonik

StringTokenizer is an extremely simple class for text tokenization. I can only recommend it for tasks that do not need to further identify the tokens (e.g. via a dictionary lookup) and that will only be used for Western languages.

For more advanced cases involving Western languages, a simple tokenizer can be written based on Unicode character classes (which will pick up many kinds of whitespace, delimiting characters, etc.) and then extended with regexes to catch special cases (like 'that's', 'C++'...).
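A sketch of that approach; the pattern and its special cases are illustrative, not exhaustive:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeTokenizer {

    // Special cases first (contractions like "that's", tokens like "C++"),
    // then plain runs of Unicode letters or digits.
    private static final Pattern TOKEN =
            Pattern.compile("\\p{L}+'\\p{L}+|\\p{L}\\+\\+|[\\p{L}\\p{Nd}]+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [that's, a, C++, résumé, from, 1997]
        System.out.println(tokenize("that's a C++ résumé from 1997"));
    }
}
```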

pudo