This might be related to my previous question (on how to convert "fÃ¶r" to "för").

So I have a file that I create in my code. Right now I create it by the following code:

FileWriter fwOne = new FileWriter(wordIndexPath);
BufferedWriter wordIndex = new BufferedWriter(fwOne);

followed by a few

wordIndex.write(wordBuilder.toString()); //that's a StringBuilder

ending (after a while-loop) with a

wordIndex.close();

Now the problem is that later on this file is huge, and I want (need) to jump around in it without reading through the entire file. The seek(long pos) method of RandomAccessFile lets me do this.

Here's my problem: the characters in the file I've created seem to be encoded as UTF-8, and the only information I have when I seek is the character position I want to jump to. seek(long pos), on the other hand, jumps in bytes, so I don't end up in the right place, since a UTF-8 character can be more than one byte.

Here's my question: Can I, when I write the file, write it in ISO-8859-15 instead (where a character is a byte)? That way the seek(long pos) will get me in the right position. Or should I instead try to use an alternative to RandomAccessFile (is there an alternative where you can jump to a character-position?)

MrJalapeno
  • Could you make use of FileOutputStream? – Shankar Shastri Sep 01 '16 at 08:36
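  • If a byte you read has a value less than 128, then it is a complete single-byte (ASCII) character in UTF-8. A byte value of 128-191 is a continuation byte in the middle of a sequence, and a value of 192 or more starts a multi-byte sequence. You can seek randomly and then skip forward past any continuation bytes to find the next character boundary. – Phylogenesis Sep 01 '16 at 08:36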
  • You can make use of FileOutputStream. http://stackoverflow.com/questions/1001540/how-to-write-a-utf-8-file-with-java – Shankar Shastri Sep 01 '16 at 08:37
  • Possible duplicate of [How to deal with a very large text file?](http://stackoverflow.com/questions/4722743/how-to-deal-with-a-very-large-text-file) – Raedwald Sep 01 '16 at 08:50
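
A minimal sketch of the boundary idea from the comments, assuming the file is valid UTF-8: continuation bytes have the bit pattern 10xxxxxx (0x80 to 0xBF), and every other byte starts a character. The class and method names (Utf8Boundary, nextBoundary) are made up for illustration. Note that this only aligns an arbitrary byte offset to the next character start; it does not by itself turn a character position into a byte offset.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

class Utf8Boundary {

    /** True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
    static boolean isContinuation(int b) {
        return (b & 0xC0) == 0x80;
    }

    /** Returns the first index >= pos that starts a UTF-8 character, or data.length. */
    static int nextBoundary(byte[] data, int pos) {
        while (pos < data.length && isContinuation(data[pos])) {
            pos++;
        }
        return pos;
    }

    /** Same idea on a file: seek to pos, skip continuation bytes, leave the file pointer on the boundary. */
    static long nextBoundary(RandomAccessFile raf, long pos) throws IOException {
        raf.seek(pos);
        int b;
        while ((b = raf.read()) != -1 && isContinuation(b)) {
            pos++;
        }
        raf.seek(pos);
        return pos;
    }

    public static void main(String[] args) {
        // "för" in UTF-8 is the four bytes 66 C3 B6 72
        byte[] data = "f\u00f6r".getBytes(StandardCharsets.UTF_8);
        System.out.println(nextBoundary(data, 1)); // 1: C3 is a lead byte, already a boundary
        System.out.println(nextBoundary(data, 2)); // 3: B6 is a continuation byte, so skip to 'r'
    }
}
```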

1 Answer

First, the worrying part. FileWriter and FileReader are old utility classes that use the default platform encoding of the computer they run on. Run elsewhere, the same code may produce a different file, or fail to read correctly a file that came from another machine.

ISO-8859-15 is a single-byte encoding. Java, however, holds text in Unicode so that it can combine all scripts, and a char is a UTF-16 code unit. In general a char index is not a byte index, but with a single-byte encoding it probably works in your case. Be aware that a line break may be one char/byte (\n) or two (\r\n), depending on the platform.
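
To make the char-versus-byte point concrete, here is a small sketch comparing the length of a string in chars with its encoded length in bytes. The class name EncodedLength and the sample word are only illustrative, and it assumes the JVM ships ISO-8859-15 (common JDKs do, but it is not one of the charsets every Java platform must support).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedLength {

    /** Number of bytes the string occupies once encoded with the given charset. */
    static int byteLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        Charset latin9 = Charset.forName("ISO-8859-15"); // assumed to be available
        String word = "f\u00f6r\u20ac";                  // "för" followed by the euro sign

        System.out.println(word.length());                              // 4 chars
        System.out.println(byteLength(word, latin9));                   // 4 bytes: one per char
        System.out.println(byteLength(word, StandardCharsets.UTF_8));   // 7 bytes: 1 + 2 + 1 + 3
    }
}
```

With ISO-8859-15 the char index and the byte offset coincide; with UTF-8 they drift apart as soon as a non-ASCII character appears.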

Personally I think UTF-8 is well established, and it is easier to use:

byte[] bytes = string.getBytes(StandardCharsets.UTF_8);
string = new String(bytes, StandardCharsets.UTF_8);

That way all special characters (typographic quotes, the euro sign, and so on) will always be available.

At least specify the encoding:

Files.newBufferedWriter(file.toPath(), Charset.forName("ISO-8859-15"));
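
As a rough sketch of how the single-byte route could look end to end: write the file with an explicit ISO-8859-15 writer, then seek by character position. The names Latin9Seek, readAt and demo are hypothetical, the sample lines are invented, and the temp file stands in for the question's wordIndexPath. It writes "\n" explicitly so that every line break is exactly one byte on every platform.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

class Latin9Seek {

    static final Charset LATIN9 = Charset.forName("ISO-8859-15"); // assumed to be available

    /** Reads count characters starting at character position charPos. */
    static String readAt(Path file, long charPos, int count) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(charPos);          // one byte per character, so char position == byte offset
            byte[] buf = new byte[count];
            raf.readFully(buf);         // throws EOFException if fewer than count bytes remain
            return new String(buf, LATIN9);
        }
    }

    /** Writes two sample lines to a temp file and reads two characters back from position 7. */
    static String demo() {
        try {
            Path file = Files.createTempFile("wordIndex", ".txt");
            file.toFile().deleteOnExit();
            try (BufferedWriter wordIndex = Files.newBufferedWriter(file, LATIN9)) {
                wordIndex.write("f\u00f6r 12\n"); // 7 characters, positions 0 to 6
                wordIndex.write("\u00e4r 34\n");  // starts at position 7
            }
            return readAt(file, 7, 2);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // the first two characters of the second line
    }
}
```

The trade-off is that any character outside ISO-8859-15 cannot be stored in such a file.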
Joop Eggen
  • Thanks so much for your answer. Here's a question though: if I go with the well-established UTF-8, how would I solve searching through the file? (Right now I can jump to a specific byte position with RandomAccessFile.seek(long pos).) – MrJalapeno Sep 01 '16 at 09:07
  • One could use a memory-mapped ByteBuffer, go through it with a CharsetDecoder to get exact file positions, and index those positions. So open a FileChannel in read-only mode for the indexing; that is relatively fast. Start with sample code. – Joop Eggen Sep 01 '16 at 10:39
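
One possible shape for that indexing idea, as a sketch rather than a finished solution: map the file read-only, record the byte offset of every step-th character, then seek to the nearest checkpoint and scan forward. Instead of a CharsetDecoder, this version simply counts UTF-8 lead bytes, which assumes the file is valid UTF-8 and that a "character position" means a Unicode code point (for text without supplementary characters that equals the Java char index). The names Utf8CharIndex, build, byteOffset and the step parameter are invented for the example.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.stream.LongStream;

class Utf8CharIndex {

    /** Builds a sparse index: index[k] is the byte offset of code point number k * step. */
    static long[] build(Path file, int step) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            // One mapping is limited to Integer.MAX_VALUE bytes; a larger file would need several.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
            LongStream.Builder offsets = LongStream.builder();
            long codePoints = 0;
            for (int pos = 0; pos < size; pos++) {
                boolean startsCharacter = (buf.get(pos) & 0xC0) != 0x80;
                if (startsCharacter) {
                    if (codePoints % step == 0) {
                        offsets.add(pos);
                    }
                    codePoints++;
                }
            }
            return offsets.build().toArray();
        }
    }

    /** Byte offset of code point number charPos: jump to the checkpoint, then scan forward. */
    static long byteOffset(Path file, long[] index, int step, long charPos) throws IOException {
        long offset = index[(int) (charPos / step)];
        long remaining = charPos % step;
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset);
            int b = raf.read(); // lead byte of the character at the checkpoint
            while (remaining > 0) {
                if (b == -1) {
                    throw new EOFException("character position is beyond the end of the file");
                }
                // Step over the current character: its lead byte plus any continuation bytes.
                do {
                    offset++;
                    b = raf.read();
                } while (b != -1 && (b & 0xC0) == 0x80);
                remaining--;
            }
        }
        return offset;
    }

    /** Indexes a five-character sample (bytes: a, C3 B6, b, E2 82 AC, c) and looks up character 3. */
    static long demo() {
        try {
            Path file = Files.createTempFile("wordIndex", ".txt");
            file.toFile().deleteOnExit();
            Files.write(file, "a\u00f6b\u20acc".getBytes(StandardCharsets.UTF_8));
            long[] index = build(file, 2);
            return byteOffset(file, index, 2, 3);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 4: the euro sign starts at byte offset 4
    }
}
```

The returned offset can be passed straight to RandomAccessFile.seek(long pos). A larger step keeps the index small at the cost of a longer forward scan per lookup.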