0

Let's say we have a single line 10 GB file and we need to do some operations on it (it could be just random letters, arggzdfnbnipntrs).

As an example, we could be asked to check if such line is a palindrome, however we cannot load it fully in memory. We'd need to load the first character and compare it to the last, and so on. How is this done in Java? (including UTF-8 support)

Hydroxis
  • 95
  • 1
  • 12
  • Are you allowed to break the file up into multiple lines? – azurefrog Mar 24 '16 at 19:38
  • @azurefrog you can do whatever you want. – Hydroxis Mar 24 '16 at 19:38
  • If you can chop it up, then take a look at how to [read a file line by line in reverse order](http://stackoverflow.com/questions/6011345/read-a-file-line-by-line-in-reverse-order). Compare the first line against the last, the second against the second-to-last, etc. – azurefrog Mar 24 '16 at 19:40
  • That wouldn't be very efficient though. – Hydroxis Mar 24 '16 at 19:41
  • 2
    If you're worried about efficiency, why are you storing your data in a single line in a file instead of a database or something? – azurefrog Mar 24 '16 at 19:45
  • I'm just studying for an exam and this is given as a task that we should think about at home. I don't think this problem can find any practical usage :) – Hydroxis Mar 24 '16 at 19:48
  • SeekableByteChannel http://www.programcreek.com/java-api-examples/index.php?api=java.nio.channels.SeekableByteChannel – Dennis R Mar 24 '16 at 20:14

1 Answers1

2

UTF-8 is a variable length encoding, where each character is 1 to 6 bytes. You can't simply compare the first byte of the file to the last byte. Depending on the encoded length of the first character, you might need to compare the first byte with the sixth-to-last byte.

You can get relatively efficient random file access with RandomAccessFile or FileChannel, but the API (or the underlying file system) wasn't designed for reading "backward". To read backward, every read() would have to be preceded by a seek().

At some level, an entire block is read from the file system and held in memory, so actual seeking and reading of a physical hard drive head is minimized. The overhead involved in making billions of these calls from Java down to the operating system stacks up though, so it might be worthwhile to maintain your own buffer. A seek and a bulk read is performed only when the buffer is empty.

Luckily, your teacher didn't ask for Unicode support as well!

erickson
  • 265,237
  • 58
  • 395
  • 493