Random access to the text file in Java

Question

I’m parsing a really huge JSON file of 1.4 TB (it’s a WikiData dump just in case). It’s so big that even simple lines counting takes forever even with the help of optimizations like this Number of lines in a file in Java In order to speed it up I’m going to split the task and run it using both different SSDs on my main machine (so I probably get some extra disk throughput) and other computers I have (maybe using Apache Spark).

And the question is how do I start reading a file from a random position? Skipping the lines is obviously not an option :). And I also would like to try to avoid a physical splitting of this file. It’s actually the easiest and most traffic/disk-space-efficient solution but I would like to explore alternatives for some corner use cases.

Basically speaking I do the following:

JsonParser jp = f.createParser(new File(inputFile));
while(jp.nextToken() != JsonToken.END_OBJECT) {
     //Fancy stuff
}

Is there a way to quickly jump to line #20,000,000?

Does this answer your question? [Fast random access read/write access on big files in java](https://stackoverflow.com/questions/65717446/fast-random-access-read-write-access-on-big-files-in-java) — Jorge Campos, Apr 02 '23 at 19:24
You’ll have to parse the entire file once without storing its contents, and save a table of file positions (possibly in another file). JSON records are not fixed width, so there is no way around this. — VGR, Apr 02 '23 at 19:34
can you transform json into something else, or obtain another format entirely? (parquet would be nice, but even csv would help) — njzk2, Apr 02 '23 at 19:47

score 3 · Accepted Answer · answered Apr 02 '23 at 19:37

Your question assumes that your JSON has lines endings, which it most likely won't have. Such large files are likely stripped from all unneeded characters and line endings are certainly not needed in a JSON file.

You're already using the Jackson Streaming API which is good, because it's your only chance to process such a large file. While you can't seek to a certain line, you can seek to a certain (byte) position, using RandomAccessFile.html#seek(long). You need to "guessstimate" the position you want to jump to (based on the total file size). Since your seek will likely you put in a random position (e.g. inside a attribute value) you probably need to first use some custom parsing rules to find a valid starting point to let the JSON Streaming Parser start. Once you figured out when exactly in the JSON you are, you can use the parser as usual.

Thank you, Raphael, and thanks to @Jorge Campos too. So the task can be solved by using the FileChannel. It's possible to read a portion of a file into the byte array and use JsonFactory.createParser(byte[]) method then. And, yes, it will be necessary to trim this array from both sides to make sure that only full lines of text are present. — Sogawa-sps, Apr 08 '23 at 22:25

Random access to the text file in Java

1 Answers1