I’m parsing a really huge JSON file of 1.4 TB (it’s a WikiData dump, in case that matters). It’s so big that even simple line counting takes forever, even with optimizations like the ones in “Number of lines in a file in Java”. To speed things up, I’m going to split the task and run it both across different SSDs on my main machine (so I can probably get some extra disk throughput) and on the other computers I have (maybe using Apache Spark).
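For context, the counting trick I’m referring to boils down to scanning big buffers for newline bytes instead of materializing line objects, roughly like this (a minimal sketch; the path is a placeholder):

    import java.io.FileInputStream;
    import java.io.IOException;

    // Count '\n' bytes with a large read buffer; no per-line allocations.
    static long countLines(String path) throws IOException {
        try (FileInputStream in = new FileInputStream(path)) {
            byte[] buf = new byte[1 << 20]; // 1 MiB chunks
            long lines = 0;
            int n;
            while ((n = in.read(buf)) > 0) {
                for (int i = 0; i < n; i++) {
                    if (buf[i] == '\n') lines++;
                }
            }
            return lines;
        }
    }

Even so, reading 1.4 TB sequentially is the real bottleneck, which is why I want to parallelize across drives and machines.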
So the question is: how do I start reading the file from an arbitrary position? Skipping lines one by one is obviously not an option :). I would also like to avoid physically splitting the file. Splitting is actually the easiest and most traffic/disk-space-efficient solution, but I’d like to explore alternatives for some corner use cases.
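My rough idea (untested; the offset is arbitrary, just for illustration) is to seek to a byte offset and then re-synchronize on the next newline, since as far as I can tell the dump keeps one entity per line:

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Jump to an arbitrary byte offset, then skip the (possibly partial) line
    // we landed in, so reading resumes at the start of a full line/entity.
    static long alignToNextLine(RandomAccessFile raf, long offset) throws IOException {
        raf.seek(offset);
        int b;
        while ((b = raf.read()) != -1 && b != '\n') {
            // discard bytes until the end of the current line
        }
        return raf.getFilePointer(); // first byte of the next line
    }

Each worker would then get a byte range [start, end), align its start like this, and stop after consuming the first entity that ends past end. What I’m not sure about is how to marry this with a streaming JSON parser.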
Basically, I do the following:
    import com.fasterxml.jackson.core.*; // JsonFactory, JsonParser, JsonToken
    import java.io.File;

    JsonFactory f = new JsonFactory();
    JsonParser jp = f.createParser(new File(inputFile));
    while (jp.nextToken() != JsonToken.END_OBJECT) {
        // Fancy stuff with the current token
    }
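One direction I’m considering is positioning a FileInputStream through its channel and handing it to Jackson, along these lines (a sketch; startOffset would come from the newline alignment above, and I haven’t verified that Jackson tolerates starting mid-array):

    import com.fasterxml.jackson.core.JsonFactory;
    import com.fasterxml.jackson.core.JsonParser;
    import java.io.FileInputStream;
    import java.io.IOException;

    // Open a streaming parser at a byte offset that is assumed to be
    // aligned to the start of a line, i.e. the start of an entity.
    static JsonParser parserAt(String path, long startOffset) throws IOException {
        FileInputStream in = new FileInputStream(path);
        in.getChannel().position(startOffset); // the stream shares the channel’s position
        return new JsonFactory().createParser(in);
    }

If that won’t fly, the fallback would be reading entities line by line from the offset and parsing each line as an independent JSON document.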
Is there a way to quickly jump to line #20,000,000?