2

I'm looking for thoughts as to what might be the most efficient way to write/read a large (10,000,000+) set of key/value pairs each consisting of a string of arbitrary length followed by a long integer to/from a file in Java. Any suggestions much appreciated.

kujawk
  • Efficient in terms of cpu or programmer time? – meriton Nov 02 '12 at 21:15
  • Do you expect to have all of these 10'000'000 elements stored in memory at the same time, or are you only reading them from some source, doing limited processing on a few elements in memory and then writing them back to some destination? Because with 10'000'000 elements, you're talking about using at least 250-300 MiB, assuming each of your strings contains just 1 character. If your strings are in the KiB range, you're talking about 250-300 GiB, which clearly calls for a different solution than reading it all into memory and then dumping it back onto the HDD. – LordOfThePigs Nov 02 '12 at 21:29
  • Presumably also depends on how you want to access the data later, or whether it's just for archive? An 'efficient' write isn't much use if you can't access/load the data efficiently afterwards... – DNA Nov 02 '12 at 21:31

3 Answers

2

This is what the Properties API is for:

http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html

Notice that there are methods that operate on InputStreams, OutputStreams, PrintStreams, and different kinds of Readers and Writers.
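As a rough illustration of the API mentioned above, here is a minimal sketch of storing and loading pairs with `Properties.store(Writer, String)` and `Properties.load(Reader)`. The class name `PropertiesSketch`, the helper names `save` and `load`, and the temp file are hypothetical, and the long value is stored as a decimal string because `Properties` only holds strings. Note that this keeps every pair in memory at once, so it is only a sketch for small data sets.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical helper class, not part of the original answer.
class PropertiesSketch {

    // Write all pairs to the file through a buffered writer.
    static void save(Properties props, Path file) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            props.store(out, null); // null means no header comment line
        }
    }

    // Read all pairs back; the whole table ends up in memory.
    static Properties load(Path file) throws IOException {
        Properties props = new Properties();
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            props.load(in);
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("pairs", ".properties");
        Properties props = new Properties();
        props.setProperty("some key", Long.toString(42L)); // values must be strings
        save(props, file);
        long value = Long.parseLong(load(file).getProperty("some key"));
        System.out.println(value);
        Files.delete(file);
    }
}
```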

jahroy
  • Using properties with over 10'000'000 elements, you're going to run into memory problems pretty fast, no? – LordOfThePigs Nov 02 '12 at 21:25
  • @LordOfThePIgs - I guess that's probably true (I kinda glossed over the _10,000,000 elements_ comment). I suppose _BufferedInputStream_ and _BufferedOutputStream_ could help in some cases. – jahroy Nov 02 '12 at 21:28
  • No jahroy, he is talking about the amount of memory a Properties object with 10,000,000 elements will use. There is a 16-byte overhead per object, so 10,000,000 key/value pairs have an overhead of 320,000,000 bytes before you start storing data. This may not fit into a netbook or other small device, for example. On a large server it may not be a problem. – Bruce Martin Nov 02 '12 at 21:38
  • Actually, with the Properties object (which is just a hash table behind the scenes) the overhead is ugly. On a 64-bit machine we are talking about 16B+8B for the long integer (key), 16B+4B+8B for an empty string (value), 16B+8B+8B+8B for a Map.Entry object, and ~4B for the bucket in the HashMap. That's 96B for one property, assuming the value is an empty string. Altogether, that's 960MB of memory (again with only empty values). If the values are something like an XML document or something similar which can easily get into the KiB range, your machine can be considered dead :-) – LordOfThePigs Nov 02 '12 at 21:51
  • If I could downvote my answer I would. I'll leave it up so this discussion can be read. – jahroy Nov 02 '12 at 22:35
  • just curious, apart from the memory issue, can order be maintained in Properties object? – instanceOfObject Nov 03 '12 at 09:08
0

Assuming that key and value are separated by some delimiter.

  1. Read whole row using BufferedReader's readLine() method.
  2. Split the string on your delimiter and put the resulting key and value into your map.
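The two steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: the tab delimiter, the class name `DelimitedReaderSketch`, and the sample file are all hypothetical. Because the value is always a long, the sketch splits on the last delimiter so that a key may itself contain a tab; keys containing line breaks would still break this format.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, not part of the original answer.
class DelimitedReaderSketch {
    static final char DELIMITER = '\t'; // assumed delimiter

    static Map<String, Long> read(Path file) throws IOException {
        Map<String, Long> map = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {   // step 1: read a whole row
                int cut = line.lastIndexOf(DELIMITER); // step 2: split key from value
                if (cut < 0) {
                    throw new IOException("Missing delimiter in line: " + line);
                }
                map.put(line.substring(0, cut), Long.parseLong(line.substring(cut + 1)));
            }
        }
        return map;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("pairs", ".tsv");
        Files.write(file, "alpha\t1\nbeta\t-7\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(read(file));
        Files.delete(file);
    }
}
```

If the data does not fit in the heap, the body of the loop could process each pair as it is read instead of putting it into a map.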

This is the easiest way, and an extremely efficient one (if not the most efficient)!

See commons-io wrapper over it :)

Just don't call flush() explicitly; let the close() method do it :)

instanceOfObject
  • Yes, the guy wants to read 10'000'000 elements. Try to put that in memory in a hash map, and poof goes your heap! – LordOfThePigs Nov 02 '12 at 21:30
  • @jahroy Hope you know the reason by now! Properties object will take a lot of memory. I haven't used them but they are optimized internally and put data randomly. What if they want some order to be maintained? – instanceOfObject Nov 03 '12 at 09:07
0

Using a DataInput/OutputStream wrapping a BufferedInput/OutputStream wrapping a FileInput/OutputStream yields acceptable performance for me. Thanks for all the suggestions.
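For readers who want to see that stream chain spelled out, here is a minimal sketch. The record layout is an assumption, not something stated in the answer: an int record count, then for each pair an int byte length, the key as UTF-8 bytes, and the long value. A length-prefixed byte array is used instead of `writeUTF` because `writeUTF` rejects strings whose encoded form exceeds 65535 bytes. The class name `BinaryPairsSketch` and the temp file are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class; the file layout is an assumed one.
class BinaryPairsSketch {

    static void write(File file, Map<String, Long> pairs) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeInt(pairs.size());            // header: number of records
            for (Map.Entry<String, Long> e : pairs.entrySet()) {
                byte[] key = e.getKey().getBytes(StandardCharsets.UTF_8);
                out.writeInt(key.length);          // length prefix for the key
                out.write(key);
                out.writeLong(e.getValue());
            }
        }                                          // close() flushes the buffer
    }

    static Map<String, Long> read(File file) throws IOException {
        Map<String, Long> pairs = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                byte[] key = new byte[in.readInt()];
                in.readFully(key);                 // read exactly key.length bytes
                pairs.put(new String(key, StandardCharsets.UTF_8), in.readLong());
            }
        }
        return pairs;
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("pairs", ".bin");
        Map<String, Long> pairs = new LinkedHashMap<>();
        pairs.put("alpha", 1L);
        pairs.put("beta", -7L);
        write(file, pairs);
        System.out.println(read(file).equals(pairs));
        file.delete();
    }
}
```

The read loop builds a map only for demonstration; with 10,000,000+ records, each pair could instead be handled inside the loop so that only one record is in memory at a time.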

kujawk