
Possible Duplicate:
What is the fastest way to read a large number of small files into memory?

I have a large number of small text files: each is only 29 bytes, but there are 1000+ of them.

I am trying to read them in using BufferedReader, but it seems to be quite slow considering that all the files are stored locally. We have tried with a very small number of these files (e.g. 12) and the reading is almost instantaneous.
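Roughly, the reading loop looks like this (a simplified sketch; the directory name and the processing step are placeholders):

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;

    public class ReadSmallFiles {
        public static void main(String[] args) throws IOException {
            File dir = new File("data"); // placeholder directory holding the 1000+ files
            for (File f : dir.listFiles()) {
                // one open / read / close cycle per 29-byte file
                BufferedReader reader = new BufferedReader(new FileReader(f));
                try {
                    process(reader.readLine()); // each file holds a single short line
                } finally {
                    reader.close();
                }
            }
        }

        private static void process(String line) {
            System.out.println(line); // placeholder for the real handling
        }
    }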

Is there a more efficient way of reading them, or is there a bottleneck somewhere in the buffer?

Thanks!

Eugene
    Can you zip all the files and use some sort of union-fs-style virtual file system? – Kerrek SB Sep 04 '12 at 06:54
  • @Keppil - but see my Answer for a rebuttal to that Question. – Stephen C Sep 04 '12 at 07:09
  • If you can give us a bit more context about your problem, maybe the geniuses on SO can come up with alternative solutions that speed things up. For example, instead of generating 1000+ small files, can you keep appending to the same file? Or maybe some concurrent programming technique could help boost your performance? – Alvin Sep 04 '12 at 07:25
  • @StephenC: The test results he was showing in the linked question seem pretty impressive though; I think it would be well worth a try. – Keppil Sep 04 '12 at 07:43
  • @Keppil - from the answer "I ran it on the rt.jar class files, extracted to the hard drive, this is under Windows 7 beta x64. That is 16784 files with a total of 94,706,637 bytes.". That's an average size of 5642 bytes, and that is huge compared with the OP's use-case. – Stephen C Sep 04 '12 at 07:58
  • @StephenC: I'd say that both file sizes are really small, and stand by my opinion that it is well worth testing. – Keppil Sep 04 '12 at 08:06
  • @Keppil - I'd be interested to hear what results you get :-) – Stephen C Sep 04 '12 at 10:11

2 Answers


The bottleneck is most likely in opening the files, and there's not a lot you can do about it.

(The Q&A linked in the comments suggests using a memory mapped file. But that directly contradicts the Javadoc, which states that the overheads of setting up the mapping are significant and that you only get a pay-off for large files. And a bit of math shows that the benchmark there uses files with an average size of 5642 bytes ... which is huge compared with your file size of 29 bytes.)

The only way you will get real traction on this is to combine the little files into one big file, using a light-weight format that can be read / loaded efficiently. ZIP is not the best choice unless you turn off compression.
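For example, something along these lines (a rough sketch only; the 4-byte length prefix is just one possible light-weight layout):

    import java.io.BufferedInputStream;
    import java.io.BufferedOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.EOFException;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class RecordFile {

        // Append one small record to the combined file
        // (call this instead of creating a brand new file each time).
        public static void append(String path, byte[] record) throws IOException {
            DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream(path, true)));
            try {
                out.writeInt(record.length); // 4-byte length prefix
                out.write(record);           // the ~29-byte payload
            } finally {
                out.close();
            }
        }

        // Read every record back with a single open / close.
        public static List<byte[]> readAll(String path) throws IOException {
            List<byte[]> records = new ArrayList<byte[]>();
            DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream(path)));
            try {
                while (true) {
                    int len;
                    try {
                        len = in.readInt();
                    } catch (EOFException eof) {
                        break; // no more records
                    }
                    byte[] record = new byte[len];
                    in.readFully(record);
                    records.add(record);
                }
            } finally {
                in.close();
            }
            return records;
        }
    }

The point is that the reader pays the open / seek cost once for the whole data set instead of once per 29-byte record.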

Stephen C
  • Thanks for the suggestion! The other consideration is that the files are dynamically generated, and the information from every file has to be displayed at the moment of creation. I cannot afford to combine all the files into a single large file and display it later. – Eugene Sep 04 '12 at 07:09
  • @user990639 - the fact that they are dynamically generated doesn't mean that you can't combine them. It just means that you need to generate them differently. – Stephen C Sep 04 '12 at 07:54

Opening and closing files is very slow, especially if you have an HDD. A typical HDD has a seek time of about 8 ms, i.e. roughly 125 seeks per second, so just opening 1000+ files can cost on the order of 8 seconds in seeks alone. As the files are so small, reading the content doesn't really matter.

I agree that memory mapped files only make sense if you have

  • a fast disk subsystem, where the bottleneck is not the drive itself, and
  • files that are huge (GB to TB).

BTW: If you used an SSD, it could perform around 80K to 230K IOPS, which is quite a bit faster.

The only other solution is to combine the files. Reading a 64 KB file takes about the same time as reading a 29-byte file, but it can hold thousands of times more data (and requires thousands of times fewer files).
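To illustrate, here is a sketch that reads one combined file in a single pass and slices it into fixed-size 29-byte records (the fixed record size and file layout are assumptions; use a length prefix or a delimiter if your records vary):

    import java.io.DataInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class CombinedReader {
        private static final int RECORD_SIZE = 29; // assumed fixed record size

        public static List<byte[]> readRecords(File combined) throws IOException {
            // One open, one sequential read and one close, instead of 1000+ of each.
            // Assumes the combined file is small enough to hold in memory (e.g. ~64 KB).
            byte[] all = new byte[(int) combined.length()];
            DataInputStream in = new DataInputStream(new FileInputStream(combined));
            try {
                in.readFully(all);
            } finally {
                in.close();
            }
            List<byte[]> records = new ArrayList<byte[]>();
            for (int off = 0; off + RECORD_SIZE <= all.length; off += RECORD_SIZE) {
                byte[] record = new byte[RECORD_SIZE];
                System.arraycopy(all, off, record, 0, RECORD_SIZE);
                records.add(record);
            }
            return records;
        }
    }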

Peter Lawrey