
I have a file containing data that is meaningful only in chunks of a certain size, and that size is prepended to each chunk, e.g.

{chunk_1_size}
{chunk_1}
{chunk_2_size}
{chunk_2}
{chunk_3_size}
{chunk_3}
{chunk_4_size}
{chunk_4}
{chunk_5_size}
{chunk_5}
.
.
{chunk_n_size}
{chunk_n}

The file is really big, ~2 GB, and the chunk size is ~20 MB (which is roughly the buffer size I want to use).

I would like to buffer-read this file to reduce the number of calls to the actual hard disk.

But I am not sure how big a buffer to use, because the chunk size may vary.

Pseudocode of what I have in mind:

while (!EOF) {
    /* the chunk size is an integer, i.e. 4 bytes */
    int chunkSize = readChunkSize();
    /* according to the chunk size, read that many bytes from the file */
    readChunk(chunkSize);
}

If, let's say, I pick an arbitrary buffer size, then I might run into situations like:

  1. The first buffer contains chunkSize_1 + chunk_1 + partialChunk_2 --- I have to keep track of the leftover, then get the remaining part of the chunk from the next buffer and concatenate it to the leftover to complete the chunk
  2. The first buffer contains chunkSize_1 + chunk_1 + partialChunkSize_2 (the chunk size is an integer, i.e. 4 bytes, so let's say I only get two of those bytes from the first buffer) --- I have to keep track of partialChunkSize_2 and then get the remaining bytes from the next buffer to form the integer that actually gives me the next chunk size
  3. The buffer might not even be able to hold one whole chunk at a time --- I have to keep calling read until the first chunk is completely read into memory (see the sketch after this list)
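
For reference, one way to avoid tracking leftovers by hand is to let a DataInputStream over a BufferedInputStream do the bookkeeping: readInt() and readFully() block until they have their full 4 bytes or chunkSize bytes, regardless of where the underlying buffer boundaries fall. A minimal sketch, assuming the size header is a big-endian 4-byte int; the file name, buffer size, and process() handler are placeholders, not part of the question:

    import java.io.BufferedInputStream;
    import java.io.DataInputStream;
    import java.io.EOFException;
    import java.io.FileInputStream;
    import java.io.IOException;

    public class ChunkReader {
        public static void main(String[] args) throws IOException {
            // Hypothetical file name; the 1 MB buffer size is an assumption, not a tuned value.
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream("data.bin"), 1024 * 1024))) {
                while (true) {
                    int chunkSize;
                    try {
                        chunkSize = in.readInt();   // 4-byte chunk size header
                    } catch (EOFException eof) {
                        break;                      // clean end of file
                    }
                    byte[] chunk = new byte[chunkSize];
                    in.readFully(chunk);            // blocks until the whole chunk is read
                    process(chunk);                 // hypothetical per-chunk handler
                }
            }
        }

        private static void process(byte[] chunk) {
            // placeholder: hand the chunk to whatever consumes it
        }
    }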
Nick Div

2 Answers


You don't have much control over the number of calls to the hard disk. There are several layers between you and the hard disk (OS, driver, hardware buffering) that you cannot control.

Set a reasonable buffer size in your Java code (say, 1 MB) and forget about it unless and until you can prove there is a performance issue that is directly related to buffer sizes. In other words, do not fall into the trap of premature optimization.

See also https://stackoverflow.com/a/385529/18157
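
A minimal sketch of what that might look like, with the buffer size as an explicit, tunable constant; the 1 MB value and the file name are illustrative assumptions, not recommendations from this answer:

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    public class BufferedOpen {
        // Illustrative buffer size; change it only if profiling shows it matters.
        private static final int BUFFER_SIZE = 1024 * 1024; // 1 MB

        static InputStream open(String path) throws IOException {
            // The buffer size is independent of the chunk size: higher-level reads
            // simply trigger as many underlying disk reads as they need.
            return new BufferedInputStream(new FileInputStream(path), BUFFER_SIZE);
        }
    }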

Jim Garrison
  • Based on what I have tested, ~20 MB has been really fast both on my machine and on the server machine. I am just not happy with all the band-aids I need between two consecutive reads of the file into the buffer. – Nick Div Feb 09 '17 at 17:13

You might need to do some analysis to get an idea of the average chunk size before choosing a buffer size, since you want to keep reading until a whole chunk is in memory so the data is meaningful. Are you copying the file somewhere else, or sending the data to another place? For some of those activities the Java NIO packages have better facilities than reading the data into JVM buffers. The buffer size should be large enough to read the biggest chunks in as few passes as possible. If you plan to hold the data in memory, reading it through buffers and keeping it around is still a memory-costly operation; buffers can be freed in several ways, e.g. with basic flush operations. Please also check Apache Commons FileUtils for reading/writing data.
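
If you do look at NIO, here is a hedged sketch of the channel-based route; it assumes a big-endian 4-byte size header (an assumption, not something stated in this answer), and the file name and the way each chunk is consumed are placeholders:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class NioChunkReader {

        public static void main(String[] args) throws IOException {
            Path file = Paths.get("data.bin");                // hypothetical file name
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                ByteBuffer sizeBuf = ByteBuffer.allocate(4);
                while (true) {
                    sizeBuf.clear();
                    if (!fill(ch, sizeBuf)) {
                        break;                                // clean end of file
                    }
                    sizeBuf.flip();
                    int chunkSize = sizeBuf.getInt();         // assumes a big-endian header
                    ByteBuffer chunk = ByteBuffer.allocate(chunkSize);
                    if (!fill(ch, chunk)) {
                        throw new IOException("truncated chunk at end of file");
                    }
                    chunk.flip();
                    // hand `chunk` to whatever consumes it (UI, cache, ...)
                }
            }
        }

        // Read until buf is full; return false only if EOF is hit before any byte is read.
        private static boolean fill(FileChannel ch, ByteBuffer buf) throws IOException {
            boolean readAny = false;
            while (buf.hasRemaining()) {
                int n = ch.read(buf);
                if (n < 0) {
                    if (readAny) {
                        throw new IOException("unexpected end of file");
                    }
                    return false;
                }
                readAny = true;
            }
            return true;
        }
    }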

Fryder
  • I don't mind using Java NIO, but I am not too confident with it and would not be able to judge whether it is the right option in this case. I use these chunks of data to display in the UI of a reporting dashboard. – Nick Div Feb 09 '17 at 17:42
  • Consider a solution where you stream all the data to an external app like Elasticsearch and index it for display in the UI; even distributed caches like Hazelcast/Redis (cleared once you are done) will hold a lot of data. It does not matter how many times it hits the hard disk, as the OS, together with the JVM and the Java program, will decide the number of I/O hits. Note: the performance of NIO depends on the underlying OS and the operation you are trying to do. – Fryder Feb 09 '17 at 18:36
  • I appreciate the suggestion, but with the resources I have right now I can't really expand my implementation options. – Nick Div Feb 10 '17 at 15:58