
I am writing a program that processes multiple files, each around 6 GB in size (big log files from a server). But I am only using 25% of my CPU (1 thread out of the 4 available) because I can't split the program into different threads; the work on each file has to be done sequentially.

So I was thinking about processing up to 4 files at the same time, since I have a quad-core CPU, but I am limited by the random-access performance of the HDD.

But in a few days I'll be using a laptop with an SSD and 8 GB of RAM. Would it be possible to map, for instance, the first 1 GB of each file into memory and process them in 4 different threads? When I reach the end of a mapped region, I would map the next 1 GB of that file and continue. Mapping 1 GB into memory should be no problem for an SSD, I suppose, since it gets around 400 MB/s read speed.

I know this can be done using FileChannel, but I'm not sure about mapping only a part of a file.
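Roughly, this is the kind of loop I have in mind per file — a sketch only, where counting newlines stands in for my real sequential processing, and four of these would run in parallel, one per file:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ChunkedScan {
    static final long WINDOW = 1L << 30; // map 1 GB at a time

    // Process one file sequentially, window by window.
    static long scan(Path file) throws IOException {
        long newlines = 0;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            for (long pos = 0; pos < size; pos += WINDOW) {
                long len = Math.min(WINDOW, size - pos);
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (buf.hasRemaining()) {
                    if (buf.get() == '\n') newlines++; // stand-in for real work
                }
            }
        }
        return newlines;
    }
}
```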

Thanks, Siebe

Siebe
    Have you tried memory mapping the whole files? I would have expected the OS to handle reading in appropriately small chunks. – slim Aug 29 '12 at 12:41

3 Answers


When you memory map a file, the file is not actually transferred to memory (that would be the opposite of memory mapping).

Instead, you are given a memory address that the kernel treats specially: when you access it, the kernel loads a page of memory with the file's content. Pages are then unloaded when the OS decides to reclaim memory; you can think of the mapped file somewhat as extended swap space.

All this to say that, provided you have enough address space (that is, a 64-bit OS and JVM), you can map a file bigger than the system's physical memory.
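A sketch of how that looks in Java — with one caveat worth knowing: although FileChannel.map takes a long size, a single MappedByteBuffer is capped at Integer.MAX_VALUE bytes, so a 6 GB file needs several mappings even on a 64-bit JVM:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

public class MapWhole {
    // Cover an arbitrarily large file with <= 2 GB mappings. Nothing is read
    // here; pages are faulted in only when a buffer is actually accessed,
    // and the mappings stay valid after the channel is closed.
    static List<MappedByteBuffer> mapAll(Path file) throws IOException {
        List<MappedByteBuffer> regions = new ArrayList<>();
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            long pos = 0;
            while (pos < size) {
                long len = Math.min(Integer.MAX_VALUE, size - pos);
                regions.add(ch.map(FileChannel.MapMode.READ_ONLY, pos, len));
                pos += len;
            }
        }
        return regions;
    }
}
```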

Flavio
  • Thanks for the explanation, I didn't know that. But will it still give better performance when reading 4 files at once, compared to a normal InputStreamReader, given that you are still reading directly from disk? – Siebe Aug 29 '12 at 14:30
  • With your usage (sequential reading, no writing), I can't think of any theoretical reason to have a meaningful performance difference... you'll have to benchmark it. – Flavio Aug 29 '12 at 16:14

You can use FileChannel to map the entire file into memory at once. However, if you are reading the data sequentially and your processing is non-trivial, using a plain FileInputStream in each of the threads may be much simpler and give you much the same performance.
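For instance, a sketch of the plain-stream variant (counting newlines is a placeholder for the real per-file work):

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelStreams {
    // One plain, buffered, strictly sequential stream per file.
    static long countNewlines(String file) throws IOException {
        long n = 0;
        try (BufferedInputStream in =
                 new BufferedInputStream(new FileInputStream(file), 1 << 20)) {
            int b;
            while ((b = in.read()) != -1) {
                if (b == '\n') n++; // placeholder for the real work
            }
        }
        return n;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4); // one thread per file
        List<Future<Long>> results = new ArrayList<>();
        for (String f : args) {
            results.add(pool.submit(() -> countNewlines(f)));
        }
        for (Future<Long> r : results) {
            System.out.println(r.get());
        }
        pool.shutdown();
    }
}
```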

Peter Lawrey
  • Note that the OP asked about mapping only a section of a file (which is definitely possible), and that the entire files are too big to be mapped with a single buffer. Agreed, however, that 4 FileInputStreams might be best. – parsifal Aug 29 '12 at 12:40
  • You can map multiple 1 GB buffers. Whether you do this at the start or progressively won't make much difference. I would expect doing it from the start would be simpler. – Peter Lawrey Aug 29 '12 at 12:45

You want to use a MappedByteBuffer, which can be retrieved from a FileChannel. See also: Memory-mapped files in Java, and the Javadoc: http://docs.oracle.com/javase/7/docs/api/java/nio/MappedByteBuffer.html

Also, as Peter pointed out, if your processing isn't CPU-intensive, then you might not gain much from moving the file into memory before processing it. You might be better off just doing it in one fell swoop. Copying to memory isn't free, as you know.
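For completeness, retrieving the buffer from the channel is a one-liner (a sketch; this whole-file form only works for files up to Integer.MAX_VALUE bytes — bigger files need one buffer per region):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MapOne {
    // Map an entire (<= 2 GB) file read-only. The mapping remains valid
    // even after the channel is closed.
    static MappedByteBuffer mapReadOnly(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }
}
```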

anio