
I am playing around with different ways to read numbers from a file and how efficient they are. Here is one method I am currently using:

public static long getNumbers1() {
    final long startTime = System.nanoTime();

    try
    {
        String input = new String(Files.readAllBytes(file.toPath()));
        String[] stringNumbers = input.split("\\W");

        int[] numbers = new int[stringNumbers.length];
        for (int index = 1; index < stringNumbers.length; index++)
        {
            numbers[index] = Integer.parseInt(stringNumbers[index]);
        }
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }

    final long endTime = System.nanoTime();
    System.out.println(endTime + " | " + startTime + " | " + (endTime - startTime));
    return endTime - startTime;
}

file is declared at a global scope:

private static File file = new File(System.getProperty("user.dir") + "/data/numtest.txt");

This method is then run by the following means:

for (int index = 0; index < 10; index++)
{
    getNumbers1();
}

Printed in the console is the following:

15395409456370 | 15395397323226 | 12133144
15395410416178 | 15395410090933 | 325245
15395411137449 | 15395410835563 | 301886
15395411806342 | 15395411515427 | 290915
15395412389234 | 15395412097611 | 291623
15395412780660 | 15395412529737 | 250923
15395413168193 | 15395412912315 | 255878
15395413538738 | 15395413302679 | 236059
15395413948214 | 15395413665792 | 282422
15395414329376 | 15395414083762 | 245614

You will notice that the very first 'run time' value (the third value) is significantly greater in the first reading of the file than in subsequent readings. No matter how many times I run the program, or how many iterations I give the for loop (100 or 100000), the first value is always much greater. Why is this happening? Can I prevent it from happening? Is Java being smart and storing the values from the file, so it isn't actually re-reading the file each time?

I am very curious...

Hurricane Development

2 Answers


That would be disk caching at work. The first read is coming off the disk. The second read is coming out of the disk cache.

I've done performance testing on algorithms in the past. File IO and caching always get in the way or affect the results. You need to think about what sort of performance you're measuring.

If you are testing a complete system, you would keep the file IO in there, but you need to flush caches to get consistent results.

If you are testing an algorithm, keep all IO out of your timers.

Move your `startTime = System.nanoTime()` call to after the file has been read.
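A minimal sketch of that suggestion: the caller does the file read up front, and the timer only ever sees the parsing work. The class and method names here are illustrative, not from the original post.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ParseTimer {
    // Times only the split-and-parse work; the caller has already done the IO,
    // so neither the disk nor the OS cache can distort this measurement.
    public static long timeParse(String input) {
        final long startTime = System.nanoTime();

        String[] stringNumbers = input.split("\\W");
        int[] numbers = new int[stringNumbers.length];
        for (int index = 1; index < stringNumbers.length; index++) {
            numbers[index] = Integer.parseInt(stringNumbers[index]);
        }

        final long endTime = System.nanoTime();
        return endTime - startTime;
    }

    public static void main(String[] args) throws IOException {
        // Read once, up front, outside the timed region.
        String input = new String(Files.readAllBytes(
                Paths.get(System.getProperty("user.dir"), "data", "numtest.txt")));
        for (int i = 0; i < 10; i++) {
            System.out.println(timeParse(input));
        }
    }
}
```

With the IO hoisted out like this, the per-iteration numbers should be far more consistent, since each iteration does identical in-memory work.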

SteveS
  • So what the above people are saying about the JIT compiler doesn't have much of an effect (or isn't the main factor)? It is more that the file data is cached? I am trying to test which method reads the file fastest; in other tests I use the classes `Scanner, FileReader, BufferedInputStream`, so I need to time the I/O. – Hurricane Development Oct 23 '14 at 18:33
  • Ah, you are testing file IO. In that case you should make 3 copies of the file, so each method is reading from a different disk location. Also, make sure your files are really big. Otherwise the overhead of just opening a file will overshadow your results. – SteveS Oct 23 '14 at 18:57

File IO uses a technique similar to demand paging to load parts of a file into physical memory. A mapping of disk file pages to physical memory pages is maintained by the operating system's paging subsystem.

When loading for the first time, page faults are generated because the requested file pages are not yet in physical memory. When you load the file again, some pages are found in physical memory and do not need to be re-read from disk. If you make changes to the pages in physical memory, a page-out ensures the dirty pages are flushed to disk.

You may have noticed this elsewhere too: opening a file for the first time in your favorite text editor takes a while, but if you close the file and re-open it, it loads faster. This is because the file's pages are already in physical memory.

The same thing happens when you re-read a file via Java. It's the OS that optimizes the re-reads, not Java.
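You can see the effect with a small sketch like the one below: it reads the same file twice and reports both timings. The class and method names are hypothetical, and the exact numbers depend entirely on your machine and the state of the page cache, so no fixed output is shown; typically the first read is slower because it may come from disk, while the second is served from the OS page cache.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CacheDemo {
    // Reads the whole file and returns the elapsed wall-clock time in nanoseconds.
    public static long timedRead(Path path) throws IOException {
        long start = System.nanoTime();
        Files.readAllBytes(path);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("cache-demo", ".txt");
        Files.write(tmp, "0 1 2 3 4 5".getBytes());

        long first = timedRead(tmp);   // possibly served from disk
        long second = timedRead(tmp);  // likely served from the OS page cache
        System.out.println("first:  " + first + " ns");
        System.out.println("second: " + second + " ns");

        Files.delete(tmp);
    }
}
```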

Manish Maheshwari