
I'm testing the processing of a large file (10,000,100 rows) with Java.

I wrote a piece of code that reads the file and spawns a specified number of threads (at most equal to the number of CPU cores), which then print the rows of the file to standard output.

The Main class looks like this:

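import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Main
{
    // Thread.sleep and awaitTermination throw InterruptedException; the file access throws IOException
    public static void main(String[] args) throws IOException, InterruptedException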
    {
        int maxThread;
        ArrayList<String> linesForWorker = new ArrayList<String>();


        if ("MAX".equals(args[1]))
            maxThread = Runtime.getRuntime().availableProcessors();
        else
            maxThread = Integer.parseInt(args[1]);

        ExecutorService executor = Executors.newFixedThreadPool(maxThread);

        String readLine;
        Thread.sleep(1000L);

        long startTime = System.nanoTime();

        BufferedReader br = new BufferedReader(new FileReader(args[0]));


        do
        {
            readLine= br.readLine();

            if ("X".equals(readLine))
            {
                executor.execute(new WorkerThread((ArrayList) linesForWorker.clone()));

                linesForWorker.clear(); // Wrote to avoid storing a list with ALL the lines of the file in memory
            }
            else
            {
                linesForWorker.add(readLine);
            }

        }
        while (readLine!= null);

        executor.shutdown();
        br.close();

        if (executor.awaitTermination(1L, TimeUnit.HOURS))
            System.out.println("END\n\n");

        long endTime = System.nanoTime();

        long durationInNano = endTime - startTime;

        System.out.println("Duration in hours:" + TimeUnit.NANOSECONDS.toHours(durationInNano));
        System.out.println("Duration in minutes:" + TimeUnit.NANOSECONDS.toMinutes(durationInNano));
        System.out.println("Duration in seconds:" + TimeUnit.NANOSECONDS.toSeconds(durationInNano));
        System.out.println("Duration in milliseconds:" + TimeUnit.NANOSECONDS.toMillis(durationInNano));

    }
}

The WorkerThread class is structured as follows:

import java.util.List;

class WorkerThread implements Runnable
{
    private List<String> linesToPrint;

    public WorkerThread(List<String> linesToPrint) { this.linesToPrint = linesToPrint; }

    public void run()
    {
        for (String lineToPrint : this.linesToPrint)
        {
            System.out.println(Thread.currentThread().getName() + ": " + lineToPrint);
        }

        this.linesToPrint = null; // Wrote to help garbage collector know I don't need the object anymore
    }
}

I ran the application with both 1 and "MAX" (i.e. the number of CPU cores, which is 4 in my case) as the maximum number of threads in the FixedThreadPool, and observed:

  • An execution time of about 40 minutes when executing the application with 1 single thread in the FixedThreadPool.
  • An execution time of about 44 minutes when executing the application with 4 threads in the FixedThreadPool.

Could someone explain this strange (at least to me) behaviour? Why didn't multithreading help here?

P.S. My machine has an SSD.

EDIT: I modified the code so that each thread now creates a file on the SSD and writes its set of lines to it. The execution time has dropped to about 5 s, but the 1-thread version of the program still runs in about 5292 ms, while the multithreaded (4 threads) version runs in about 5773 ms.

Why does the multithreaded version still take longer? Could it be that every thread, even when writing its own file, has to wait for the other threads to release the SSD before it can access it and write?
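
The modified worker is not shown above. A minimal sketch of what such a per-batch file writer might look like, assuming a hypothetical `FileWorker` class and one output path per batch (both are illustrative, not the code that was actually measured):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical worker: writes one batch of lines to its own file.
class FileWorker implements Runnable {
    private final List<String> lines;
    private final Path target; // assumed: a distinct path for every batch

    FileWorker(List<String> lines, Path target) {
        this.lines = lines;
        this.target = target;
    }

    @Override
    public void run() {
        // The BufferedWriter groups many small writes into fewer large ones.
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (String line : lines) {
                out.write(line);
                out.newLine();
            }
        } catch (IOException e) {
            // Runnable.run() cannot throw checked exceptions, so wrap it.
            throw new UncheckedIOException(e);
        }
    }
}
```

It could be submitted with something like `executor.execute(new FileWorker(batch, Paths.get("out-" + batchNumber + ".txt")))`, where the file naming scheme is again an assumption. Even in this shape, a single thread still reads the input and hands out the batches, so the workers may well spend most of their time waiting for it.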

ela
  • You read the file's lines with one thread. Reading a file's content from multiple threads, from either an HDD or an SSD, won't benefit you (an HDD would crumble and perform far worse, though), so that's fine. `System.out.println` is a synchronized function, which may massively bottleneck your processing threads. How many lines are we talking about? 40 minutes seems extreme. You are probably IO bound or suffering from contention on `System.out.println` – roookeee Jul 04 '19 at 09:48
  • Note also that cloning your ArrayList is unnecessary overhead. Just pass the ArrayList as is, and assign a new ArrayList instance instead of clear()ing the old one. You might also create your ArrayList with a larger initial capacity than the default of 10. – Gyro Gearless Jul 04 '19 at 09:51
  • The only part of the code you posted that is multi-threaded is the printing of the file contents and even that is not really multi-threaded since you aren't launching the threads at the same time. It could be that a given thread completes before the next thread even starts. In this case the overhead of launching extra threads would only add to the code execution time. – Abra Jul 04 '19 at 09:52
  • Hi, thanks for answering! The file contains 10,000,100 rows, as I wrote at the beginning of the question – ela Jul 04 '19 at 09:53
  • I would suggest buffering the console output (a naive approach is described [here](https://stackoverflow.com/a/31118560/4934324), but I would do it in memory with some sort of thread-safe list) before writing to `System.out`, or logging your data another way, with an appropriate logging framework that e.g. supports asynchronous logging. If your code really is 1:1 what you posted, it should not take as long as you described (unless you are running on some really old machine or have a broken SSD). My suggestion: don't use `System.out.println` in production – roookeee Jul 04 '19 at 09:59
  • What made you think it would be faster? NB Your read loop will enqueue a null pointer, which will ultimately throw NPE, at end of file. You need a `while`, not a `do-while`. – user207421 Jul 04 '19 at 10:08
  • @user207421 I was trying to make the processing multithreaded, so that while one thread processes a bunch of rows (in this test the processing is only printing the row to standard output), another one can process another bunch, etc. P.S. No, it won't. When readLine is null, it only passes through "X".equals and linesForWorker.add(), which are null-safe methods – ela Jul 04 '19 at 10:34
  • What made you think multiple threads printing to the same output would be faster than one? And yes it will indeed *enqueue* a null, which is what I actually said, and I also said that it will *subsequently* throw an NPE when processed, which means *after* the enqueuing, obviously, which is also therefore after the `equals()` line. – user207421 Jul 04 '19 at 11:48
  • Thank you for pointing out that the threads can't actually write to standard output in parallel, which means the writing part of the application has to change. Regarding the NPE: no, again, it won't be thrown. When readLine is null, the check "X".equals(readLine) fails, and the list containing only null won't be processed. – ela Jul 04 '19 at 13:15
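
The suggestions from the comments can be combined into one rough sketch: a plain `while` loop so that `null` is never added to a batch, a fresh pre-sized list per batch instead of `clone()` plus `clear()`, and one print call per batch instead of one per line. The class name `BatchPrinter`, the `INITIAL_CAPACITY` value and the `PrintStream` parameter are assumptions made for illustration. Submitting the lines that follow the last "X" marker is also an addition that the posted code does not have.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintStream;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class BatchPrinter {
    // Assumed value; tune it to the typical number of lines between "X" markers.
    private static final int INITIAL_CAPACITY = 1024;

    // Reads all lines, submits one task per batch, returns the number of tasks submitted.
    static int process(BufferedReader reader, ExecutorService executor, PrintStream out)
            throws IOException {
        int submitted = 0;
        List<String> batch = new ArrayList<>(INITIAL_CAPACITY);
        String line;
        // The null returned at end of file ends the loop and is never stored.
        while ((line = reader.readLine()) != null) {
            if ("X".equals(line)) {
                submit(batch, executor, out);
                submitted++;
                // Hand the old list to the worker and start a new one.
                batch = new ArrayList<>(INITIAL_CAPACITY);
            } else {
                batch.add(line);
            }
        }
        if (!batch.isEmpty()) { // lines after the last "X" marker
            submit(batch, executor, out);
            submitted++;
        }
        return submitted;
    }

    private static void submit(List<String> batch, ExecutorService executor, PrintStream out) {
        executor.execute(() -> {
            String prefix = Thread.currentThread().getName() + ": ";
            StringBuilder sb = new StringBuilder();
            for (String l : batch) {
                sb.append(prefix).append(l).append(System.lineSeparator());
            }
            out.print(sb); // one call on the shared stream per batch, not per line
        });
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        BufferedReader reader = new BufferedReader(new StringReader("a\nb\nX\nc\n"));
        process(reader, executor, System.out);
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```

This does not make the printing itself parallel, since all workers still share one output stream. It only reduces how often they compete for it.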
