3

I am implementing a class that should receive a large text file. I want to split it into chunks, with each chunk held by a different thread that will count the frequency of each character in that chunk. I expect that starting more threads will give better performance, but it turns out performance is getting poorer. Here's my code:

public class Main {

    public static void main(String[] args) 
    throws IOException, InterruptedException, ExecutionException, ParseException  
    {

        // save the current run's start time
        long startTime = System.currentTimeMillis();

        // create options 
        Options options = new Options();
        options.addOption("t", true, "number of threads to start");

        // variables to hold options 
        int numberOfThreads = 1;

        // parse options
        CommandLineParser parser = new DefaultParser();
        CommandLine cmd;
        cmd = parser.parse(options, args);
        String threadsNumber = cmd.getOptionValue("t");
        numberOfThreads = Integer.parseInt(threadsNumber);

        // read file
        RandomAccessFile raf = new RandomAccessFile(args[0], "r");
        MappedByteBuffer mbb 
            = raf.getChannel().map(FileChannel.MapMode.READ_ONLY, 0, raf.length());

        ExecutorService pool = Executors.newFixedThreadPool(numberOfThreads);
        Set<Future<int[]>> set = new HashSet<Future<int[]>>();

        long chunkSize = raf.length() / numberOfThreads;
        byte[] buffer = new byte[(int) chunkSize];

        while(mbb.hasRemaining())
        {
            int remaining = buffer.length;
            if(mbb.remaining() < remaining)
            {
                remaining = mbb.remaining();
            }
            mbb.get(buffer, 0, remaining);
            String content = new String(buffer, "ISO-8859-1");
            Callable<int[]> callable = new FrequenciesCounter(content);
            Future<int[]> future = pool.submit(callable);
            set.add(future);

        }

        raf.close();

        // let's assume we will use extended ASCII characters only
        int alphabet = 256;

        // hold how many times each character is contained in the input file
        int[] frequencies = new int[alphabet];

        // sum the frequencies from each thread
        for(Future<int[]> future: set)
        {
            for(int i = 0; i < alphabet; i++)
            {
                frequencies[i] += future.get()[i];
            }
        }

        pool.shutdown();
    }

}

// helper class for multithreaded frequency counting
class FrequenciesCounter implements Callable<int[]>
{
    private int[] frequencies = new int[256];
    private char[] content;

    public FrequenciesCounter(String input)
    {
        content = input.toCharArray();
    }

    public int[] call()
    {
        System.out.println("Thread " + Thread.currentThread().getName() + " started");

        for(int i = 0; i < content.length; i++)
        {
            frequencies[(int)content[i]]++;
        }

        System.out.println("Thread " + Thread.currentThread().getName() + " finished");

        return frequencies;
    }
}
barni
  • There are only so many bytes per second your hardware can deliver from the disk. It does not matter how much you request to read. – Henry Jun 24 '17 at 08:33
  • 1
    The disk isn't multithreaded. Your expectations are awry. – user207421 Jun 24 '17 at 08:33
  • So, if I save each chunk as a different file and then pass each file to a thread, should it get better? – barni Jun 24 '17 at 08:35
  • No. Not if they are still on the same disk. – Henry Jun 24 '17 at 08:38
  • 1
    @barni no. You would still have a single disk. That would probably make things worse. – JB Nizet Jun 24 '17 at 08:38
  • hmm.. and there's no way to make it faster on a single machine? – barni Jun 24 '17 at 08:41
  • @barni only if you have multiple physical drives. You can, however, read the whole file in one thread and process it with multiple threads. – Matej Kormuth Jun 24 '17 at 08:42
  • In any case the only thing you're multithreading is the search. You are still loading the file all in the main thread, and this will be where the time is going. – user207421 Jun 24 '17 at 08:42
  • 1
    So a bit of theory: if your code itself has no bottlenecks other than the overhead of waiting for the network or hard disk or something, it's said to be **I/O bound**. At that point, the only way to get it to go faster is to improve the hardware attached to the machine itself. That, or start horizontally scaling to utilize more independent machines. – Qix - MONICA WAS MISTREATED Jun 24 '17 at 08:53

2 Answers

3

As suggested in the comments, you will (usually) not get better performance by reading from multiple threads. Rather, you should process the chunks you have read on multiple threads. Processing usually involves some blocking I/O operations (saving to another file? saving to a database? an HTTP call?), and your performance will get better if you process on multiple threads.

For processing, you may use an ExecutorService (with a sensible number of threads). Use java.util.concurrent.Executors to obtain an instance of java.util.concurrent.ExecutorService.

Having an ExecutorService instance, you may submit your chunks for processing. Submitting a chunk does not block; the ExecutorService starts to process each chunk on a separate thread (the details depend on the configuration of the ExecutorService). You may submit instances of Runnable or Callable.

Finally, after you submit all items, call shutdown() on your ExecutorService and then awaitTermination. It waits until processing of all submitted items has finished or the timeout expires. If awaitTermination returns false, call shutdownNow() to abort processing (otherwise the pool may hang indefinitely on some rogue task).
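Putting that flow together, here is a minimal sketch (the class name, the hard-coded chunks, and the 10-second timeout are illustrative assumptions, not part of the question's code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class ChunkProcessing {

    // Count character frequencies in one chunk; 256 = extended ASCII, as in the question.
    static int[] countFrequencies(String chunk) {
        int[] freq = new int[256];
        for (char c : chunk.toCharArray()) {
            freq[c]++;
        }
        return freq;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<int[]>> futures = new ArrayList<>();

        // Submitting does not block; each chunk is processed on a pool thread.
        for (String chunk : new String[] {"aab", "ba"}) {
            futures.add(pool.submit(() -> countFrequencies(chunk)));
        }

        // Merge the per-chunk results; Future.get() waits for each task to finish.
        int[] total = new int[256];
        for (Future<int[]> f : futures) {
            int[] partial = f.get();
            for (int i = 0; i < 256; i++) {
                total[i] += partial[i];
            }
        }

        // Reject new tasks, wait for the pool to drain, abort if it hangs.
        pool.shutdown();
        if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
            pool.shutdownNow(); // interrupt anything still running
        }

        System.out.println(total['a'] + "," + total['b']); // prints 3,2
    }
}
```

Note that shutdown() lets already-submitted tasks finish while rejecting new ones; shutdownNow() additionally interrupts running tasks, which is why it is kept as the fallback.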

Bartosz Bilicki
  • 1
    And if a single processing thread can keep up with the read speed, multithreading is pointless complication. – Kevin Krumwiede Jun 24 '17 at 20:41
  • He's already reading in one thread and processing in multiple threads, and he's already using an ExecutorService. This doesn't seem to answer the question. – Warren Dew Jun 24 '17 at 20:48
1

Your program is almost certainly limited by the speed of reading from disk. Using multiple threads does not help with this since the limit is a hardware limit on how fast the information can be transferred from disk.

In addition, the use of both RandomAccessFile and a subsequent buffer likely results in a small slowdown, since you are moving the data in memory after reading it in but before processing, rather than just processing it in place. You would be better off not using an intermediate buffer.

You might get a slight speedup by reading from the file directly into the final buffers and dispatching those buffers to be processed by threads as they are filled, rather than waiting for the entire file to be read before processing. However, most of the time would still be used by the disk read, so any speedup would likely be minimal.
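A rough sketch of that single-reader approach, reading sequentially in one thread and handing each filled buffer to the pool as soon as it is ready. The class name, `countFile` helper, chunk size, and the ByteArrayInputStream standing in for the real file are all illustrative assumptions:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SingleReaderCounter {

    static final int ALPHABET = 256;

    // One thread reads sequentially; workers count each chunk as soon as it arrives.
    static int[] countFile(InputStream in, int chunkSize, int threads)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<int[]>> futures = new ArrayList<>();
        byte[] buffer = new byte[chunkSize];
        int read;
        while ((read = in.read(buffer, 0, chunkSize)) != -1) {
            // Copy so each task owns its bytes; the buffer is reused by the reader.
            final byte[] chunk = Arrays.copyOf(buffer, read);
            futures.add(pool.submit(() -> {
                int[] freq = new int[ALPHABET];
                for (byte b : chunk) {
                    freq[b & 0xFF]++;
                }
                return freq;
            }));
        }
        // Merge partial counts; get() blocks until each chunk is done.
        int[] total = new int[ALPHABET];
        for (Future<int[]> f : futures) {
            int[] partial = f.get();
            for (int i = 0; i < ALPHABET; i++) {
                total[i] += partial[i];
            }
        }
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "abracadabra".getBytes("ISO-8859-1");
        int[] freq = countFile(new ByteArrayInputStream(data), 4, 2);
        System.out.println(freq['a'] + " " + freq['b']); // prints 5 2
    }
}
```

Even so, the reader thread paces everything: the workers finish their counting far faster than the disk can supply new chunks, which is why the expected speedup is minimal.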

Warren Dew