
I am trying to read a file and add each line to a list.

Simple drawing explaining the goal

Main class -

import java.io.File;
import java.io.IOException;

public class SimpleTreadPoolMain {

  public static void main(String[] args) {
    ReadFile reader = new ReadFile();
    File file = new File("C:\\myFile.csv");
    try {
        reader.readFile(file);
    } catch (IOException e) {
        e.printStackTrace();
    }
  }
}

Reader class -

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ReadFile {

    ExecutorService executor = Executors.newFixedThreadPool(5); // creating a pool of 5 threads

    List<String> list = new ArrayList<>();

    void readFile(File file) throws IOException {
        try (BufferedReader br = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = br.readLine()) != "") {
                Runnable saver = new SaveToList(line, list);
                executor.execute(saver); // calling execute method of ExecutorService
            }
        }

        executor.shutdown();
        while (!executor.isTerminated()) { }
    }
}

Saver class -

public class SaveToList<E> implements Runnable {

    List<E> myList;

    E line;

    public SaveToList(E line, List<E> list) {
        this.line = line;
        this.myList = list;
    }

    @Override
    public void run() {
        // modify the line
        myList.add(line);
    }
}

I tried to have many saver threads add to the same list, instead of a single thread adding to it one by one. I want to use threads because I need to modify the data before adding it to the list, and I assume that modification takes some time, so parallelizing that part should reduce the time consumption, right?

But this doesn't work: I am unable to get back a single global list that includes all the values from the file. I want to have only one global list of values from the file, so the code definitely needs to change. If someone can guide me, it would be greatly appreciated.

Even though adding one by one in a single thread would work, using a thread pool would make it faster, right?

Hasith
  • How do you think you can add to a list, other than one-by-one? One thing has to go in; then the next; then the next. – Andy Turner Apr 28 '17 at 12:19
  • Could you elaborate on what doesn't work? _But this doesn't work._ is a bit too general – Robin Topper Apr 28 '17 at 12:20
  • So are you saying that using a few threads won't make a difference in the time consumption of the insertion part? – Hasith Apr 28 '17 at 12:21
  • that's not what he is saying at all – Tim Apr 28 '17 at 12:23
  • @RobinTopper I am unable to return a global list which includes all the values from the file. Obviously the code I have written doesn't seem to support a global list. I cannot figure out how to implement it – Hasith Apr 28 '17 at 12:27
  • @AndyTurner I'm not sure if I understand what you are saying. I get that in a list it should be inserted in order. If I can't use threads to make that faster, can I use another collection like Maps or Sets? – Hasith Apr 28 '17 at 12:30
  • Note that `while ((line = br.readLine()) != "") {` won't do what you think. See [How do I compare strings in Java?](http://stackoverflow.com/questions/513832/how-do-i-compare-strings-in-java) – Andy Turner Apr 28 '17 at 12:32
  • @AndyTurner thanks for pointing that out. So `"".equals(line = br.readLine())` would be correct? – Hasith Apr 28 '17 at 12:49
  • @Hasith just use `isEmpty()`. – Andy Turner Apr 28 '17 at 13:03

2 Answers

3

Using multiple threads won't speed anything up here.

You are:

  • Reading a line from a file, serially.
  • Creating a runnable and submitting it into a thread pool
  • The runnable then adds things into a list

Given that you're using an ArrayList, you need to synchronize access to it, because you're mutating it from multiple threads. So, you are adding things into the list serially.

But even without the synchronization, the time taken for the IO will far exceed the time taken to add the string into the list. And adding multithreading is just going to slow it down more, because it has to do the work of constructing the runnable, submitting it to the thread pool, scheduling it, etc.
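For completeness: if the pool were kept anyway, the shared list would need a thread-safe wrapper and a proper shutdown wait instead of a busy loop. A minimal sketch (the class name and the in-memory input list are illustrative, not from the question):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SyncListDemo {

    // Collect lines into one shared list from a pool of worker threads.
    static List<String> collect(List<String> lines) throws InterruptedException {
        // synchronizedList makes concurrent add() calls safe
        List<String> list = Collections.synchronizedList(new ArrayList<>());
        ExecutorService executor = Executors.newFixedThreadPool(5);
        for (String line : lines) {
            executor.execute(() -> list.add(line));
        }
        executor.shutdown();
        // block until all tasks finish, instead of spinning on isTerminated()
        executor.awaitTermination(1, TimeUnit.MINUTES);
        return list;
    }
}
```

Note that this only makes the code correct, not faster; the insertion order is also no longer guaranteed.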

It's simpler just to miss out the whole middle step:

  • Read a line from a file, serially.
  • Add the line to the list, serially.

So:

try (BufferedReader br = new BufferedReader(new FileReader(file))) {
    String line;
    // stop at end of file (readLine() returns null) as well as at an empty line
    while ((line = br.readLine()) != null && !line.isEmpty()) {
        list.add(line);
    }
}
Andy Turner
  • +1 for the answer. Do you have any idea how to make it faster, since I am reading very large CSV files? Before adding to the list I am modifying the values. If I use a HashMap or a Set, would it make sense to use threads? – Hasith Apr 28 '17 at 12:34
  • "If I use a HashMap or a Set" Again, you need to synchronize them. – Andy Turner Apr 28 '17 at 12:35
  • This doesn't look like it answers the question, but then I don't really understand what it is that the question asks. And the things that you are saying are technically correct. Although you could speed up IO by using multiple channels and random access files, I believe. – M. Prokhorov Apr 28 '17 at 12:35
  • @Hasith, if you are reading very large files, I'd honestly be more concerned with storing all their contents in program memory, not with speed with which you can do it. A "very large file" to me far exceeds in size the amount of memory any program can have. – M. Prokhorov Apr 28 '17 at 12:38
  • @M.Prokhorov Before adding to the list I need to change the line read from the file. So, assuming that this modification takes a little time, I feel like using multiple threads to do the modifications and then insert into the list would be faster than modifying one line at a time. Hope this is clearer.. :) File sizes are less than 2GB – Hasith Apr 28 '17 at 12:41
  • @Hasith, please edit the question to mention that you're modifying lines, not just adding them to a list. Parallelizing the modification may make more sense than parallelizing adding. –  Apr 28 '17 at 12:47
  • @Hasith To figure out if parallelizing makes sense, read in 1GB worth of data (make sure it fits in your physical memory, no paging), then apply your modification code. Time the modification. Compare it with the time it takes to read the file. If the modification time is comparable to the reading time, you can think about parallelization. Otherwise, don't bother. –  Apr 28 '17 at 12:49
  • The theory behind Arkadiy's point is called [Amdahl's Law](https://en.wikipedia.org/wiki/Amdahl%27s_law), which can be used to determine the theoretically-possible speed up by parallelizing. – Andy Turner Apr 28 '17 at 12:51
  • And I would guess that there is almost no way modifying a single in-memory `String` takes comparable time to reading that same `String` from a file. That would take a very particular set of operations, or, say, several supplementary blocking I/O operations done on something like a database. – M. Prokhorov Apr 28 '17 at 12:53
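The measurement the comments suggest can be sketched as follows; `modify` here is a hypothetical stand-in for whatever per-line transformation is actually applied:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class TimingCheck {

    // Hypothetical stand-in for the real per-line transformation
    static String modify(String line) {
        return new StringBuilder(line).reverse().toString();
    }

    // Returns {readMillis, modifyMillis} for the given file, so the two
    // costs can be compared before deciding to parallelize.
    static long[] compare(Path file) throws IOException {
        long t0 = System.nanoTime();
        List<String> lines = Files.readAllLines(file);   // pure I/O
        long readMs = (System.nanoTime() - t0) / 1_000_000;

        t0 = System.nanoTime();
        for (String line : lines) {
            modify(line);                                // pure CPU work
        }
        long modifyMs = (System.nanoTime() - t0) / 1_000_000;

        return new long[] { readMs, modifyMs };
    }
}
```

Only if the modify time is comparable to the read time is parallelizing the modification likely to pay off (Amdahl's Law, as noted above).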
0

You should in fact test whether multithreading is worth it in your application: compare how much time it takes to read the whole file without doing any processing on the rows with the time it takes to process the whole file serially.

If your processing is not too complex, my guess is that multithreading is not worth it.

If you find that the processing takes much more time, then you can think about using one or more threads to do the computations.

If so, you could use Futures to process batches of input strings, or you could use a thread-safe Queue to hand strings off to another thread.

private static final int BATCH_SIZE = 1000;

public static void main(String[] args) throws IOException {

    ExecutorService pool = Executors.newFixedThreadPool(8);
    List<Future<List<Object>>> results = new LinkedList<>();

    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream("big_file.csv"), "utf-8"))) {
        String line;
        List<String> batch = new ArrayList<>(BATCH_SIZE);
        while ((line = reader.readLine()) != null) {
            batch.add(line);
            if (batch.size() >= BATCH_SIZE) {
                results.add(noWaitExec(batch, pool));
                batch = new ArrayList<>(BATCH_SIZE);
            }
        }
        if (!batch.isEmpty()) {
            results.add(noWaitExec(batch, pool));
        }
    }

    for (Future<List<Object>> future : results) {
        try {
            List<Object> processed = future.get();
            // Use your results here
        } catch (Exception e) {
            // Manage this....
        }
    }
    pool.shutdown();
}

private static Future<List<Object>> noWaitExec(final List<String> batch, ExecutorService pool) {
    return pool.submit(new Callable<List<Object>>() {
        public List<Object> call() throws Exception {
            List<Object> result = new ArrayList<>(batch.size());
            for (String string : batch) {
                result.add(process(string));
            }
            return result;
        }
    });
}

private static Object process(String string) {
    // Your process ....
    return null;
}

There are many other possible solutions (Observables, parallel Streams, Pipes, CompletableFutures... you name it). Still, I think most of the time is spent reading the file; just using a BufferedInputStream with a big enough buffer to read the file could cut your times more than parallel computing would.
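The parallel-stream route mentioned above can be sketched in a few lines; `modify` is again a hypothetical stand-in for the real per-line work:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ParallelLines {

    // Hypothetical stand-in for the real per-line transformation
    static String modify(String line) {
        return line.trim().toUpperCase();
    }

    static List<String> processFile(Path file) throws IOException {
        // Files.lines reads lazily; the try-with-resources closes the file.
        try (Stream<String> lines = Files.lines(file)) {
            // The I/O stays serial; only the map() work is spread across
            // the common fork-join pool. collect(toList()) keeps the
            // original encounter order.
            return lines.parallel()
                        .map(ParallelLines::modify)
                        .collect(Collectors.toList());
        }
    }
}
```

This keeps one global result list without any manual synchronization, since the stream framework handles the merging.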

minus