
I've been looking for a solution for quite a while, but I'm still struggling with concurrency and parallelization.

Background: There's an ETL process in which I receive a large CSV file (up to more than a million rows). In production there will be live updates, too. I want to spell check each row. For that I use an adapted LanguageTool. The check method (with my customization inside) takes quite a while, and I want to speed it up.

One aspect is of course the method itself, but I also want to simply check multiple rows at a time. The order of the rows is not important. The result is the corrected text and it should be written to a new csv file for further processing.

I found that ExecutorService might be a reasonable choice, but since I'm not that familiar with it, some help would be appreciated.

This is how I use it so far in the ETL process:

private static SpellChecker spellChecker;
static {
    SpellChecker tmp = null;
    try {
        tmp = new SpellChecker(...);
    } catch (Exception e) {
        e.printStackTrace();
    }
    spellChecker = tmp;
}

public static String spellCheck(String input) {
    String output = input.replace("</li>", ".");
    output = searchAVC.removeHtml(output);
    try {
        output = spellChecker.correctText(output);
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return output;
}

My spellChecker comes from a custom library, and I create a single static instance of it (because instantiating LanguageTool takes some time). I want to parallelize the execution of spellCheck.

I've already read material such as https://www.airpair.com/java/posts/parallel-processing-of-io-based-data-with-java-streams as well as the questions "What is the easiest way to parallelize a task in java?" and "Write to text file from multiple threads?".

I don't really know how to combine all this information. What do I have to consider when reading the file? When writing the file? When processing the rows?

Paprikamann

1 Answer


Create a Reader class whose responsibility is reading from the file. Create a Writer class whose responsibility is writing to the file. Create a Processor class whose responsibility is processing the rows. Then create a partitioner whose responsibility is to read the input chunk by chunk and dispatch each batch of rows to a reader; the reader uses the processor to process the batch and sends the processed rows to the writer. To run all of this in a multi-threaded environment, create a thread pool and execute the batches on it.
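The steps above can be sketched roughly as follows. This is only a minimal illustration under some assumptions: the names BatchPipeline, readBatch and run are hypothetical, the processor is passed in as a UnaryOperator<String> standing in for something like spellCheck, and the processor is assumed to be safe to call from several threads at once. If the spell checker is not thread-safe, one instance per worker thread (for example via ThreadLocal) would probably be needed instead of a shared static one.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.UnaryOperator;

// Hypothetical sketch: read in batches, process batches on a pool, write from one thread.
public class BatchPipeline {

    // Reads up to batchSize lines; returns an empty list once the input is exhausted.
    static List<String> readBatch(BufferedReader in, int batchSize) throws IOException {
        List<String> batch = new ArrayList<>(batchSize);
        String line;
        while (batch.size() < batchSize && (line = in.readLine()) != null) {
            batch.add(line);
        }
        return batch;
    }

    // Dispatches batches to a fixed thread pool and writes the results.
    // Only the calling thread touches the reader and the writer, so neither needs locking.
    // Returns the number of rows written.
    static int run(BufferedReader in, BufferedWriter out, UnaryOperator<String> processor,
                   int batchSize, int threads)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> pending = new ArrayList<>();
            List<String> batch;
            while (!(batch = readBatch(in, batchSize)).isEmpty()) {
                final List<String> rows = batch;
                // Each task processes one whole batch on a worker thread.
                pending.add(pool.submit(() -> {
                    List<String> result = new ArrayList<>(rows.size());
                    for (String row : rows) {
                        result.add(processor.apply(row));
                    }
                    return result;
                }));
            }
            int written = 0;
            // Collect results in submission order and write them sequentially.
            for (Future<List<String>> future : pending) {
                for (String row : future.get()) {
                    out.write(row);
                    out.newLine();
                    written++;
                }
            }
            out.flush();
            return written;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // toUpperCase is a stand-in for the real (slow) spell check.
        BufferedReader in = new BufferedReader(new StringReader("teh cat\nspeling\nok\n"));
        StringWriter sink = new StringWriter();
        int rows = run(in, new BufferedWriter(sink), String::toUpperCase, 2, 4);
        System.out.println(rows + " rows written");
        System.out.print(sink);
    }
}
```

Note that this sketch submits every batch before it starts writing, so all results are held in memory at once. For a file with over a million rows, or for live updates, it would probably make sense to limit the number of batches in flight, for instance by draining finished futures once a certain number are pending.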

gati sahu