I've been looking for a solution for quite a while now, but I'm still struggling with concurrency and parallelization.
Background: There's an ETL process, and I get a big CSV (up to a million rows or more). In production there will be live updates, too. I want to spell check each row. For that I use an adapted LanguageTool. The check method (with my customizations inside) takes quite a while, and I want to speed it up.
One aspect is of course the method itself, but I also want to simply check multiple rows at a time. The order of the rows is not important. The result is the corrected text, and it should be written to a new CSV file for further processing.
I found that ExecutorService might be a reasonable choice, but since I'm not that familiar with it, some help would be appreciated.
That's how I use it so far in the ETL process:
    private static SpellChecker spellChecker;

    static {
        SpellChecker tmp = null;
        try {
            tmp = new SpellChecker(...);
        } catch (Exception e) {
            e.printStackTrace();
        }
        spellChecker = tmp;
    }

    public static String spellCheck(String input) {
        String output = input.replace("</li>", ".");
        output = searchAVC.removeHtml(output);
        try {
            output = spellChecker.correctText(output);
        } catch (IOException e) {
            e.printStackTrace();
        }
        return output;
    }
My spellChecker is a custom library here, and I create a static instance of it (because instantiation of LanguageTool takes some time). I want to parallelize the execution of spellCheck.
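Here's a minimal sketch of the parallel part as I currently imagine it, using a fixed thread pool sized to the number of cores. The `spellCheck` body below is a stand-in placeholder for my real method, and I'm assuming the shared static spellChecker is safe to call from multiple threads, which I'm not sure about:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSpellCheckSketch {

    // Placeholder standing in for my real spellCheck method
    static String spellCheck(String input) {
        return input.toUpperCase(); // the real code would call spellChecker.correctText(...)
    }

    public static void main(String[] args) throws Exception {
        List<String> rows = List.of("first row", "second row", "third row");

        // One thread per core, since the check is CPU-bound
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        // Submit one task per row; each Future will hold the corrected text
        List<Future<String>> futures = new ArrayList<>();
        for (String row : rows) {
            futures.add(pool.submit(() -> spellCheck(row)));
        }

        // Collect the results (blocking on each Future in turn)
        for (Future<String> f : futures) {
            System.out.println(f.get());
        }
        pool.shutdown();
    }
}
```

Is that roughly the right pattern, or should the tasks be batched instead of one per row?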
I've already read posts like these: https://www.airpair.com/java/posts/parallel-processing-of-io-based-data-with-java-streams, "What is the easiest way to parallelize a task in java?", and "Write to text file from multiple threads?".
I don't really know how to combine all this information. What do I need to consider when reading the file? When writing the file? When processing the rows?
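For reference, this is the kind of end-to-end pipeline I'm trying to get to: read all rows, check them in a thread pool, and have a single thread own the output file so the writes don't need locking. The file names and the `spellCheck` stub are placeholders, and I'm guessing that `ExecutorCompletionService` is the right tool since it hands results back in completion order, which is fine because row order doesn't matter to me:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.io.BufferedWriter;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SpellCheckPipeline {

    // Placeholder for my real spellCheck (just reverses the row for the sketch)
    static String spellCheck(String row) {
        return new StringBuilder(row).reverse().toString();
    }

    public static void main(String[] args) throws Exception {
        Path in = Paths.get("input.csv");      // hypothetical file names
        Path out = Paths.get("corrected.csv");

        List<String> rows = Files.readAllLines(in);

        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        // Results become available as soon as any task finishes,
        // regardless of submission order
        CompletionService<String> cs = new ExecutorCompletionService<>(pool);

        for (String row : rows) {
            cs.submit(() -> spellCheck(row));
        }

        // The main thread is the only writer, so no synchronization
        // is needed on the output file
        try (BufferedWriter writer = Files.newBufferedWriter(out)) {
            for (int i = 0; i < rows.size(); i++) {
                writer.write(cs.take().get()); // blocks until the next result is ready
                writer.newLine();
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

Is the single-writer approach above reasonable, or would a BlockingQueue with a dedicated writer thread be better for the live-update case?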