I found this solution on SO to detect n-grams in a string: (here: N-gram generation from a sentence)
import java.util.*;
public class Test {
public static List<String> ngrams(int n, String str) {
List<String> ngrams = new ArrayList<String>();
String[] words = str.split(" ");
for (int i = 0; i < words.length - n + 1; i++)
ngrams.add(concat(words, i, i+n));
return ngrams;
}
public static String concat(String[] words, int start, int end) {
StringBuilder sb = new StringBuilder();
for (int i = start; i < end; i++)
sb.append((i > start ? " " : "") + words[i]);
return sb.toString();
}
public static void main(String[] args) {
for (int n = 1; n <= 3; n++) {
for (String ngram : ngrams(n, "This is my car."))
System.out.println(ngram);
System.out.println();
}
}
}
=> this bit of code takes by far the longest processing time (28 seconds for the detection of 1-grams, 2-grams, 3-grams and 4grams for my corpus: 4Mb of raw text), as compared to milliseconds for other operations (removal of stopwords etc.)
Does anybody know of solutions in Java which would go faster than the looping solution presented above? (I was thinking multithreading, use of collections, or maybe creative ways to split a string...?) Thanks!