quicker way to detect n-grams in a string?

Question

I found this solution on SO to detect n-grams in a string: (here: N-gram generation from a sentence)

import java.util.*;

public class Test {

    public static List<String> ngrams(int n, String str) {
        List<String> ngrams = new ArrayList<String>();
        String[] words = str.split(" ");
        for (int i = 0; i < words.length - n + 1; i++)
            ngrams.add(concat(words, i, i+n));
        return ngrams;
    }

    public static String concat(String[] words, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++)
            sb.append((i > start ? " " : "") + words[i]);
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            for (String ngram : ngrams(n, "This is my car."))
                System.out.println(ngram);
            System.out.println();
        }
    }
}

=> this bit of code takes by far the longest processing time (28 seconds for the detection of 1-grams, 2-grams, 3-grams and 4grams for my corpus: 4Mb of raw text), as compared to milliseconds for other operations (removal of stopwords etc.)

Does anybody know of solutions in Java which would go faster than the looping solution presented above? (I was thinking multithreading, use of collections, or maybe creative ways to split a string...?) Thanks!

Have you tried profiling what takes time? It seems as a lot of objects are created that would not be needed. Is it the split that takes time or creating the ngram objects or inserting them in the list? — Roger Lindsjö, Jan 02 '12 at 14:02
I'd start with not splitting into separate strings and then assembling them back, but instead scan for the delimiter and just remember the indexes, so for 3gram you keep track of delimiter n, n-1, n-2 and n-3. The 3gram starts a n-3 and ends at n. Then move n forward (m-3 is now what n-2 was and so on. — Roger Lindsjö, Jan 02 '12 at 14:05
thx @RogerLindsjö this looks promising! I tried a bit with a scanner, but I am not sure to understand your method correctly. If I keep track of the last 3 delimiters, how can I retrieve the corresponding 3 words when I reach n (after n-3, n-2, n-1). AFAICS there is no scanner method to reach for previous values (a sort of "scanner.previous()" if you will!). What is it that I don't get? Thx again! — seinecle, Jan 02 '12 at 16:50

Roger Lindsjö · Accepted Answer · 2012-01-03T14:38:52.040

You could try something like this:

public class NGram {

    private final int n;
    private final String text;

    private final int[] indexes;
    private int index = -1;
    private int found = 0;

    public NGram(String text, int n) {
        this.text = text;
        this.n = n;
        indexes = new int[n];
    }

    private boolean seek() {
        if (index >= text.length()) {
            return false;
        }
        push();
        while(++index < text.length()) {
            if (text.charAt(index) == ' ') {
                found++;
                if (found<n) {
                    push();
                } else {
                    return true;
                }
            }
        }
        return true;
    }

    private void push() {
        for (int i = 0; i < n-1; i++) {
            indexes[i] = indexes[i+1];
        }
        indexes[n-1] = index+1;
    }

    private List<String> list() {
        List<String> ngrams = new ArrayList<String>();
        while (seek()) {
            ngrams.add(get());
        }
        return ngrams;
    }

    private String get() {
        return text.substring(indexes[0], index);
    }
}

Testing on about 5mb of text it seems to perform about 10 times faster than the original code. The main difference here is that regex is not used to split and the ngram strings are not created by concatenation.

Update: This is the output I get when running on the text mentioned above, ngram 1-4. I run with 2GB of memory to decide the impact on GC during the runs. I ran multiple times to see the impact of the hotspot compiler.

Loop 01 Code mine ngram 1 time 071ms ngrams 294121
Loop 01 Code orig ngram 1 time 534ms ngrams 294121
Loop 01 Code mine ngram 2 time 016ms ngrams 294120
Loop 01 Code orig ngram 2 time 360ms ngrams 294120
Loop 01 Code mine ngram 3 time 082ms ngrams 294119
Loop 01 Code orig ngram 3 time 319ms ngrams 294119
Loop 01 Code mine ngram 4 time 014ms ngrams 294118
Loop 01 Code orig ngram 4 time 439ms ngrams 294118

Loop 10 Code mine ngram 1 time 013ms ngrams 294121
Loop 10 Code orig ngram 1 time 268ms ngrams 294121
Loop 10 Code mine ngram 2 time 014ms ngrams 294120
Loop 10 Code orig ngram 2 time 323ms ngrams 294120
Loop 10 Code mine ngram 3 time 013ms ngrams 294119
Loop 10 Code orig ngram 3 time 412ms ngrams 294119
Loop 10 Code mine ngram 4 time 014ms ngrams 294118
Loop 10 Code orig ngram 4 time 423ms ngrams 294118

thx Roger! I created a loop for (n = 1;n<=4;n++) and in the loop I instantiate your class with the constructor NGram(String text, int n): as a result I get +/- the same time of execution as the one mentioned in my question. Also, the results are not consistent with the ones I got from the code I quote in my question. Strange... — seinecle, Jan 03 '12 at 08:32
The last time I was testing was for 3-grams. I should have changed that back to n or made a comment for the value. Can you give me a sample that produces different results? I ran both the original code and my code on several different samples with the same results. BTW - the results won't be in the same order between the two programs. — karakuricoder, Jan 03 '12 at 12:33
Karakuri I did not test your code yet, my comment above on different output was addressed to Roger. I'd be glad if you could post a version of yours with n instead of 3! Also, I'm confused: the List returned by your class is "in" or "out"? Seems to me it is "in", but that this would be quite counter-intuitive as a name for an output? — seinecle, Jan 03 '12 at 14:28
When I check the lists generated they are exactly the same. Can you show some text for which they differ? — Roger Lindsjö, Jan 03 '12 at 14:31
When running with your text (joined all lines into one line) I get almost 570000 ngrams. The times vary a bit (a lot of GC), but my implementation takes about 100ms and yours 500ms. A lot of the time is spent in GC (when looping a lot of strings are generated and thrown away). — Roger Lindsjö, Jan 03 '12 at 15:05
Yes. The implementations behave differently for trailing white spaces. Are you running low on memory for your program? Your implementation takes about 2 seconds for 1-4 ngram on my laptop, and your questions says it takes 28 seconds. — Roger Lindsjö, Jan 03 '12 at 15:27

score 0 · Answer 2 · answered Jan 02 '12 at 19:17

Running about 5 megs of Lorus Ipsum text thru the code you supplied generally took about a little over 7 seconds to run for detecting 1 - 4 n-grams. I reworked the code to make a list of the longest n-gram and then iterate over this list producing lists of successively shorter ngrams. In testing it took about 2.6 seconds for the same text. Also, it took a lot less memory.

import java.util.*;

public class Test {

    public static List<String> ngrams(int max, String val) {
        List<String> out = new ArrayList<String>(1000);
        String[] words = val.split(" ");
        for (int i = 0; i < words.length - max + 1; i++) {
            out.add(makeString(words, i,  max));
        }
        return out;
    }

    public static String makeString(String[] words, int start, int length) {
        StringBuilder tmp= new StringBuilder(100);
        for (int i = start; i < start + length; i++) {
            tmp.append(words[i]).append(" ");
        }
        return tmp.substring(0, tmp.length() - 1);
    }

    public static List<String> reduceNgrams(List<String> in, int size) {
        if (1 < size) {
            List<String> working = reduceByOne(in);
            in.addAll(working);
            for (int i = size -2 ; i > 0; i--) {
                working = reduceByOne(working);
                in.addAll(working);
            }
        }
        return in;
    }

    public static List<String> reduceByOne(List<String> in) {
        List<String> out = new ArrayList<String>(in.size());
        int end;
        for (String s : in) {
            end = s.lastIndexOf(" ");
            out.add(s.substring(0, -1 == end ? s.length() : end));  
        }
        //the last one will always reduce twice - words 0, n-1 are in the loop this catches the words 1, n
        String s = in.get(in.size() -1);
        out.add(s.substring(s.indexOf(" ")+1));
        return out;
    }

    public static void main(String[] args) {
        long start;
        start = System.currentTimeMillis();
        List<String> ngrams = ngrams(3, "Your text goes here, actual mileage may vary");
        reduceNgrams(ngrams, 3);
        System.out.println(System.currentTimeMillis() - start);
    }
}

Thx karakuri! I don't understand your code quite well. Why "3" as an argument to reduceNGrams, if 4-n-grams are looked for? Thx! — seinecle, Jan 03 '12 at 08:36
The 3 is left over from testing. Substitute the desired depth. I should have changed this to a variable or provided a comment. — karakuricoder, Jan 03 '12 at 18:43

quicker way to detect n-grams in a string?

2 Answers2

Linked