way to improve ngram generation?

Question

I know there are many threads in this name. I have a code to generate ngrams. But would like to know can it be improved for better speed when handling thousands of strings?

Example String="abcdefghijkl1245ty789"

public static String[] ngrams(String s) {
        int len=12;
        String[] parts = s.split("(?!^)");
        String[] result = new String[parts.length - len + 1];
        for(int i = 0; i < parts.length - len + 1; i++) {
           StringBuilder sb = new StringBuilder();
           for(int k = 0; k < len; k++) {
               sb.append(parts[i+k]);
           }
           result[i] = sb.toString();
        }
        return result;
    }

The above code gets a string,generates ngrmas of given length. In my case its 12.

If you have a vast number of input Strings with a reasonable likelihood of repetition of input Strings, then you may want to consider memoization of this method, since the output depends solely on the input. — Dancrumb, Jul 11 '13 at 13:54

Petr Janeček · Accepted Answer · 2013-07-11T14:09:16.417

6

Sure:

public static String[] ngrams(String str, int length) {
    char[] chars = str.toCharArray();
    final int resultCount = chars.length - length + 1;
    String[] result = new String[resultCount];
    for (int i = 0; i < resultCount; i++) {
        result[i] = new String(chars, i, length);
    }
    return result;
}

The changes I made:

instead of splitting via a regexp, I used String#toCharArray() which does a single array copy and is therefore much faster
instead of rebuilding the resulting Strings from a StringBuilder, I used an appropriate String constructor which, again, does only a single arraycopy
(not needed for performance, but still) I changed the method signature to have length as a parameter for my testing causes. Feel free to change it back - just make sure you rename the method from ngrams() to ngrams12() or something.

Or drop it everything altogether and use a naïve approach with String#substring() that does a similar work under the hood:

public static String[] ngramsSubstring(String str, int length) {
    final int resultCount = str.length() - length + 1;
    String[] result = new String[resultCount];
    for (int i = 0; i < resultCount; i++) {
        result[i] = str.substring(i, i+length);
    }
    return result;
}

By the way, if you ever had to use a regexp in the future, try compiling it once and reusing it instead of compiling it every time the method gets used. For example, your code would look like:

private static final Pattern EVERY_CHAR = Pattern.compile("(?!^)");

and then, in the method, instead of String#split, you'd use

String[] parts = EVERY_CHAR.split(str);

edited Jul 11 '13 at 14:09

answered Jul 11 '13 at 13:22

Petr Janeček

37,768
12
121
145

Feel free to ask any questions! – Petr Janeček Jul 11 '13 at 13:43
Thanks for the reply.Regarding substring option,i read on a some discussion that using substring causee creating new string object every time? so wont that cause heap space error when processing lots of substring operations for generating ngram? – Balaram26 Jul 11 '13 at 14:18
@Balaram26 I'm, not sure I understand correctly. Your solution as well as both mine create a new `String` instance every time they assign to `result[i]`. Additionally, your solution will create a `StringBuilder` object. There's some `char[]` copying included, too. Prior to Java 7 update 6, `substring()` shared the original char array and would therefore save a lot of memory (and `char[]` allocations and copies). – Petr Janeček Jul 11 '13 at 14:31
1

I just tried your substring method . The performance is excellent. Earlier it took 2 minutes to finish 200000 strings. With your method,it finishes 400000 strings in 2 minutes. Thanks for a providing detailed information. – Balaram26 Jul 11 '13 at 14:44
@Balaram26 Depending on your usage, you might get even faster by not really creating the String objects until you really have to and simply storing the start and end values of the n-grams. However, if you need to use/inspect all of those generated Strings right away, this is a no-go. If this solution is good enough for you, ok. If you're seeking more improvements, tell us what are you actually trying to achieve (either here, in a new question or e.g. on [Code Review](http://codereview.stackexchange.com/)) and maybe we'll come up with a better solution. – Petr Janeček Jul 11 '13 at 15:55
hi @PetrJaneček I am doing n-gram generation bi&tri grams. In my whole portion this part gets maximum time & I am already using same way as you suggested any way for more improvement? Thanks! – Divyang Shah Oct 27 '17 at 08:55
@DivyangShah No, after this we're entering the world of microbenchmarks and black magic. Raise a new question here, or preferably on [Code Review](https://codereview.stackexchange.com/), I'll take a look. – Petr Janeček Oct 27 '17 at 10:54
hi @PetrJaneček I just found much faster way than this. see the accepted answer here. https://stackoverflow.com/questions/8701610/quicker-way-to-detect-n-grams-in-a-string It converts your hours in minutes. Hope it will help! – Divyang Shah Oct 27 '17 at 14:03

way to improve ngram generation?

1 Answers1