4

I think this is a fairly basic question concerning Java 8 streams, but I have a difficult time thinking of the right search terms. So I am asking it here. I am just getting into Java 8, so bear with me.

I was wondering how I could map a stream of tokens to a stream of n-grams (represented as arrays of tokens of size n). Suppose that n = 3, then I would like to convert the following stream

{1, 2, 3, 4, 5, 6, 7}

to

{[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6], [5, 6, 7]}

How would I accomplish this with Java 8 streams? It should be possible to compute this concurrently, which is why I am interested in accomplishing this with streams (it also doesn't matter in what order the n-arrays are processed).

Sure, I could do it easily with old-fashioned for-loops, but I would prefer to make use of the stream API.

Tunaki
  • 132,869
  • 46
  • 340
  • 423
Jochem
  • 53
  • 5

4 Answers4

4

If you do not have random access to the source data, you can accomplish this with a custom collector:

List<Integer> data = Arrays.asList(1,2,3,4,5,6,7);

List<List<Integer>> result = data.stream().collect(window(3, toList(), toList()));  

Here's the source for window. It is parallel-friendly:

public static <T, I, A, R> Collector<T, ?, R> window(int windowSize, Collector<T, ?, ? extends I> inner, Collector<I, A, R> outer) {

    class Window {
        final List<T> left = new ArrayList<>(windowSize - 1);
        A mid = outer.supplier().get();
        Deque<T> right = new ArrayDeque<>(windowSize);

        void add(T t) {
            right.addLast(t);
            if (left.size() == windowSize - 1) {
                outer.accumulator().accept(mid, right.stream().collect(inner));
                right.removeFirst();
            } else {
                left.add(t);
            }
        }

        Window merge(Window other) {
            other.left.forEach(this::add);
            if (other.left.size() == windowSize - 1) { 
                this.mid = outer.combiner().apply(mid, other.mid);
                this.right = other.right;
            }
            return this;
        }

        R finish() {
            return outer.finisher().apply(mid);
        }
    }

    return Collector.of(Window::new, Window::add, Window::merge, Window::finish);
}
Misha
  • 27,433
  • 6
  • 62
  • 78
3

Such an operation is not really suited for the Stream API. In the functional jargon, what you're trying to do is called a sliding window of size n. Scala has it built-in with the sliding() method, but there is nothing built-in in the Java Stream API.

You have to rely on using a Stream over the indexes of the input list to make that happen.

public static void main(String[] args) {
    List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7);
    List<List<Integer>> result = nGrams(list, 3);
    System.out.println(result);
}

private static <T> List<List<T>> nGrams(List<T> list, int n) {
    return IntStream.range(0, list.size() - n + 1)
                    .mapToObj(i -> new ArrayList<>(list.subList(i, i + n)))
                    .collect(Collectors.toList());
}

This code simply makes a Stream over the indexes of the input list, maps each of them to a new list that is the result of getting the values of the list from i to i+n (excluded) and collect all that into a List.

Tunaki
  • 132,869
  • 46
  • 340
  • 423
  • So it's not possible to map a stream of tokens to a stream of n-grams directly? You need to have random access for a list/array of tokens? – Jochem Feb 02 '16 at 22:35
  • @Jochem, yes. Streams aren't designed for what you're trying to do. – Louis Wasserman Feb 02 '16 at 22:39
  • That's a shame. I have read a lot of the Java 8 tutorials and thought I had a decent understanding of what is what meant for. Are there any guidelines for when you should be using streams and when you should not be using them? I was so happy with my new "hammer" that I tried to use it for everything. :( – Jochem Feb 02 '16 at 22:43
  • 1
    @Jochem Basically, Streams are really suited when you consider each element independently of the rest (not tied between themselves). You can read [that answer](http://stackoverflow.com/a/20507988/1743880) by Stuart Marks (who is a JDK dev) that explains also. – Tunaki Feb 02 '16 at 22:49
  • @Holger This is indeed simpler... I edited but I prefer to wrap it in a new ArrayList to avoid keeping a reference to the input list. – Tunaki Feb 03 '16 at 12:40
  • Well, that depends on the operation. Since you collect into a `List`, that’s reasonable. But if you create the sublists just for performing another chained stream operation, the wrapping is unnecessary. The question doesn’t provide enough details regarding the use case… – Holger Feb 03 '16 at 12:42
0

Based on https://stackoverflow.com/a/20507988/11451863

the following should work

int n = 3;
List<Integer> intList = Arrays.asList(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);

IntStream.rangeClosed(0, intList.size() - n)
        .mapToObj(i -> intList.subList(i, i+n))
        .collect(Collectors.toList());
vahbuna
  • 31
  • 3
0

This solution uses the reduce function. It does not work with parallelization.

The general idea is to create a list of n grams. For each element, we take the last n-gram, and slide the window using this element.

public static void main(String[] args) {
    int n = 3;
    // creating the initial list of tokens
    List<Integer> ints = IntStream.range(1,8).boxed().toList();

    /creating the first ngram
    List<List<Integer>> ngram = new ArrayList<>();
    ngram.add(ints.subList(0,n));

    // This is where the ngram list is created
    List<List<Integer>> ngrams = ints.stream().skip(n)
                   .reduce(ngram, WikiDictionaryExtractor::addWindow, (l1, l2)-> null);
}


public static List<List<Integer>> addWindow(List<List<Integer>> input, Integer newInt){
    List<Integer> res = new ArrayList(input.get(input.size()-1));
    res.remove(0);
    res.add(newInt);
    input.add(res);
    return input;
}
Shaharg
  • 971
  • 1
  • 11
  • 26