Scala: Functional aggregation of Seq[T] elements => Seq[Seq[T]] (preserving order)

Question

I'd like to aggregate compatible elements in a sequence, i.e. transform a Seq[T] into a Seq[Seq[T]] where elements in each subsequence are compatible with each other while the original seq order is preserved, e.g. from

case class X(i: Int, n: Int) {
  def canJoin(that: X): Boolean = this.n == that.n
  override val toString = i + "." + n
}
val xs = Seq(X(1, 1), X(2, 3), X(3, 3), X(4, 3), X(5, 1), X(6, 2), X(7, 2), X(8, 1))
/* xs = List(1.1, 2.3, 3.3, 4.3, 5.1, 6.2, 7.2, 8.1) */

want to get

val js = join(xs)
/* js = List(List(1.1), List(2.3, 3.3, 4.3), List(5.1), List(6.2, 7.2), List(8.1)) */

I've tried to do this in a functional way, but I got stuck halfway:

Doing with a while loop

def split(seq: Seq[X]): (Seq[X], Seq[X]) = seq.span(_ canJoin seq.head)
def join(seq: Seq[X]): Seq[Seq[X]] = {
  var pp = Seq[Seq[X]]()
  var s = seq
  while (!s.isEmpty) {
    val (p, r) = split(s)
    pp :+= p
    s = r
  }
  pp
}

With the split I'm satisfied, but the join seems to be a little bit too long.

In my opinion, that's a standard task. That leads me to the question:

Are there functions in the collections library that makes it possible to reduce code size?
Or is there perhaps a different approach to solve the task? Especially another approach than in Rewriting a sequence by partitioning and collapsing?

Replacing while loop with tail recursion

def join(xs: Seq[X]): Seq[Seq[X]] = {
  @annotation.tailrec
  def jointr(pp: Seq[Seq[X]], rem: Seq[X]): Seq[Seq[X]] = {
    val (p, r) = split(rem)
    val pp2 = pp :+ p
    if (r.isEmpty) pp2 else jointr(pp2, r)
  }
  jointr(Seq(), xs)
}

So we posted an almost identical answer/edit at the same time :D — Peter Schmitz, Nov 10 '11 at 21:23
My answer isn't tail recursive because the last operation is the Seq-concatenation. — Peter Schmitz, Nov 10 '11 at 21:24
I would place `jointr` into the definition of `join` as a helper function. — Peter Schmitz, Nov 10 '11 at 21:28
@Peter Schmitz: made jointr a helper def of join like suggested. — binuWADa, Nov 10 '11 at 21:37
As a small hint, only in case you don't know: `jointr(Seq(),xs)` is sufficient, types are inferred. — Peter Schmitz, Nov 10 '11 at 21:39
It's overkill, but doing something with element pairs is sufficiently common that I used it in my example on pimping the collections library: http://stackoverflow.com/questions/5410846. With that pimp, it would just be `xs.groupedWhile(_ canJoin _)` (that is, it solves the same problem). It's hardly simple, but it's maximally flexible (returns correct types, works on arrays, etc.). — Rex Kerr, Nov 10 '11 at 22:08
@Rex Kerr: Such an usage (`xs.groupedWhile(_ canJoin _)`) I really had in mind when I asked. And I had your linked question even in my favorites but hadn't yet worked through. It's a lot of code necessary but when vanishing in the library (one times) it's more than okay. I like this solution! — binuWADa, Nov 10 '11 at 23:41

score 8 · Accepted Answer · answered Nov 10 '11 at 21:21

8

def join(seq: Seq[X]): Seq[Seq[X]] = {
  if (seq.isEmpty) return Seq()
  val (p,r) = split(seq)
  Seq(p) ++ join(r)
}

answered Nov 10 '11 at 21:21

Peter Schmitz

5,824
4
26
48

Very nice (short and easy to understand) :-) – binuWADa Nov 10 '11 at 21:30

tenshi · Answer 2 · 2011-11-11T08:46:04.547

4

Here is foldLeft version:

def join(seq: Seq[X]) = xs.reverse.foldLeft(Nil: List[List[X]]) {
    case ((top :: group) :: rest, x) if x canJoin top => 
        (x :: top :: group) :: rest
    case (list, x) => (x :: Nil) :: list
}

and foldRight version (you don't need reverse the list in this case):

def join(seq: Seq[X]) = xs.foldRight(Nil: List[List[X]]) {
    case (x, (top :: group) :: rest) if x canJoin top => 
        (x :: top :: group) :: rest
    case (x, list) => (x :: Nil) :: list
}

edited Nov 11 '11 at 08:46

answered Nov 10 '11 at 22:14

tenshi

26,268
8
76
90

1

I've tried it, and it works. I just don't know yet exactly how it works because I'm not that familiar with lists, and can handle intuitively. But in a quiet moment I will look again more closely, the code does look nice already. – binuWADa Nov 10 '11 at 23:17

binuWADa · Answer 3 · 2011-11-13T20:08:16.003

Benchmark

Because I had too much time ;-), I asked myself, as it is with the runtime of the different approaches to get a feeling for whether or not a heavy construct is lurking behind a lightweight syntax.

So I created a micro-benchmark for measuring the running time for the three seqs

(1, 3, 3, 3, 1, 2, 2, 1)
(1, 2, 3, 4, 5, 6, 7, 8, 8, 8, 8, 8, 7, 6, 5, 4, 3, 3, 3, 2, 1, 2, 3)
(2, 2, 3, 4, 5, 6, 7, 8, 8, 8, 8, 8, 8, 8, 8, 7, 6, 5, 4, 4, 4, 4, 3, 3, 3, 2, 1)

and got the following result:

Summary

EDIT: new results (begin):

On incorporation of results in my real project I came across inconsistencies regarding the benchmarks. So I repeated the benchmark again with more warm-up laps (now 1000), so the JIT compiler gets the most out of the code. Thus was shuffling the results and born a new favorite for me: X7 (pimp my lib) = pleasure without remorse. And the List version X8 (reverse.foldLeft) are now very fast too.

Nr (Approach)                      Running time (ns)  Contributor
X2 (poor.reference.impl)            in    15.202 ns
X1 (original while loop)            in     8.166 ns
X3 (tail recursion)                 in     7.473 ns
X4 (recursion with ++)              in     6.671 ns   Peter Schmitz
X5 (simplified recursion with ++)   in     6.161 ns   Peter Schmitz
X6 (foldRight)                      in     4.083 ns   tenshi
X7 (pimp my lib)                    in     1.677 ns   Rex Kerr
X8 (reverse.foldLeft)               in     1.349 ns   tenshi

EDIT: new results (end)

old results:

Nr (Approach)                      Running time (ns)  Contributor
X2 (poor.reference.impl)            in 2.972.015 ns
X7 (pimp my lib)                    in 1.185.599 ns   Rex Kerr
X3 (tail recursion)                 in 1.027.008 ns
X8 (reverse.foldLeft)               in   643.840 ns   tenshi
X6 (foldRight)                      in   608.112 ns   ""
X1 (original while loop)            in   564.726 ns
X4 (recursion with ++)              in   468.478 ns   Peter Schmitz
X5 (simplified recursion with ++)   in   447.699 ns   ""

Details

X2 (poor.reference.impl)

// in    15.202 ns
import collection.mutable.ArrayBuffer
def join2(seq: Seq[X]): Seq[Seq[X]] = {
  var pp = Seq[ArrayBuffer[X]](ArrayBuffer(seq(0)))
  for (i <- 1 until seq.size) {
    if (seq(i) canJoin seq(i - 1)) {
      pp.last += seq(i)
    } else {
      pp :+= ArrayBuffer(seq(i))
    }
  }
  pp
}

X1 (while loop)

// in     8.166 ns
def join(xs: Seq[X]): Seq[Seq[X]] = {
  var xss = Seq.empty[Seq[X]]
  var s = xs
  while (!s.isEmpty) {
    val (p, r) = split(s)
    xss :+= p
    s = r
  }
  xss
}

This is the original imperative approach at the beginning of the question.

X3 (tail recursion)

// in     7.473 ns
def join(xs: Seq[X]): Seq[Seq[X]] = {
  @annotation.tailrec
  def jointr(xss: Seq[Seq[X]], rxs: Seq[X]): Seq[Seq[X]] = {
    val (g, r) = split(rxs)
    val xsn = xss :+ g
    if (r.isEmpty) xsn else jointr(xsn, r)
  }
  jointr(Seq(), xs)
}

X4 (recursion with ++)

// in     6.671 ns
def join(seq: Seq[X]): Seq[Seq[X]] = {
  if (seq.isEmpty) return Seq()
  val (p, r) = split(seq)
  Seq(p) ++ join(r)
}

X5 (simplified recursion with ++)

// in     6.161 ns
def join(xs: Seq[X]): Seq[Seq[X]] = if (xs.isEmpty) Seq() else {
  val (p, r) = split(xs)
  Seq(p) ++ join(r)
}

The simplifying is nearly the same but still a little bit faster.

X6 (foldRight)

// in     4.083 ns
def join(xs: Seq[X]) = xs.foldRight(Nil: List[List[X]]) {
  case (x, (top :: group) :: rest) if x canJoin top => (x :: top :: group) :: rest
  case (x, list)                                    => (x :: Nil) :: list
}

Tried to avoid reverse but foldRight seems to be even a little bit worser than reverse.foldLeft for list's.

X7 (pimp my lib)

// in     1.677 ns
import collection.generic.CanBuildFrom
class GroupingCollection[A, C, D[C]](ca: C)(
    implicit c2i: C => Iterable[A],
    cbf: CanBuildFrom[C, C, D[C]],
    cbfi: CanBuildFrom[C, A, C]) {
  def groupedWhile(p: (A, A) => Boolean): D[C] = {
    val it = c2i(ca).iterator
    val cca = cbf()
    if (!it.hasNext) cca.result
    else {
      val as = cbfi()
      var olda = it.next
      as += olda
      while (it.hasNext) {
        val a = it.next
        if (p(olda, a)) as += a
        else { cca += as.result; as.clear; as += a }
        olda = a
      }
      cca += as.result
    }
    cca.result
  }
}
implicit def collections_have_grouping[A, C[A]](ca: C[A])(
  implicit c2i: C[A] => Iterable[A],
  cbf: CanBuildFrom[C[A], C[A], C[C[A]]],
  cbfi: CanBuildFrom[C[A], A, C[A]]) = {
  new GroupingCollection[A, C[A], C](ca)(c2i, cbf, cbfi)
}
// xs.groupedWhile(_ canJoin _)

X8 (reverse.foldLeft)

// in     1.349 ns
def join(xs: Seq[X]) = xs.reverse.foldLeft(Nil: List[List[X]]) {
  case ((top :: group) :: rest, x) if x canJoin top => (x :: top :: group) :: rest
  case (list, x)                                    => (x :: Nil) :: list
}

Conclusion

The different approaches (X1, X3, X4, X5, X6) all play in the same league.

Because X7 (pimp my lib) allows a very concise usage xs.groupedWhile(_ canJoin _) and cause the much necessary code can be hidden in the own util lib's, I've decided to use this in my real project.