2

I'm trying to take a iterator of Strings, and turn it into an iterator of collections of strings based on an arbitrary splitting function.

So say I have

val splitter: String => Boolean = s => s.isEmpty

then I want it to take

val data = List("abc", "def", "", "ghi", "jkl", "mno", "", "pqr").iterator

and have

def f[A] (input: Iterator[A], splitFcn: A => Boolean): Iterator[X[A]]

where X can be any collection-like class you want, so long as it can be converted into a Seq, such that

f(data, splitter).foreach(println(_.toList))

outputs

    List("abc", "def")
    List("ghi", "jkl", "mno")
    List("pqr")

Is there a clean way to do this, that does not require collecting the results of the input iterator entirely into memory?

Nathan Kronenfeld
  • 473
  • 1
  • 5
  • 14

4 Answers4

1

This should do what you want:

  val splitter: String => Boolean = s => s.isEmpty
  val data = List("abc", "def", "", "ghi", "jkl", "", "mno", "pqr")

  def splitList[A](l: List[A], p: A => Boolean):List[List[A]] = {
    l match {
      case Nil => Nil
      case _ =>
        val (h, t) = l.span(a => !p(a))
        h :: splitList(t.drop(1), p)
    }
  }

  println(splitList(data, splitter))
//prints List(List(abc, def), List(ghi, jkl), List(mno, pqr))
Noah
  • 13,821
  • 4
  • 36
  • 45
  • what if the author meant that he didn't want to traverse the original iterator to the end before returning it from that method? – Ashalynd Apr 10 '14 at 22:02
  • I don't care about it being lists or sequences or whatnot, that was just my simple way of getting iterators. It should take Iterator[A], and return Iterator[X[A]], where X is something that can be collected into a sequence - I don't really care what. – Nathan Kronenfeld Apr 10 '14 at 22:10
  • Exactly, Ashalynd - I'm looking for the inverse of flatMap, and for exactly that reason - so I can do this as I run over the original iterator, without ever having to have the whole list in memory at once. – Nathan Kronenfeld Apr 10 '14 at 22:13
  • edited question to clear up the confusion made apparent by these comments, I hope – Nathan Kronenfeld Apr 10 '14 at 22:20
  • I updated the question a little, but it does what you're looking for. This is kind of 'reverse `flatmap`' but really the inverse of `flatMap` is `coFlatMap` but we really don't need to get into that... – Noah Apr 10 '14 at 22:58
0

UPDATE #2: Travis Brown answered another question using Scalaz-streams, an interesting package that might be helpful to you here. I am just starting to look at the package, but was quickly able to use it to read data from a file containing this:

abc
def

ghi
jkl
mno

pqr

and produce another file that looked like this:

Vector(abc, def, )
Vector(ghi, jkl, mno, )
Vector(pqr)

The library only holds the Vector being accumulated in memory. Here's my code (which should be considered dangerous, as I barely know anything about Scalaz-streams):

import scalaz.stream._
io.linesR("/tmp/a")
  .pipe( process1.chunkBy(_.nonEmpty) )
  .map( _.toString + "\n" )
  .pipe(text.utf8Encode)
  .to( io.fileChunkW("/tmp/b") )
  .run.run

Key to your task is the chunkBy(_.nonEmpty), which accumulates lines into a Vector until it hits an empty line. I have no idea at this point why you have to say run twice.

Old stuff below.

UPDATE #1: Ah! I just discovered the new constraint that it not all be read into memory. This solution isn't for you, then; you'd want Iterators or Streams.

I'm guessing that you'd want to enrich Traversable. And with the function in a separate argument list, the compiler can infer the types. For performance you would probably only want to make one pass over the data. And to avoid crashing with large datasets (and for performance), you wouldn't want any recursion that is not tail-recursion. Given this enricher:

implicit class EnrichedTraversable[A]( val xs:Traversable[A] ) extends AnyVal {
  def splitWhere( f: A => Boolean ) = {
    @tailrec
    def loop( xs:Traversable[A], group:Seq[A], groups:Seq[Seq[A]] ):Seq[Seq[A]] =
      if ( xs.isEmpty ) {
        groups :+ group
      } else {
        val x    = xs.head
        val rest = xs.tail
        if ( f(x) ) loop( rest, Vector(), groups :+ group )
        else        loop( rest, group :+ x, groups )
      }
    loop( xs, Vector(), Vector() )
  }
}

you can do this:

List("a","b","","c","d") splitWhere (_.isEmpty)

Here are some tests you might want to check out, to be sure the semantics are what you want (I personally like splits to behave this way):

val xs = List("a","b","","d","e","","f","g")    //> xs  : List[String] = List(a, b, "", d, e, "", f, g)
xs               splitWhere (_.isEmpty)         //> res0: Seq[Seq[String]] = Vector(Vector(a, b), Vector(d, e), Vector(f, g))
List("a","b","") splitWhere (_.isEmpty)         //> res1: Seq[Seq[String]] = Vector(Vector(a, b), Vector())
List("")         splitWhere (_.isEmpty)         //> res2: Seq[Seq[String]] = Vector(Vector(), Vector())
List[String]()   splitWhere (_.isEmpty)         //> res3: Seq[Seq[String]] = Vector(Vector())
Vector("a","b","","c") splitWhere (_.isEmpty)   //> res4: Seq[Seq[String]] = Vector(Vector(a, b), Vector(c))
Community
  • 1
  • 1
AmigoNico
  • 6,652
  • 1
  • 35
  • 45
0

I think Stream is what you want since they are evaluated lazily (not everything in memory).

def split[A](inputStream: Stream[A], splitter: A => Boolean): Stream[List[A]] = {
    var accumulationList: List[A] = Nil
    def loop(inputStream: Stream[A]): Stream[List[A]] = {
      if (inputStream.isEmpty) {
        if (accumulationList.isEmpty)
          Stream.empty[List[A]]
        else
          accumulationList.reverse #:: Stream.empty[List[A]]
      } else if (splitter(inputStream.head)) {
        val outputList = accumulationList.reverse
        accumulationList = Nil
        if (outputList.isEmpty)
          loop(inputStream.tail)
        else
          outputList #:: loop(inputStream.tail)
      } else {
        accumulationList = inputStream.head :: accumulationList
        loop(inputStream.tail)
      }
    }
    loop(inputStream)
  }

  val splitter = { s: String => s.isEmpty }
  val list = List("asdf", "aa", "", "fw", "", "wfwf", "", "")
  val stream = split(list.toStream, splitter)
  stream foreach println

The output is:

List(asdf, aa)
List(fw)
List(wfwf)

EDIT:
I have not looked at it in detail, but I guess my recursive method loop could be replaced by a foldLeft or foldRight.

toto2
  • 5,306
  • 21
  • 24
0

Here it is:

scala> val data = List("abc", "def", "", "ghi", "jkl", "mno", "", "pqr").iterator
data: Iterator[String] = non-empty iterator

scala> val splitter: String => Boolean = s => s.isEmpty
splitter: String => Boolean = <function1>


scala> def f[A](in: Iterator[A], sf: A => Boolean): Iterator[Iterator[A]] = 
         in.hasNext  match {
     |   case false => Iterator()
     |   case true  => Iterator(in.takeWhile(x => !sf(x))) ++ f(in, sf)
     | }
f: [A](in: Iterator[A], sf: A => Boolean)Iterator[Iterator[A]]

scala> f(data, splitter) foreach (x => println(x.toList))
List(abc, def)
List(ghi, jkl, mno)
List(pqr)
Eastsun
  • 18,526
  • 6
  • 57
  • 81