14

I'm trying to 'group' a string into segments, I guess this example would explain it more succintly

scala> val str: String = "aaaabbcddeeeeeeffg"
... (do something)
res0: List("aaaa","bb","c","dd","eeeee","ff","g")

I can thnk of a few ways to do this in an imperative style (with vars and stepping through the string to find groups) but I was wondering if any better functional solution could be attained? I've been looking through the Scala API but there doesn't seem to be something that fits my needs.

Any help would be appreciated

djhworld
  • 6,726
  • 4
  • 30
  • 44

9 Answers9

22

You can split the string recursively with span:

def s(x : String) : List[String] = if(x.size == 0) Nil else {
    val (l,r) = x.span(_ == x(0))
    l :: s(r) 
}

Tail recursive:

@annotation.tailrec def s(x : String, y : List[String] = Nil) : List[String] = {
    if(x.size == 0) y.reverse 
    else {
        val (l,r) = x.span(_ == x(0))
        s(r, l :: y)
    }
}
Thomas Jung
  • 32,428
  • 9
  • 84
  • 114
  • @Paul - No, I don't think so. The match would take more space, and a check against size is pretty clear. @Thomas - See my comment to Martin Ring's answer. I like this answer, but I do want to point out the drawbacks of this style of method. – Rex Kerr Mar 09 '11 at 17:37
  • @Paul You could implement it as Martin has done it. I actually did this but it was not more readable. – Thomas Jung Mar 09 '11 at 18:00
  • Thanks for your reply, I don't really have much experience with the `span` method so reading something like this has given me some great insight into your thought process behind the solution for this question, so thanks! – djhworld Mar 09 '11 at 23:10
  • Any ideas how to make this tail recursive? I'm getting stack overflow errors with large inputs – djhworld Mar 10 '11 at 14:58
16

Seems that all other answers are very concentrated on collection operations. But pure string + regex solution is much simpler:

str split """(?<=(\w))(?!\1)""" toList

In this regex I use positive lookbehind and negative lookahead for the captured char

tenshi
  • 26,268
  • 8
  • 76
  • 90
  • 2
    +1 - very nice! I always forget about the amazing power of regexes. – Rex Kerr Mar 09 '11 at 19:13
  • 9
    And the amazing unreadability. Can you really say you understand that one without careful study. – The Archetypal Paul Mar 09 '11 at 19:32
  • Is this "functional"? @Paul That's what comments are for – Raphael Mar 09 '11 at 20:02
  • @Raphael: I think, that this code is functional. Though I don't think that term "functional" is strictly defined and means the same thing for everybody... or maybe it's better to say that I don't see any reason why this code should be considered imperative or non-functional. – tenshi Mar 09 '11 at 20:13
  • Is the regex matcher implemented in a functional way? But you are probably right, the term has not been defined sufficiently in this question. – Raphael Mar 09 '11 at 20:41
  • @Raphael: I think the better question would be whether this code or `split` function are referentially transparent. And they are referentially transparent. – tenshi Mar 09 '11 at 20:47
  • I was thinking maybe a regex expression might be a good way to approach this but I'm a bit perplexed by the complexity of regular expressions. Whilst the above expression might be succinct it makes no sense to me whatsoever - but thanks for the answer it's certainly very interesting! – djhworld Mar 09 '11 at 23:01
  • @Raphael, I see no comments :) And I really dislike code that needs a comment just to explain what it does. So yes, regular expressions work, and are good for a codegolf style solution, but I prefer Kernighan and Pike's "say what you mean, simply and directly. " – The Archetypal Paul Mar 10 '11 at 09:52
  • @Paul, This code is cryptic only because you are not used to regular expressions. I need to use regular expressions often and I do not need any comment to understand it. Even the solution of Martin Ring needs careful study if you are not used to this style of programming. – Marcello Nuccio Mar 10 '11 at 13:34
  • 1
    @marcello, but I am familiar with regexps. Why did you assume I am not? I think this use counts as advanced regexp use (lookahead and lookbehind) and will leave most readers behind. I would claim Martin's solution is more comprehensible to more readers. – The Archetypal Paul Mar 10 '11 at 16:51
  • 3
    @Paul: I agree, but consider this: If you used the above code for real, you would put it in a method, give that a nice name `splitByGroup`, document and unit-test it. The guy who wants to split strings does not have to understand the regex, as long as the implementer of `splitByGroup` does. – Raphael Mar 10 '11 at 17:50
13
def group(s: String): List[String] = s match {
  case "" => Nil
  case s  => s.takeWhile(_==s.head) :: group(s.dropWhile(_==s.head))
}

Edit: Tail recursive version:

def group(s: String, result: List[String] = Nil): List[String] = s match {
  case "" => result reverse
  case s  => group(s.dropWhile(_==s.head), s.takeWhile(_==s.head) :: result)
}

can be used just like the other because the second parameter has a default value and thus doesnt have to be supplied.

Martin Ring
  • 5,404
  • 24
  • 47
  • @Rex Kerr why would it be `O(n^2)`? I think there are 2 comparison operations for each character of the string which is less then optimal (better solution is to use `span` like Thomas suggested), but that still is `O(n)` or not? Am i missing something? – Martin Ring Mar 09 '11 at 17:42
  • @Rex I'd say you consume every character once (okay actually twice in this implementation). This would be O(n). (Yes, you could make it tail-recursive. This is an exercise not a problem.) – Thomas Jung Mar 09 '11 at 17:56
  • @Martin, @Thomas - If every character is different, you generate `n` strings of average length `n/2`, for `O(n^2)` work total. – Rex Kerr Mar 09 '11 at 19:11
  • 1
    @Rex Creating a substring of length `k` is `O(1)` and not `O(k)`. (See [here](http://stackoverflow.com/questions/4679746/time-complexity-of-javas-substring)) – Martin Ring Mar 09 '11 at 19:41
  • @Martin - I'm wrong, but not for the reason you stated; I didn't think that the `dropWhile` managed to make it all the way back to `substring`. But I checked the source and it looks like it does. So there's only the stack overflow problem, not the `O(n^2)` problem. – Rex Kerr Mar 09 '11 at 20:04
  • (Deleted my initial misleading comment.) – Rex Kerr Mar 09 '11 at 20:42
  • This is certainly the most readable answer I've seen for this and it most definitely my 'accepted' approach to solving my problem. I'm still learning the functional paradigm so it's certainly very beneficial for me to read something like this! – djhworld Mar 09 '11 at 23:04
  • Any ideas how to make this tail recursive? I'm getting stack overflow errors with large inputs – djhworld Mar 10 '11 at 14:56
4

Make it one-liner:

scala>  val str = "aaaabbcddddeeeeefff"
str: java.lang.String = aaaabbcddddeeeeefff

scala> str.groupBy(identity).map(_._2)
res: scala.collection.immutable.Iterable[String] = List(eeeee, fff, aaaa, bb, c, dddd)

UPDATE:

As @Paul mentioned about the order here is updated version:

scala> str.groupBy(identity).toList.sortBy(_._1).map(_._2)
res: List[String] = List(aaaa, bb, c, dddd, eeeee, fff)
Rustem Suniev
  • 1,149
  • 9
  • 8
  • 1
    I think it's clear from the OP's example that order of the segments matters. – The Archetypal Paul Mar 09 '11 at 16:44
  • @Paul The order wasn't mentioned by anyway I updated the code – Rustem Suniev Mar 09 '11 at 16:57
  • 1
    this would result in `List(bb, aaa)` for `"aabba"`... wouldnt it? Dont know if that is what djhworld wanted. – Martin Ring Mar 09 '11 at 17:26
  • @Martin Ring - Well yeah it will be List(aaa,bb). I guess the way the question was asked it can be interpreted in many ways. – Rustem Suniev Mar 09 '11 at 17:40
  • I'm a bit confused, where does this 'identity' thing come from? Thanks for your answer but it's not quite what I'm after, I'm interesting in the specific order the string comes in in the first place – djhworld Mar 09 '11 at 23:06
  • `identity` is defined in Predef and always in scope. Its definition is `def identity[A](x: A): A = x`, which simply returns what it was given. – michael.kebe Mar 10 '11 at 09:56
1

You could use some helper functions like this:

val str = "aaaabbcddddeeeeefff"

def zame(chars:List[Char]) = chars.partition(_==chars.head)

def q(chars:List[Char]):List[List[Char]] = chars match {
    case Nil => Nil
    case rest =>
        val (thesame,others) = zame(rest)
        thesame :: q(others)
}

q(str.toList) map (_.mkString)

This should do the trick, right? No doubt it can be cleaned up into one-liners even further

Anne
  • 163
  • 6
  • Partition doesn't do what is wanted, I think. Given aaabbbbbaa it will return aaaaa bbbbb – The Archetypal Paul Mar 09 '11 at 16:43
  • partition splits the list in two, so splitting "aaaabbbbaaa".toList like this will yield "aaaa","bbbbaaaaa" and then the same function is applied for the remainder of the list in a recursive fashion. I think it does do what you need – Anne Mar 10 '11 at 08:36
  • Partition does not split the list at a point. It collects all the elements that match or don't match the predicate `scala> "aaabbbbaaa" partition (_ == 'a') res0: (String, String) = (aaaaaa,bbbb) scala>` – The Archetypal Paul Mar 10 '11 at 09:30
  • The function you are looking for is `span` (see Thomas' post) – Martin Ring Mar 11 '11 at 07:53
  • Yes. You are right, partition is wrong here. Didn't read through the documentation well enough. Thx – Anne Mar 11 '11 at 11:37
1

A functional* solution using fold:

def group(s : String) : Seq[String] = {
  s.tail.foldLeft(Seq(s.head.toString)) { case (carry, elem) =>
    if ( carry.last(0) == elem ) {
      carry.init :+ (carry.last + elem)
    }
    else {
      carry :+ elem.toString
    }
  }
}

There is a lot of cost hidden in all those sequence operations performed on strings (via implicit conversion). I guess the real complexity heavily depends on the kind of Seq strings are converted to.

(*) Afaik all/most operations in the collection library depend in iterators, an imho inherently unfunctional concept. But the code looks functional, at least.

Raphael
  • 9,779
  • 5
  • 63
  • 94
1

Starting Scala 2.13, List is now provided with the unfold builder which can be combined with String::span:

List.unfold("aaaabbaaacdeeffg") {
  case ""   => None
  case rest => Some(rest.span(_ == rest.head))
}
// List[String] = List("aaaa", "bb", "aaa", "c", "d", "ee", "ff", "g")

or alternatively, coupled with Scala 2.13's Option#unless builder:

List.unfold("aaaabbaaacdeeffg") {
  rest => Option.unless(rest.isEmpty)(rest.span(_ == rest.head))
}
// List[String] = List("aaaa", "bb", "aaa", "c", "d", "ee", "ff", "g")

Details:

  • Unfold (def unfold[A, S](init: S)(f: (S) => Option[(A, S)]): List[A]) is based on an internal state (init) which is initialized in our case with "aaaabbaaacdeeffg".
  • For each iteration, we span (def span(p: (Char) => Boolean): (String, String)) this internal state in order to find the prefix containing the same symbol and produce a (String, String) tuple which contains the prefix and the rest of the string. span is very fortunate in this context as it produces exactly what unfold expects: a tuple containing the next element of the list and the new internal state.
  • The unfolding stops when the internal state is "" in which case we produce None as expected by unfold to exit.
Xavier Guihot
  • 54,987
  • 21
  • 291
  • 190
0

Edit: Have to read more carefully. Below is no functional code.

Sometimes, a little mutable state helps:

def group(s : String) = {
  var tmp = ""
  val b = Seq.newBuilder[String]

  s.foreach { c =>
    if ( tmp != "" && tmp.head != c ) {
      b += tmp
      tmp = ""
    }

    tmp += c
  }
  b += tmp

  b.result
}

Runtime O(n) (if segments have at most constant length) and tmp.+= probably creates the most overhead. Use a string builder instead for strict runtime in O(n).

group("aaaabbcddeeeeeeffg")
> Seq[String] = List(aaaa, bb, c, dd, eeeeee, ff, g)
Raphael
  • 9,779
  • 5
  • 63
  • 94
  • If there was some method on `collection.mutable.Seq` that actually *modified* the sequence, you could use a double-linked-list in tmp and be in O(n) time and memory. – Raphael Mar 09 '11 at 18:42
0

If you want to use scala API you can use the built in function for that:

str.groupBy(c => c).values

Or if you mind it being sorted and in a list:

str.groupBy(c => c).values.toList.sorted
Guillem I
  • 88
  • 4