30

I really like functional programming concepts, but I've been bitten on two separate occasions now by the same gotcha, when mapping across a collection which happens to be a Set (i.e. automatically removes duplicates). The issue is that after transforming the elements of such a set, the output container is also a set, and so removes any duplicates of the tranformed output.

A very brief REPL session to illustrate the issue:

scala> case class Person(name: String, age: Int)
defined class Person

scala> val students = Set(Person("Alice", 18), Person("Bob", 18), Person("Charles", 19))
students: scala.collection.immutable.Set[Person] = Set(Person(Alice,18), Person(Bob,18), Person(Charles,19))

scala> val totalAge = (students map (_.age)).sum
totalAge: Int = 37

I would of course expect the total age to be 18 + 18 + 19 = 55, but because the students were stored in a Set, so were their ages after the mapping, hence one of the 18s disappeared before the ages were summed.

In real code this is often more insidious and harder to spot, especially if you write utility code which simply takes a Traversable and/or use the output of methods which are declared to return a Traversable (the implementation of which happens to be a Set). It seems to me that these situations are almost impossible to spot reliably, until/unless they manifest as a bug.

So, are there any best practices which will reduce my exposure to this issue? Am I wrong to think about map-ping over a general Traversable as conceptually transforming each element in place, as opposed to adding the transformed elements in turn into some new collection? Should I call .toStream on everything before mapping, if I want to keep this mental model?

Any tips/recommendations would be greatly appreciated.

Update: Most of the answers so far have focused on the mechanics of including the duplicates in the sum. I'm more interested in the practices involved when writing code in the general case - have you drilled yourself to always call toList on every collection before calling map? Do you fastidiously check the concrete classes of all the collections in your app before calling methods on them? Etc.

Fixing up something that's already been identified as a problem is trivial - the hard part is preventing these errors from creeping in in the first place.

Andrzej Doyle
  • 102,507
  • 33
  • 189
  • 228
  • +1, nice example. Leaving as a Set will obviously bring surprising results, so you "must" do something, creating a new collection, or other means to ensure that what-you-see-is-what-you-will-get – virtualeyes Apr 03 '12 at 11:36
  • 2
    Exact duplicate of: http://stackoverflow.com/questions/7040806/when-applying-map-to-a-set-you-sometimes-want-the-result-not-to-be-a-set-but. Though the title is chosen better here. – ziggystar Apr 03 '12 at 12:18
  • Thanks for the pointer, @ziggy. I did try to search for existing questions first, but didn't come up with anything (it's difficult to arrive at a definitive description). – Andrzej Doyle Apr 03 '12 at 14:45

6 Answers6

19

You might wish to use the scalaz foldMap for this purpose, as it works on anything for which there is a Foldable typeclass available. The usage in your case will look like this:

persons foldMap (_.age)

The signature of foldMap is as follows:

trait MA[M[_], A] {
  val value: M[A]

  def foldMap[B](f: A => B)(implicit f: Foldable[M], m: Monoid[B])
}

So; as long as you have some collection CC[A] where CC can be folded over (i.e. traversed), a function from A => B where B is a monoid, you can accumulate a result.

oxbow_lakes
  • 133,303
  • 56
  • 317
  • 449
  • I like this - Dave Griffith's answer was on the right tracks but is still quite specific. This covers every situation where a collection is mapped and then combined into a single result (via `Monoid`), which in practice is where I've seen this problem. – Andrzej Doyle Apr 03 '12 at 15:19
  • (0 /: students) { case (sum, s) => sum + s.age } – Viktor Klang Apr 03 '12 at 15:28
  • 2
    `(0 /: students)(_ + _.age)` is even shorter – oxbow_lakes Apr 03 '12 at 15:55
  • 1
    I think the issue with the standard fold is that it's less readable when used in place (e.g. as an argument to a method). It also requires a break in the flow of thinking (for me). that is, you think *I want to perform an operations on my collection*, so you want `cc.op`. I suppose `cc.foldl(0)(_ + _.age)` would do that – oxbow_lakes Apr 03 '12 at 15:57
  • 3
    Yup, you _ +_.age is shorter, but I think there's a balance to be had between terseness and readability. – Viktor Klang Apr 03 '12 at 16:28
11

As not to drag extra dependencies into the picture:

(0 /: students) { case (sum, s) => sum + s.age }
Viktor Klang
  • 26,479
  • 7
  • 51
  • 68
  • Are you suggesting not using `.map()` at all, in order to avoid my mistake? – Andrzej Doyle Apr 03 '12 at 15:45
  • Your example doesn't need map at all, using map also introduces the creating of an intermediate data structure, which in your example is unnecessary. So in this example you solve the problem by more carefully and efficiently expressing your intent. – Viktor Klang Apr 03 '12 at 15:52
  • That's a good point; most if not all of these problems arise when I immediately transform the mapped collection into something else (e.g. the `.sum` here). In which case the intermediate collection doesn't need to exist and so you might be right that avoiding `.map` altogether is a practical solution. – Andrzej Doyle Apr 03 '12 at 16:03
  • Odd that it works both with and without the `case`; I both a Function2 as well as PartialFunction are accepted. – Erik Kaplun Sep 18 '14 at 13:51
3

You can breakOut the collection type

scala> import collection.breakOut
import collection.breakOut

scala> val ages = students.map(_.age)(breakOut): List[Int]
ages: List[Int] = List(18, 18, 19)

Then you can sum as expected

Based on the update to the question, the best practice to prevent these types of errors is good unit test coverage with representative data, together with sensible API's coupled with knowledge of how the scala compiler maintains source types through map/for generators etc. If you're returning a Set of something you should make that obvious, as returning a Collection/Traversable is hiding a relevant implementation detail.

Jon Freedman
  • 9,469
  • 4
  • 39
  • 58
2

You might like to use the toIterable or toList methods to convert the set to another datastructure first. http://www.scala-lang.org/api/current/scala/collection/immutable/Set.html

(Note that toIterable may return any Iterable, although the reference implementation will not, according to the linked documentation. @Debilski informs me in the comments that it nevertheless returns a Set.)

Marcin
  • 48,559
  • 18
  • 128
  • 201
2

If you find yourself repeatedly hitting the same error, your first problem isn't the error, but rather that you're repeating yourself. map().sum is a common enough use case (particularly in data analysis contexts) to merit it's own method on Traversable. From my personal, never-go-anywhere-without-it Traversable pimp class.

  implicit def traversable2RichTraversable[A](t: Traversable[A]) = new {
///many many methods deleted

    def sumOf[C: Numeric](g: A => C): C = t.view.toList.map(g).sum

///many many more methods deleted

}

(That .view may not be necessary, but can't hurt.)

Dave Griffith
  • 20,435
  • 3
  • 55
  • 76
  • 2
    `sumBy` seems like a name that's more in line with the std lib. – ziggystar Apr 03 '12 at 13:10
  • 1
    +1, I would remove the `.view` unless there is actually an example where it helps. – huynhjl Apr 03 '12 at 13:46
  • 2
    could you share your personal, never-go-anywhere-without-it Traversable pimp class? – Adam Rabung Apr 03 '12 at 14:36
  • I use "sumBy" for my pimped traversable method that's equivalent (but more efficient than) traversable.toList.groupBy(f).mapValues(g).sum . This splits a traversable into buckets based on some function f, then sums some function g over the values in each bucket. Again, very common for data analysis (more so than sumOf, actually). – Dave Griffith Apr 03 '12 at 15:27
  • It will take some work to get that class into shape for public viewing (and get work permission to make it public), but I'll see what I can do. As an exercise, based on what you've seen so far, ponder what goodies it might contain (and then write a few of them yourself). – Dave Griffith Apr 03 '12 at 15:31
  • I believe there are at least a few classes in the standard lib where adding that ".view" means the method doesn't have to create a new collection of the same size as the original. OTOH, I'm not sure if it prevents the parallel collections from doing their magic. Will research. – Dave Griffith Apr 03 '12 at 15:34
  • `toSeq` better than `toList` since it won't unnecessarily convert IndexedSeqs into LinearSeqs – Luigi Plinge Apr 03 '12 at 23:11
1

A clumsy but possibly faster way of transforming it (as compared to an explicit toList/toSeq) would be to use collection.breakOut (more information) with a type ascription

(students map (_.age))(collection.breakOut) : Seq[Int]
Community
  • 1
  • 1
Debilski
  • 66,976
  • 12
  • 110
  • 133