16

I'm on my second evening of scala, and I'm resisting the urge to write things in scala how I used to do them in java and trying to learn all of the idioms. In this case I'm looking to just compute an average using such things as closures, mapping, and perhaps list comprehension. Irrespective of whether this is the best way to compute an average, I just want to know how to do these things in scala for learning purposes only

Here's an example: the average method below is left pretty much unimplemented. I've got a couple of other methods for looking up the rating an individual userid gave that uses the find method of TraversableLike (I think), but nothing more that is scala specific, really. How would I compute an average given a List[RatingEvent] where RatingEvent.rating is a double value that I'd to compute an average of across all values of that List in a scala-like manner?.

package com.brinksys.liftnex.model

class Movie(val id : Int, val ratingEvents : List[RatingEvent]) {

    def getRatingByUser(userId : Int) : Int =  {
        return getRatingEventByUserId(userId).rating
    }

    def getRatingEventByUserId(userId : Int) : RatingEvent = {
        var result = ratingEvents find {e => e.userId == userId }
        return result.get
    }

    def average() : Double = {
        /* 
         fill in the blanks where an average of all ratingEvent.rating values is expected
        */
       return 3.8
    }
}

How would a seasoned scala pro fill in that method and use the features of scala to make it as concise as possible? I know how I would do it in java, which is what I want to avoid.

If I were doing it in python, I assume the most pythonic way would be:

sum([re.rating. for re in ratingEvents]) / len(ratingEvents)

or if I were forcing myself to use a closure (which is something I at least want to learn in scala):

reduce(lambda x, y : x + y, [re.rating for re in ratingEvents]) / len(ratingEvents)

It's the usage of these types of things I want to learn in scala.

Your suggestions? Any pointers to good tutorials/reference material relevant to this are welcome :D

Ken Williams
  • 22,756
  • 10
  • 85
  • 147
whaley
  • 16,075
  • 10
  • 57
  • 68

5 Answers5

37

If you're going to be doing math on things, using List is not always the fastest way to go because List has no idea how long it is--so ratingEvents.length takes time proportional to the length. (Not very much time, granted, but it does have to traverse the whole list to tell.) But if you're mostly manipulating data structures and only occasionally need to compute a sum or whatever, so it's not the time-critical core of your code, then using List is dandy.

Anyway, the canonical way to do it would be with a fold to compute the sum:

(0.0 /: ratingEvents){_ + _.rating} / ratingEvents.length

// Equivalently, though more verbosely:
// ratingEvents.foldLeft(0.0)(_ + _.rating) / ratingEvents.length

or by mapping and then summing (2.8 only):

ratingEvents.map(_.rating).sum / ratingEvents.length

For more information on maps and folds, see this question on that topic.

Community
  • 1
  • 1
Rex Kerr
  • 166,841
  • 26
  • 322
  • 407
  • Good answer, thanks! I had to do some further looking up to grok what :/ is (operator for something like foldLeft on certain types) and I had to look at the documentation on foldLeft to figure out why you had it written as (_ + _.rating), which looked totally unintuitive until I peeked at the foldLeft docs text for what it returns. The 2.8 example was a bit easier to get at the first glance. Also, what would you suggest as an alternative to List for heavy math work? I only picked List out of habit, really as I'm still just learning. – whaley Aug 17 '10 at 17:58
  • @whaley: I'd use `Array` for genuinely math-heavy work--and, sadly, I'd mostly loop through it using `var sum,i=0; while (i – Rex Kerr Aug 17 '10 at 18:46
  • @RexKerr good info, thanks. I did some benchmarking on this with jmh using 9,000 elements as that is my use case and found: Array is 10-60% faster than List (4300 vs 2700 ops/sec) foldLeft is about 10-25% faster than sum for both (5300 vs 4300 and 3000 vs 2700 ops/sec) Also tried Vector for completeness: (3600 sum or 5200 foldLeft) With 1,000,000 elements the difference is: Array vs List: 34 vs 24 ops/sec Array foldLeft vs sum: 40 vs 34 ops/sec List foldLeft vs sum: 23 vs 24 ops/sec Vector sum: 28 ops/sec – danio May 17 '19 at 11:02
  • @danio - Note that these are all ridiculously slow for math on primitive numbers (Int, Double, etc). A modern computer should be able to do way more than 40M additions per second. Use a while loop if you actually want to go fast. – Rex Kerr May 23 '19 at 18:32
12

You might calculate sum and length in one go, but I doubt that this helps except for very long lists. It would look like this:

val (s,l) = ratingEvents.foldLeft((0.0, 0))((t, r)=>(t._1 + r.rating, t._2 + 1)) 
val avg = s / l

I think for this example Rex' solution is much better, but in other use cases the "fold-over-tuple-trick" can be essential.

Landei
  • 54,104
  • 13
  • 100
  • 195
  • I'd probably resort to using a `var len=0` that was incremented by the closure before tupling this way. A second traversal of a list element is cheaper than the (transient) creation of a new tuple every element. But it is good to see this pattern; there are cases where it is a great strategy (at least for compact and powerful code; it never has superlative performance, but most of the time when you use it you wouldn't need that performance). – Rex Kerr Aug 17 '10 at 18:49
10

Since mean and other descriptive statistics like standard deviation or median are needed in different contexts, you could also use a small reusable implicit helper class to allow for more streamlined chained commands:

  implicit class ImplDoubleVecUtils(values: Seq[Double]) {

    def mean = values.sum / values.length
  }

  val meanRating = ratingEvents.map(_.rating).mean

It even seems to be possible to write this in a generic fashion for all number types.

Community
  • 1
  • 1
Holger Brandl
  • 10,634
  • 3
  • 64
  • 63
2

Tail recursive solution can achieve both single traversal and avoid high memory allocation rates

def tailrec(input: List[RatingEvent]): Double = {
  @annotation.tailrec def go(next: List[RatingEvent], sum: Double, count: Int): Double = {
    next match {
      case Nil => sum / count
      case h :: t => go(t, sum + h.rating, count + 1)
    }
  }
  go(input, 0.0, 0)
}

Here are jmh measurements of approaches from above answers on a list of million elements:

[info] Benchmark                                Mode              Score     Units
[info] Mean.foldLeft                            avgt              0.007     s/op
[info] Mean.foldLeft:·gc.alloc.rate             avgt           4217.549     MB/sec
[info] Mean.foldLeft:·gc.alloc.rate.norm        avgt       32000064.281     B/op
...
[info] Mean.mapAndSum                           avgt              0.039     s/op
[info] Mean.mapAndSum:·gc.alloc.rate            avgt           1690.077     MB/sec
[info] Mean.mapAndSum:·gc.alloc.rate.norm       avgt       72000009.575     B/op
...
[info] Mean.tailrec                             avgt              0.004     s/op
[info] Mean.tailrec:·gc.alloc.rate              avgt             ≈ 10⁻⁴     MB/sec
[info] Mean.tailrec:·gc.alloc.rate.norm         avgt              0.196     B/op

Mario Galic
  • 47,285
  • 6
  • 56
  • 98
1

I can suggest 2 ways:

def average(x: Array[Double]): Double = x.foldLeft(0.0)(_ + _) / x.length

def average(x: Array[Double]): Double = x.sum / x.length

Both are fine, but in 1 case when using fold you can not only make "+" operation, but as well replace it with other (- or * for example)

vindev
  • 2,240
  • 2
  • 13
  • 20
Artem Vlasenko
  • 115
  • 1
  • 1
  • 7