0

I have a case where I'm going through a large amount of data and building up several maps. The result of my function will be Maps:

case class Maps(map1: Map[String, String], map2: Map[String, String])

I'm trying to decide whether to do this using a functional style, or the "old-fashioned" way of building up mutable maps. The latter would look roughly like

type MutableMap = scala.collection.mutable.Map[String, String]
val MutableMap = scala.collection.mutable.Map

def buildMaps(input: Something): Maps = {
    var map1: MutableMap = MutableMap()
    var map2: MutableMap = MutableMap()
    input.getAnIterator.foreach(x => {
        map1 += (key1(x) -> val1(x))
        map2 += (key2(x) -> val2(x))
    }
    Maps(map1.toMap, map2.toMap)
}

The functional alternative that I can see is something like

def addToMaps(maps: Maps, x: SomeElement): Maps =
    Maps(maps.map1 + (key1(x) -> val1(x)), maps.map2 + (key2(x) -> val2(x)))

def buildMaps(input: Something): Maps = 
    input.getAnIterator.foldLeft(Maps(Map(), Map()))(addToMaps)

[My syntax might not be exactly correct, but hopefully this gives the gist of what I'm trying to do.]

The second way seems a lot more "elegant"; but if it's implemented by making repeated copies of immutable maps, it won't be feasible (I expect input to be quite large).

Is Scala able to optimize the second solution so that its performance is comparable to the first? Is there another approach that I'm missing? Or should I just stick with the non-functional approach?

ajb
  • 31,309
  • 3
  • 58
  • 84

5 Answers5

1

You can also do .toMap on a collection of 2-element tuples. Something like:

def buildMaps(input: Something): Maps = {
  val m1 = input.getAnIterator.map(x => key1(x) -> val1(x)).toMap
  val m2 = input.getAnIterator.map(x => key2(x) -> val2(x)).toMap
  Maps(m1, m2)
}

(assuming getAnIterator returns a scala iterator or some scala collection)

Joe K
  • 18,204
  • 2
  • 36
  • 58
  • 1
    This goes through the whole input twice, though (unless Scala can figure out to parallelize the two iterations?). That's something I'd like to avoid since the input is large. (Plus it creates intermediate data structures which will also be large--see https://stackoverflow.com/questions/5582862/.) – ajb Jan 24 '18 at 22:22
  • @ajb if you are concerned about the intermediate data structures generated by `map`, you could eliminate them using `breakOut`. Example: import scala.collection.breakOut; val xs = 0 to 10; val map: Map[Int, Int] = xs.map(x => (x, x * x))(breakOut); println(map) – Andrey Tyukin Jan 24 '18 at 22:54
1

I would say that your first implementation of the method is perfectly fine. It also has the advantage that it uses the iterator only once, and does not traverse the entire collection twice. After all, this is what the "default ambient machine-state monad" in Scala is best suited for: modifying mutable data structures. Where else would you use mutable variables, if not in this case? The default implementations of the functional collection operations (like map, filter etc.) use mutable Builders under the hood anyway.

I'd like to quote Odersky himself: https://www.youtube.com/watch?v=iPitDNUNyR0&t=34m1s . This is ScalaDays2013 key note. Around ~30 minutes mark, Odersky offers his opinion on using mutable variables in small local scope. I think his point is: if the mutable version is faster, clearer, and no mutable state can escape from the method, then it's fine to use mutable local variables.

I doubt that Scala would automatically optimize the second solution into the first, and I actually suspect that the first solution could be a tiny bit faster. However, you should profile it, and only then decide whether this piece of code is worth optimizing at all.

Andrey Tyukin
  • 43,673
  • 4
  • 57
  • 93
1

As you can see here, Scala's immutable HashMap implementation does insertions in effectively constant time:

enter image description here

So - while your concern about performance is not unrealistic, given this information we can conclude both versions would probably have comparable performance. And if that's the case - I'd definitely choose the safer and more concise functional style.

If you're unsure about which implementation is actually used under the hood when you use Map(), you can specifically instantiate a new HashMap[String, String]() to make sure this is the implementation used.

Tzach Zohar
  • 37,442
  • 3
  • 79
  • 85
  • It's not insertion I'm worried about, it's whether maps have to be copied to implement `+` in the second code snippet. – ajb Jan 24 '18 at 22:25
  • That's what I mean by "insertion", perhaps I should have used "addition". The fact that immutable `HashMap`'s "add" operation (which is exactly `+`) is effectively-constant means the data isn't copied (otherwise it couldn't have been constant) – Tzach Zohar Jan 24 '18 at 22:30
  • I don't understand. If `m` is an immutable map, and you say `val a = m; val b = m + (newKey -> newValue)`, how can `a` and `b` both be immutable maps with different values unless it makes a copy of `m`? – ajb Jan 24 '18 at 23:20
  • Since `a` is immutable, `b` can simply hold a _reference_ to `a` and just "add" its own key-value pairs on top of `a`. `a` won't change so they both can "live" side-by-side while sharing the same objects in memory. – Tzach Zohar Jan 24 '18 at 23:24
  • Still doesn't make sense. If the `+` operator on maps returns a map that has a reference to the previous map, and you use this 100 times, then you have a chain of 100 maps that have to be searched when you do a lookup. So how is this consistent with a `HashMap` that works in constant time? I can think of a possible implementation assuming the hash buckets are linked lists and the earlier maps can point to the middle of those lists, but immutable maps also have a remove operator--then what? – ajb Jan 24 '18 at 23:33
  • If you can point me to some info about how this is implemented, I'd like to see it. Otherwise I have trouble believing you. – ajb Jan 24 '18 at 23:34
0

The "idiomatic" answer could be a third one, unless you have any kind of concurrency concerns, but it sounds like we could leave those out. Scala's internal collection abstracts building collections into something called scala.collection.CanBuildFrom, which gives you a mutable API with an immutable result() method.

def buildMaps(input: Something): Maps = {
  val builder = Map.newBuilder[A, B]
  val m1 = input.getAnIterator.foreach(i => builder += i -> i)
  Maps(builder.result())
}

Basically this is the pattern used under the hood anyway, so it's using the lowest level API.

flavian
  • 28,161
  • 11
  • 65
  • 105
  • How does it help to build _two_ maps with two disjoint key sets? You could do this, but you'd have to define your own Builder[InputElemType, (Map[], Map[])], which in turn would again require the solution to the same problem, just wrapped into the Builder interface? – Andrey Tyukin Jan 24 '18 at 22:28
0

I would be inclined to taking the functional approach given that immutable Map (particularly HashMap) is by design efficient in lookup/add/remove, as pointed out by others.

As to the concern re: cost for copying immutable Maps, my understanding is that internally these Maps aren't literally copied. If you take a look at the source code of HashMap, you'll notice it's implemented using HashTrieMap. One important characteristic of a hash trie data structure is that when updating, only the path from the root to the leaf that stores the key gets rewritten. The rest of the trie remains unchanged. Here's a document about hash trie.

Leo C
  • 22,006
  • 3
  • 26
  • 39