
I am writing functions for the PySpark method RDD.aggregate, which takes the following parameters: aggregate(zeroValue, seqOp, combOp).

Can I use mutable objects for all of these parameters, without messing up the logic?

Basically for efficiency I expect calls to something like

  • zeroValue.add(other)
  • def seqOp(x1, x2): return x1.add(x2)
  • def combOp(x1, x2): return x1.combine(x2)

All of the methods will return self. This way I do not have to re-allocate objects.
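
A minimal sketch of the pattern I have in mind (the Acc class and its method names are purely illustrative):

```python
from pyspark import SparkContext

class Acc:
    """Mutable accumulation buffer: both methods mutate self and return it."""
    def __init__(self):
        self.total = 0
        self.count = 0

    def add(self, value):      # intended for seqOp: fold one record into the buffer
        self.total += value
        self.count += 1
        return self

    def combine(self, other):  # intended for combOp: merge another buffer into this one
        self.total += other.total
        self.count += other.count
        return self

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(range(10))

result = rdd.aggregate(
    Acc(),                          # zeroValue
    lambda x1, x2: x1.add(x2),      # seqOp
    lambda x1, x2: x1.combine(x2),  # combOp
)
print(result.total, result.count)   # 45 10
```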

Gere

1 Answer


Yes, you can use mutable objects as aggregation buffers for methods like fold(ByKey) or aggregate(ByKey), as is clearly stated in the docstring:

The functions op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2.

Buffers (zeroValue) are initialized once per task and are therefore safe to mutate.
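
As a rough sketch of why this is fine (using plain Python lists as buffers; each task mutates its own deserialized copy of zeroValue):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(range(100), 4)

def seq_op(buf, x):
    buf.append(x)   # mutate the first argument (the buffer) in place
    return buf

def comb_op(b1, b2):
    b1.extend(b2)   # merge the second buffer into the first and return it
    return b1

collected = rdd.aggregate([], seq_op, comb_op)
print(sorted(collected) == list(range(100)))  # True
```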

zero323
  • Are you sure about `zeroValue`? Because it is created once in my script before Spark even knows about it. If Spark needs multiple copies, it would have to deepcopy it to all workers. Is that what's going on? – Gere May 06 '16 at 15:23
  • Yes, I am sure. There is no deep copy needed because it will be serialized, passed through 2 or 3 JVMs, and deserialized in each executor process. – zero323 May 06 '16 at 15:33
  • One thing you should avoid in general, unless you really understand what is going on, is using mutable variables defined in modules. These are useful for handling objects that cannot be serialized (see http://stackoverflow.com/q/35189312/1560062) but can be unpredictable when mutated as part of the application logic (see http://stackoverflow.com/a/34510795/1560062) – zero323 May 07 '16 at 10:14
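
To illustrate that last point, a small sketch (names are made up) of the kind of surprise a module-level mutable variable can cause: mutations happen in the executor processes and are never reflected on the driver.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

hits = []                 # module-level mutable variable (the pattern to avoid)

def record(x):
    hits.append(x)        # runs in an executor process, mutating that process's copy

sc.parallelize(range(5)).foreach(record)
print(hits)               # [] -- the driver's list is untouched
```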