14

I read that the Kryo serializer can provide faster serialization when used in Apache Spark. However, I'm using Spark through Python.

Do I still get notable benefits from switching to the Kryo serializer?

zero323
Gere

2 Answers

18

Kryo won't make a major impact on PySpark, because it just stores data as byte[] objects, which are fast to serialize even with Java serialization.

But it may be worth a try: you would just set the spark.serializer configuration and not register any classes.
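
A minimal sketch of what that looks like from PySpark (the configuration key is all that changes; there is nothing to register on the Python side):

```python
from pyspark import SparkConf, SparkContext

# Switch the JVM-side serializer to Kryo.
conf = SparkConf().set("spark.serializer",
                       "org.apache.spark.serializer.KryoSerializer")
sc = SparkContext(conf=conf)
```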

What might make more of an impact is storing your data as MEMORY_ONLY_SER and enabling spark.rdd.compress, which will compress your data.

In Java this can add some CPU overhead, but Python runs quite a bit slower, so it might not matter. It might also speed up computation by reducing GC or letting you cache more data.
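
As a rough sketch of that suggestion, assuming a fresh SparkContext (note that in PySpark, cached objects are always pickled, so the plain MEMORY_ONLY level already stores serialized bytes; older releases also exposed an explicit MEMORY_ONLY_SER):

```python
from pyspark import SparkConf, SparkContext, StorageLevel

# Compress serialized cached partitions.
conf = SparkConf().set("spark.rdd.compress", "true")
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(10 ** 6))
# PySpark always pickles cached data, so MEMORY_ONLY is effectively a
# serialized storage level here.
rdd.persist(StorageLevel.MEMORY_ONLY)
rdd.count()  # materialize the cache
```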

Reference: Matei Zaharia's answer on the mailing list.

eliasah
  • Wow, a detailed answer so fast! Thanks. Is the part from "What might make..." referring to the serializer, or an independent suggestion for optimization? – Gere Mar 29 '16 at 08:23
  • It's more of a suggestion for optimization, since Kryo won't have an impact on PySpark. I suggest testing it first. I don't use PySpark extensively enough to test it, and performance may depend on lots of things: configuration, use cases, network, etc. – eliasah Mar 29 '16 at 08:26
  • OK, even though it won't make an impact, is it still possible to use Kryo with PySpark, and if yes then how? @eliasah – Anna Leonenko Jul 21 '20 at 00:11
  • Did anyone try this? What do you mean by `you would just set the spark.serializer`, just set an empty value? If there is a value to set, what might it be? – bicepjai Oct 17 '20 at 00:14
  • @bicepjai: User [Saikat](https://stackoverflow.com/users/11980605/saikat) suggests that you can accomplish this using `conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")`, per [the documentation](https://spark.apache.org/docs/latest/tuning.html). Saikat doesn't yet have the reputation to leave this as a comment, so instead left it as an answer. Since that answer will likely get deleted, I'm reiterating their guidance here. – Jeremy Caney Oct 19 '20 at 20:09
  • Thanks. I use Spark 2.4; can we use Kryo in PySpark? I read somewhere that PySpark can't use Kryo, but I can't find the reference again. – bicepjai Oct 20 '20 at 19:01
  • @eliasah By saying that `PySpark` stores data as `bytes[]`, do you mean the bytes produced by the Python serializer, i.e. Pickle or Marshal? – Ofek Hod Dec 06 '21 at 16:20
7

It all depends on what you mean when you say PySpark. In the last two years, PySpark development, the same as Spark development in general, has shifted from the low-level RDD API towards high-level APIs like DataFrame or ML.

These APIs are natively implemented on the JVM, and the Python code is mostly limited to a series of RPC calls executed on the driver. Everything else is pretty much the same code as is executed using Scala or Java, so it should benefit from Kryo in the same way as native applications.
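
For instance, in a sketch like the following (assuming Spark 2.x+, where SparkSession is available), the Python side only builds the query plan over RPC, while the aggregation, and any shuffle serialization Kryo would handle, runs natively on the JVM executors:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .config("spark.serializer",
                 "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

# Python issues RPC calls to define this plan; execution happens on the JVM.
df = spark.range(10 ** 6).withColumn("bucket", F.col("id") % 10)
df.groupBy("bucket").count().show()
```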

I would argue that, at the end of the day, there is not much to lose when you use Kryo with PySpark, and potentially something to gain when your application depends heavily on the "native" APIs.

anmol
zero323