13

I'm getting this error

Can't pickle <class 'google.protobuf.pyext._message.CMessage'>: it's not found as google.protobuf.pyext._message.CMessage

when I try to create a UDF in PySpark. Apparently, it uses CloudPickle to serialize the command however, I'm aware that protobuf messages contains C++ implementations, meaning it cannot be pickled.

I've tried finding a way to perhaps override CloudPickleSerializer, however, I wasn't able to find a way.

Here's my example code:

from MyProject.Proto import MyProtoMessage
from google.protobuf.json_format import MessageToJson
import pyspark.sql.functions as F

def proto_deserialize(body):
  msg = MyProtoMessage()
  msg.ParseFromString(body)
  return MessageToJson(msg)

from_proto = F.udf(lambda s: proto_deserialize(s))

base.withColumn("content", from_proto(F.col("body")))

Thanks in advance.

Marc Vitalis
  • 2,129
  • 4
  • 24
  • 36
  • This seems to work fine with pyspark 3.2.1 and protoc 3.19.4. – bzu Jul 28 '22 at 18:09
  • im getting the same error. PicklingError: Could not serialize object: TypeError: cannot pickle 'google._upb._message.Descriptor' object. pyspark 3.3.0 i am wondering if it has something to do with converting from a C# proto file with BCL in it – Anton Oct 24 '22 at 07:51

0 Answers0