I'm a scala newbie, using pyspark extensively (on DataBricks, FWIW). I'm finding that Protobuf deserialization is too slow for me in python, so I'm porting my deserialization udf to scala.
I've compiled my .proto
files to scala and then a JAR using scalapb
as described here
When I try to use these instructions to create a UDF like this:
import gnmi.gnmi._
import org.apache.spark.sql.{Dataset, DataFrame, functions => F}
import spark.implicits.StringToColumn
import scalapb.spark.ProtoSQL
// import scalapb.spark.ProtoSQL.implicits._
import scalapb.spark.Implicits._
val deserialize_proto_udf = ProtoSQL.udf { bytes: Array[Byte] => SubscribeResponse.parseFrom(bytes) }
I get the following error:
command-4409173194576223:9: error: could not find implicit value for evidence parameter of type frameless.TypedEncoder[Array[Byte]]
val deserialize_proto_udf = ProtoSQL.udf { bytes: Array[Byte] => SubscribeResponse.parseFrom(bytes) }
I've double checked that I'm importing the correct implicits, to no avail. I'm pretty fuzzy on implicits, evidence parameters and scala in general.
I would really appreciate it if someone would point me in the right direction. I don't even know how to start diagnosing!!!
Update
It seems like frameless
doesn't include an implicit encoder for Array[Byte]
???
This works:
frameless.TypedEncoder[Byte]
this does not:
frameless.TypedEncoder[Array[Byte]]
The code for frameless.TypedEncoder
seems to include a generic Array
encoder, but I'm not sure I'm reading it correctly.
@Dymtro, Thanks for the suggestion. That helped.
Does anyone have ideas about what is going on here?
Update
Ok, progress - this looks like a DataBricks issue. I think that the notebook does something like the following on startup:
import spark.implicits._
I'm using scalapb
, which requires that you don't do that
I'm hunting for a way to disable that automatic import now, or "unimport" or "shadow" those modules after they get imported.