
I want to apply an MD5 function to an RDD[(String, Array[Double])], but I get a NullPointerException. I found a similar question on Stack Overflow: calling distinct and map together throws an NPE in the Spark library.

My code:

    import java.security.MessageDigest

    def md5(s: String) = {
      MessageDigest.getInstance("MD5").digest(s.getBytes)
        .map("%02x".format(_)).mkString.substring(0, 8)
    }

    val rdd = sc.makeRDD(Array(1, 8, 6, 4, 9, 3, 76, 4)) //.collect().foreach(println)
    val rdd2 = rdd.map(r => (r + "s", Array(1.0, 2.0)))

    rdd2.map {
      case (a, b) => (md5(a) + "_" + a, b)
    }.foreach(println)

In local mode it works fine, but in cluster mode it fails with:

 java.lang.NullPointerException

Is there another way to do this? Thanks :)

error:

Exception in thread "main" java.lang.NullPointerException                       
    at no1.no1$.no1$no1$$md5$1(no1.scala:139)
    at no1.no1$$anonfun$8.apply(no1.scala:143)
    at no1.no1$$anonfun$8.apply(no1.scala:141)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
    at no1.no1$.main(no1.scala:141)
    at no1.no1.main(no1.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

The code above is only an example, but it looks correct to me. I am confused.

WicleQian
    Question you've linked is not relevant here. Could you provide __full__ traceback? `NullPointerException` without context is not very meaningful. Moreover this code seem to work just fine both in local mode and on cluster. – zero323 Jan 06 '16 at 19:35
  • the reason maybe is my hbase's conf SCAN_COLUMNS has some problems... – WicleQian Jan 07 '16 at 12:55

1 Answer


I see no way for the RDD to provide a null string to your MD5 function, and the failure is clearly inside it:

java.lang.NullPointerException
at no1.no1$.no1$no1$$md5$1(no1.scala:139) <-- here!

My money would be on the static call MessageDigest.getInstance("MD5") returning null on the executors, or on the .digest call. Check the conditions under which that can happen; maybe the inputs you are testing locally do not contain the failure cases.
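If getInstance can indeed come back null in your environment, one workaround (an illustrative sketch, not the original code; the object name Md5Util and the error message are hypothetical) is to create the digest defensively inside the function and fail with a clear error instead of an opaque NPE inside a Spark task:

```scala
import java.security.MessageDigest

object Md5Util {
  // Create the MessageDigest inside the function so each executor/task
  // gets its own instance: MessageDigest is stateful and not thread-safe,
  // so it must never be shared across Spark tasks.
  def md5(s: String): String = {
    val digest = Option(MessageDigest.getInstance("MD5"))
      .getOrElse(throw new IllegalStateException("MD5 MessageDigest unavailable on this JVM"))
    digest.digest(s.getBytes("UTF-8"))
      .map("%02x".format(_))
      .mkString
      .substring(0, 8)
  }
}
```

If constructing a digest per record turns out to be too slow, mapPartitions can be used to build one instance per partition instead of one per element, which keeps the same per-task isolation.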

– Daniel Langdon