1

I am getting Null pointer exception when broadcasting a Dataframe and trying to access them in a Spark UDF.

UDF definition-

def test_udf(parm1: String,  parm2: String,  paarm3: String,  ) = {
println ("Inside UDF ")           
B.value.take(1).foreach { println }
println("after print") 

..... ....... }

> sqlContext.udf.register("test_udf", test_udf _)

Broadcasting-

val B = sc.broadcast(sqlContext.sql("""Select * FROM table_a where col1='10102'""")) // Returns almost 20 MB data

Accessing UDF-

val df = sqlContext.sql("SELECT test_udf(parm1,parm2,parm3) AS test FROM table_b").take(1)

After this line i am getting null pointer exception in UDF at below line B.value.take(1).foreach { println }

I am suspecting that Broadcast is not happening correctly. Is it something wrong in this code? Using Spark 1.6.1

S. K
  • 495
  • 2
  • 7
  • 14

1 Answers1

2

You get an exception because it is not a valid Spark program:

  • broadcasting DataFrame object is not a meaningful operation. This is why we have broadcast join hints.
  • Spark doesn't support nested operations on distributed data structure. In other words you cannot access DataFrame inside an UDF.
Community
  • 1
  • 1
zero323
  • 322,348
  • 103
  • 959
  • 935
  • Thanks for your comment. Join is not possible because keys are not in table_b for join. I need to do a lookup through table_a for pulling right value. Is there a way to achieve that? – S. K Jul 07 '16 at 15:39
  • Unless it is local structure (collected rows for example) it is not possible without joining data. – zero323 Jul 07 '16 at 16:10
  • Thank you. Your both points are very right. Learnt it the hard way. – S. K Jul 14 '16 at 17:32
  • 1
    Thanks. Post helped me to actually fix legacy code. – dinesh028 Aug 26 '17 at 11:34