I'm using Naive Bayes to generate predictions, and I think it worked, but the results come back as pyspark.ml.linalg.DenseVector, so a UDF is needed to extract the max value from each prediction. When I try to do this, I get an error message that I can't understand very well. I suspect the problem is something in the data, but it's hard to say, so I'd appreciate some clarification if possible.
After prediction, I inspect the final DataFrame:
final.select(['probability']).show(1,False)
That prints (after removing lots of unnecessary characters):
[0.003757082491927337,0.006619699990962011,4.3419224521402993E-4,5.372777916422481E-4,0.008699903117263496,0.00921345642395094,0.009308914882494697,0.009532505240756398,0.009574215735530826,0.009942975085100142,0.003746443859484387,0.0017932735490852503,0.19945386664486442,0.005124182867179237,0.005089059078247119,0.007687079453881513,0.0025867216196684335,0.006587085820814359,0.0046586814382427715,0.003472875972621646,0.003472875972621646,0.0019978351753593363,0.07453890301188439,0.04191544981001385,0.003472875972621646,0.006060177344013417,0.0044178878429373845,0.005477673073302646,0.0026645750909332056,0.005477673073302646,0.0033605681622028913,0.0017607165865065513,0.005164218768408131,0.03186193509859075,0.003202843488791911,0.0021707007624092475,0.0289259217515303,0.002170547482888538,0.003092523571613203,0.003717084552951576,0.003092523571613203,0.003717084552951576,0.003092523571613203,0.0025850524324637575,0.003092523571613203,0.0025180433769619766,0.004128049967861509,0.002957690684832795,0.002424681067867312,0.0032320945834738157,0.003748651440639081,0.002424681067867312,0.0027953016195932626,0.0026019591870661144,0.0026019591870661144,0.0026019591870661144,0.001609746125199844,0.0023004260882032836,0.0026019591870661144,0.002637381851293657,0.002373870061102953,0.011592795566489985,0.002140021680448195,0.014243220366617711,0.002373870061102953,0.012840130082689164,0.002373870061102953,0.002140021680448195,0.0021067973358856397,0.0021067973358856397,0.0021067973358856397,0.0017833514003735016,0.00249904614224033,0.00249904614224033,0.0021067973358856397,0.00249904614224033,0.0021067973358856397,0.0013989129872010988,0.00249904614224033,0.009686890912962319,0.0021067973358856397,0.00249904614224033,0.00249904614224033,0.0016431614915737756,0.0017960070941398095,0.022362412956746066,0.0014868338211806395,0.0016854378687085135,0.0017960070941398095,0.0016854378687085135,0.0014868338211806395,0.006741751474834078,0.0017960070941398095,0.0017960070941398095,0.
001914944241750504,0.0017960070941398095,0.0017960070941398095,0.007659776967002045,0.0017960070941398095,0.007659776967002045,0.001914944241750504,0.0014868338211806395,0.0017960070941398095,0.001914944241750504,0.0017960070941398095,0.0017960070941398095,0.0017960070941398095,0.007659776967002045,0.002042968531509297,0.005590603239186536,0.0015825800407353087,0.0017960070941398095,0.0017960070941398095,0.0015825800407353087,0.006741751474834078,0.007659776967002045,0.001164670359560697,0.002042968531509297,0.0013760166559538393,0.0014994276853441905,0.0014362081813128692,0.0037922352045941772,0.0014362081813128692,0.0014362081813128692,0.0014362081813128692,0.0014362081813128692,0.0013760166559538393,0.00121203546880268,0.0014362081813128692,0.0014362081813128692,0.0013760166559538393,0.0014362081813128692,0.0014362081813128692,0.0014362081813128692,0.004308624543938659,0.001318690925646826,0.0014994276853441905,0.0015658476142623823,0.0014362081813128692,0.0013760166559538393,0.0010269702301250521,0.0014362081813128692,0.004128049967861509,0.0014362081813128692,0.001162426909555469,0.0014362081813128692,0.0014362081813128692,0.004128049967861509,0.0015658476142623823,0.0014994276853441905,0.0013760166559538393,0.0013760166559538393,0.0013760166559538393,0.0013760166559538393,0.001318690925646826,0.001956571875734274,9.996184568961312E-4,0.0019992369137922664,9.574721208752502E-4,0.001914944241750504,0.001914944241750504,0.0010668764063125208,0.0019992369137922664,0.0020877968190164925,0.0010668764063125208,0.0010214842657546466,0.0010214842657546466,0.0010668764063125208,9.574721208752502E-4,0.0010214842657546466,0.0010214842657546466,0.0019992369137922664,0.0010214842657546466,0.0010438984095082593,0.0010214842657546466,9.782859378671353E-4,0.0010214842657546466,9.782859378671353E-4,0.0010214842657546466,9.996184568961312E-4,0.0018743257203195504,9.996184568961312E-4,0.001956571875734274,0.0010214842657546466,9.996184568961312E-4,0.0010214842657546466,9.99618456
8961312E-4,0.0019992369137922664,0.0010214842657546466,0.001914944241750504,0.0010214842657546466,0.0010214842657546466,0.0020877968190164925,0.0010214842657546466,9.782859378671353E-4,0.0010214842657546466,0.0010214842657546466,0.0010214842657546466,9.782859378671353E-4,0.0010668764063125208,0.0010214842657546466,9.574721208752502E-4,0.0010214842657546466,9.782859378671353E-4,0.0020429685315092824,0.0010668764063125208,0.0010214842657546466,9.782859378671353E-4,0.0010214842657546466,9.996184568961312E-4,9.574721208752502E-4,9.782859378671353E-4,0.0010214842657546466,0.0010214842657546466,9.782859378671353E-4,9.996184568961312E-4,0.0010214842657546466,0.0010214842657546466,0.0010214842657546466,0.001956571875734274,9.996184568961312E-4,0.0010214842657546466,0.0010214842657546466,0.0010214842657546466,0.0010214842657546466,0.0010214842657546466,0.0019992369137922664,0.0010214842657546466,9.782859378671353E-4,0.0010214842657546466,0.0019992369137922664,9.574721208752502E-4,0.0010214842657546466,9.996184568961312E-4,9.782859378671353E-4,0.0019992369137922664,9.996184568961312E-4,9.996184568961312E-4,0.0010214842657546466,9.574721208752502E-4,0.0018743257203195504,0.0010214842657546466,9.996184568961312E-4,0.0019992369137922664,0.0019992369137922664,0.0010668764063125208,9.996184568961312E-4,0.0010438984095082593,0.0019992369137922664,0.003913143751468528,0.0010214842657546466,9.996184568961312E-4,0.0010214842657546466,0.0010214842657546466,0.0018743257203195504,0.0010214842657546466,0.0010214842657546466,0.0010214842657546466]
Getting the max value with a UDF:
def max_binarizer(vector):
    max_val = float(max(vector))
    return max_val
max_bin_udf = F.udf(max_binarizer, FloatType())
Later...
final = final.withColumn("PROB", max_bin_udf(final['probability']))
final.show(1)
Error message:
Py4JJavaError: An error occurred while calling o956.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 105.0 failed 1 times, most recent failure: Lost task 0.0 in stage 105.0 (TID 6266, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 177, in main
    process()
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 172, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 104, in <lambda>
    func = lambda _, it: map(mapper, it)
  File "<string>", line 1, in <lambda>
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 71, in <lambda>
    return lambda *a: f(*a)
  File "<ipython-input-73-794830053905>", line 3, in max_binarizer
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 40, in _
    jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col)
AttributeError: 'NoneType' object has no attribute '_jvm'
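One possible reading of the last two frames: inside max_binarizer, the name max did not resolve to Python's built-in but to a wrapper defined in pyspark/sql/functions.py, and that wrapper dereferences a context object (sc) that is None on the worker. The pure-Python sketch below reproduces the same AttributeError without Spark. The names sc and the local max are stand-ins written for illustration only; they are not PySpark's actual code.

```python
import builtins

# Stand-in for the Spark context as seen from a worker process: not available.
sc = None


def max(col):  # stand-in wrapper that shadows Python's built-in max
    # Mirrors the shape of the failing line in the traceback.
    return getattr(sc._jvm.functions, "max")(col)


def max_binarizer(vector):
    # `max` here is the wrapper above, not builtins.max.
    return float(max(vector))


try:
    max_binarizer([0.2, 0.7, 0.1])
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
# The built-in is still reachable through its qualified name.
print(builtins.max([0.2, 0.7, 0.1]))
```

If this reading is right, the data itself would not be the cause; the failure would happen for any input vector.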
I checked the type of the 'vector' parameter inside the UDF: it's pyspark.ml.linalg.DenseVector.
I've tried to figure this out from similar questions. One of them suggests a conflict between Python built-in functions and PySpark functions. Unfortunately, the result is still the same! Help me, please!
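To rule the name conflict in or out, one option is to call Python's built-in through the builtins module, so that it no longer matters what the bare name max refers to in the notebook. This is only a sketch under the assumption that max was rebound to pyspark.sql.functions.max (for example by a star import); add_max_probability is a hypothetical helper name, and its column names are taken from the snippets above.

```python
import builtins


def max_binarizer(vector):
    """Return the largest entry of a probability vector as a Python float."""
    # builtins.max is always Python's own max, whatever the bare name
    # `max` was rebound to (assumption: pyspark.sql.functions.max).
    return float(builtins.max(vector))


def add_max_probability(df, vector_col="probability", out_col="PROB"):
    """Hypothetical helper: append the max of `vector_col` as `out_col`."""
    # Imported here so the pure-Python part above works without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import FloatType

    max_bin_udf = F.udf(max_binarizer, FloatType())
    return df.withColumn(out_col, max_bin_udf(df[vector_col]))
```

With a live session, usage would look like final = add_max_probability(final) followed by final.show(1). Importing the functions module under an alias (F) instead of star-importing it avoids rebinding max, sum, abs and similar built-ins in the first place.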