I have a DataFrame in this format:
df08.select('scaled').show(5, truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------+
|scaled |
+--------------------------------------------------------------------------------------------------------------------------------+
|[0.6988226555566203,2.823544098357024,4.619943929849091,14.404034479212662,0.34413748792521537,6.934078206594095,0.0] |
|[0.6988226555566203,2.823544098357024,4.619943929849091,14.404034479212662,0.34413748792521537,6.934078206594095,0.0] |
|[2.795290622226481,3.952961737699833,4.975324232145176,3.086578816974142,6.194474782653876,7.585345220473099,3.808824114743216] |
|[2.795290622226481,0.6705917233597931,4.975324232145176,3.086578816974142,8.25929971020517,7.585345220473099,3.808824114743216] |
|[2.795290622226481,0.6705917233597931,4.975324232145176,3.086578816974142,7.915162222279953,8.016330744363616,3.808824114743216]|
+--------------------------------------------------------------------------------------------------------------------------------+
I want to convert each element of the list into its own column. I cannot use explode, because explode creates a separate row for each element of the list, whereas I need all the values from a single list to stay on one row, spread across columns.
I want the output in this format:
+-----+-----+-----+------+-----+-----+-----+
|  no1|  no2|  no3|   no4|  no5|  no6|  no7|
+-----+-----+-----+------+-----+-----+-----+
|0.699|2.824| 4.62|14.404|0.344|6.934|  0.0|
|0.699|2.824| 4.62|14.404|0.344|6.934|  0.0|
|2.795|3.953|4.975| 3.087|6.194|7.585|3.809|
+-----+-----+-----+------+-----+-----+-----+
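(For reference, a pure-DataFrame transform of the kind sketched below would also produce this shape. The sketch assumes scaled is an array<double> column with 7 elements per row; if it is actually an ML Vector, it would first need vector_to_array from pyspark.ml.functions, available from Spark 3.0. The name df_wide is just for illustration.)

from pyspark.sql import functions as F

# Pull each array element into its own column and round to 3 decimal places.
# Assumes 'scaled' is an array<double> with exactly 7 elements per row.
df_wide = df08.select(
    *[F.round(F.col('scaled')[i], 3).alias('no' + str(i + 1)) for i in range(7)]
)
df_wide.show(3)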
I thought of converting the DataFrame to an RDD and then mapping the elements of the list to individual columns. Here is the code I tried:
rdd01 = df08.select('scaled').rdd
rdd01 = rdd01.map(lambda x: (round(float(x[0][0]), 3), round(float(x[0][1]), 3),
                             round(float(x[0][2]), 3), round(float(x[0][3]), 3),
                             round(float(x[0][4]), 3), round(float(x[0][5]), 3),
                             round(float(x[0][6]), 3)))
df = rdd01.toDF(["no1", "no2", "no3", "no4", "no5", "no6", "no7"])
But I get this error when I run it:
An error was encountered:
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 983.0 failed 4 times, most recent failure: Lost task 0.3 in stage 983.0 (TID 38376, ip-10-0-141-183.us-west-2.compute.internal, executor 21): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/mnt/yarn/usercache/livy/appcache/application_1650779487763_0003/container_1650779487763_0003_01_000029/pyspark.zip/pyspark/worker.py", line 605, in main
process()
File "/mnt/yarn/usercache/livy/appcache/application_1650779487763_0003/container_1650779487763_0003_01_000029/pyspark.zip/pyspark/worker.py", line 597, in process
serializer.dump_stream(out_iter, outfile)
File "/mnt/yarn/usercache/livy/appcache/application_1650779487763_0003/container_1650779487763_0003_01_000029/pyspark.zip/pyspark/serializers.py", line 271, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1440, in takeUpToNumLeft
File "/mnt/yarn/usercache/livy/appcache/application_1650779487763_0003/container_1650779487763_0003_01_000029/pyspark.zip/pyspark/util.py", line 121, in wrapper
return f(*args, **kwargs)
File "<stdin>", line 1, in <lambda>
File "/mnt/yarn/usercache/livy/appcache/application_1650779487763_0003/container_1650779487763_0003_01_000029/pyspark.zip/pyspark/sql/functions.py", line 648, in round
return Column(sc._jvm.functions.round(_to_java_column(col), scale))
AttributeError: 'NoneType' object has no attribute '_jvm'
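Looking at the traceback, the round inside my lambda is resolving to pyspark.sql.functions.round (note the pyspark/sql/functions.py frame) rather than Python's builtin round, which suggests an earlier from pyspark.sql.functions import * is shadowing the builtin. SQL functions build JVM expressions through the driver's SparkContext, and inside an RDD map on the executors that context is None, hence the AttributeError. If that is indeed the cause, forcing the builtin should be enough to make the RDD version work; here is a sketch under that assumption:

import builtins  # sidestep the shadowed pyspark.sql.functions.round

rdd01 = df08.select('scaled').rdd.map(
    lambda x: tuple(builtins.round(float(v), 3) for v in x[0])
)
df = rdd01.toDF(["no1", "no2", "no3", "no4", "no5", "no6", "no7"])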