
I have a DataFrame in this format:

df08.select('scaled').show(5, truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------+
|scaled                                                                                                                          |
+--------------------------------------------------------------------------------------------------------------------------------+
|[0.6988226555566203,2.823544098357024,4.619943929849091,14.404034479212662,0.34413748792521537,6.934078206594095,0.0]           |
|[0.6988226555566203,2.823544098357024,4.619943929849091,14.404034479212662,0.34413748792521537,6.934078206594095,0.0]           |
|[2.795290622226481,3.952961737699833,4.975324232145176,3.086578816974142,6.194474782653876,7.585345220473099,3.808824114743216] |
|[2.795290622226481,0.6705917233597931,4.975324232145176,3.086578816974142,8.25929971020517,7.585345220473099,3.808824114743216] |
|[2.795290622226481,0.6705917233597931,4.975324232145176,3.086578816974142,7.915162222279953,8.016330744363616,3.808824114743216]|
+--------------------------------------------------------------------------------------------------------------------------------+

I want to convert each element of the list into an individual column. I cannot use explode because I want each value in the list in its own column; explode creates a separate row for each element of the list.
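
For reference, explode yields one row per element rather than one column per element (a minimal sketch, assuming scaled is an ArrayType column):

from pyspark.sql import functions as F

# Illustration only: explode turns each 7-element array into 7 separate rows,
# which is the opposite of the wide layout shown below.
df08.select(F.explode('scaled').alias('value')).show()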

I want the output in this format:

+-----+-----+-----+------+-----+-----+---+
|  no1|  no2|  no3|   no4|  no5|  no6|no7|
+-----+-----+-----+------+-----+-----+---+
|0.691|2.878|6.107|14.489|0.394|3.381|0.0|
|0.691|2.878|6.107|14.489|0.394|3.381|0.0|
|0.691|2.878|6.107|14.489|0.394|3.381|0.0|
+-----+-----+-----+------+-----+-----+---+

I thought of converting the DataFrame into an RDD and then mapping the elements of the list to individual columns. Here is the code that I tried:

rdd01 = df08.select('scaled').rdd
rdd01 = rdd01.map(lambda x: (round(float(x[0][0]), 3), round(float(x[0][1]), 3), round(float(x[0][2]), 3),
                             round(float(x[0][3]), 3), round(float(x[0][4]), 3), round(float(x[0][5]), 3),
                             round(float(x[0][6]), 3)))

df = rdd01.toDF(["no1", "no2", "no3", "no4", "no5", "no6", "no7"])

But I am getting this error when I run it:

An error was encountered:
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 983.0 failed 4 times, most recent failure: Lost task 0.3 in stage 983.0 (TID 38376, ip-10-0-141-183.us-west-2.compute.internal, executor 21): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/livy/appcache/application_1650779487763_0003/container_1650779487763_0003_01_000029/pyspark.zip/pyspark/worker.py", line 605, in main
    process()
  File "/mnt/yarn/usercache/livy/appcache/application_1650779487763_0003/container_1650779487763_0003_01_000029/pyspark.zip/pyspark/worker.py", line 597, in process
    serializer.dump_stream(out_iter, outfile)
  File "/mnt/yarn/usercache/livy/appcache/application_1650779487763_0003/container_1650779487763_0003_01_000029/pyspark.zip/pyspark/serializers.py", line 271, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1440, in takeUpToNumLeft
  File "/mnt/yarn/usercache/livy/appcache/application_1650779487763_0003/container_1650779487763_0003_01_000029/pyspark.zip/pyspark/util.py", line 121, in wrapper
    return f(*args, **kwargs)
  File "<stdin>", line 1, in <lambda>
  File "/mnt/yarn/usercache/livy/appcache/application_1650779487763_0003/container_1650779487763_0003_01_000029/pyspark.zip/pyspark/sql/functions.py", line 648, in round
    return Column(sc._jvm.functions.round(_to_java_column(col), scale))
AttributeError: 'NoneType' object has no attribute '_jvm'
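
From the traceback, round inside the lambda seems to be resolving to pyspark.sql.functions.round (functions.py, line 648) rather than Python's built-in round, most likely because of a wildcard import from pyspark.sql.functions. That function needs the driver's SparkContext to build a JVM expression, and there is no active SparkContext inside an executor, so sc is None and ._jvm fails. A minimal sketch of the same RDD approach that calls the built-in round explicitly (hypothetical fix, assuming the shadowed name is the cause):

import builtins  # bypass the round name shadowed by pyspark.sql.functions

rdd01 = df08.select('scaled').rdd
rdd01 = rdd01.map(lambda x: tuple(builtins.round(float(v), 3) for v in x[0]))
df = rdd01.toDF(["no1", "no2", "no3", "no4", "no5", "no6", "no7"])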
  • Does this answer your question? [Dataframe explode list columns in multiple rows](https://stackoverflow.com/questions/71441664/dataframe-explode-list-columns-in-multiple-rows) – crissal Apr 24 '22 at 06:43
  • OP wanted to select **multiple columns**, not multiple rows – pltc Apr 24 '22 at 17:21

1 Answer


Assuming all the arrays have the same length, you can select the elements with a list comprehension like this:

from pyspark.sql import functions as F

df = spark.createDataFrame([
    (1, [10, 20]),
    (2, [30, 40]),
    (3, [40, 50]),
], ['id', 'arr'])

# get the array length from the first row
n = len(df.first()['arr'])

df.select([F.col('arr')[i].alias(f'arr_{i}') for i in range(n)]).show()
+-----+-----+
|arr_0|arr_1|
+-----+-----+
|   10|   20|
|   30|   40|
|   40|   50|
+-----+-----+
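
Applied to the DataFrame in the question (a sketch assuming scaled is an ArrayType column with 7 elements per row and that the values should be rounded to 3 decimals), the same pattern combined with F.round gives the no1 ... no7 layout:

from pyspark.sql import functions as F

# number of elements per array, taken from the first row (7 in the sample above)
n = len(df08.first()['scaled'])

df08.select(
    [F.round(F.col('scaled')[i], 3).alias(f'no{i + 1}') for i in range(n)]
).show(5)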