I want to add a new column to my existing dataframe. Below is my dataframe:

+---+---+-----+
| x1| x2|   x3|
+---+---+-----+
|  1|  a| 23.0|
|  3|  B|-23.0|
+---+---+-----+

I am able to add a constant column with df = df.withColumn("x4", lit(0)), like this:

+---+---+-----+---+
| x1| x2|   x3| x4|
+---+---+-----+---+
|  1|  a| 23.0|  0|
|  3|  B|-23.0|  0|
+---+---+-----+---+

But I want to add an array to my df.

Suppose [0,0,0,0] is the array to add; after adding it, my df should look like this:

+---+---+-----+---------+
| x1| x2|   x3|       x4|
+---+---+-----+---------+
|  1|  a| 23.0|[0,0,0,0]|
|  3|  B|-23.0|[0,0,0,0]|
+---+---+-----+---------+

I tried like this -

array_list = [0,0,0,0]
df = df.withColumn("x4", lit(array_list))

But it gives this error:

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.lit.
: java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [0, 0, 0, 0, 0, 0]

Does anybody know how to do this?

Alper t. Turker
abhjt
  • Perhaps `df.withColumn("some_array", array(lit(0), lit(0), lit(0), lit(0)))`? [src](https://stackoverflow.com/a/32788650/3433323) – mkaran Jan 22 '18 at 10:18
  • But what if I have to add a different value to each row? It is not a permanent solution. – abhjt Jan 22 '18 at 10:21
  • If you need a different value to a different row then you possibly need to use a `udf`. – mkaran Jan 22 '18 at 10:23
  • Another thought is to use `when`: `df.withColumn('some_array', when(df.some_column == 1, array(lit(0), lit(0), lit(0), lit(0))).otherwise(array(lit(1), lit(1), lit(1), lit(1))))` – mkaran Jan 22 '18 at 10:49
  • What does your array depend on? – Steven Jan 22 '18 at 10:58
  • My array is variable and I have to add it to multiple places with different value. This approach is fine for adding either same value or for adding one or two arrays. It will not suit for adding huge data like some 1000 rows. – abhjt Jan 22 '18 at 11:06

1 Answer

Based on your comment

My array is variable and I have to add it to multiple places with different value. This approach is fine for adding either same value or for adding one or two arrays. It will not suit for adding huge data

I believe this is an XY problem. If you want a scalable solution (1000 rows is not huge, to be honest), put the arrays in another dataframe and join. For example, if you want to connect by x1:

# Lookup dataframe: one array per x1 key
arrays = spark.createDataFrame([
    (1, [0.0, 0.0, 0.0]), (3, [0.0, 0.0, 0.0])
], ("x1", "x4"))

# Join on x1 to attach the array column x4 to df
df.join(arrays, ["x1"])

Add a more complex join condition depending on your requirements.

To solve your immediate problem, see "How to add a constant column in a Spark DataFrame?". All elements of the array should be columns:

from pyspark.sql.functions import array, lit

# Each element must be a Column, so wrap every value in lit()
array(lit(0.0), lit(0.0), lit(0.0))
#  Column<b'array(0.0, 0.0, 0.0)'>
Alper t. Turker
  • Fine, got the point. But one more question: what if I want to add different values to each row, like this?

    +---+---+-----+---------+
    | x1| x2|   x3|       x4|
    +---+---+-----+---------+
    |  1|  a| 23.0|[0,1,2,3]|
    |  3|  B|-23.0|[4,5,0,7]|
    |  4|  C|-23.0|[8,0,1,0]|
    +---+---+-----+---------+

    – abhjt Jan 23 '18 at 06:14