4

The code below works in Scala Spark.

scala> val ar = Array("oracle", "java")
ar: Array[String] = Array(oracle, java)

scala> df.withColumn("tags", lit(ar)).show(false)
+------+---+----------+----------+--------------+
|name  |age|role      |experience|tags          |
+------+---+----------+----------+--------------+
|John  |25 |Developer |2.56      |[oracle, java]|
|Scott |30 |Tester    |5.2       |[oracle, java]|
|Jim   |28 |DBA       |3.0       |[oracle, java]|
|Mike  |35 |Consultant|10.0      |[oracle, java]|
|Daniel|26 |Developer |3.2       |[oracle, java]|
|Paul  |29 |Tester    |3.6       |[oracle, java]|
|Peter |30 |Developer |6.5       |[oracle, java]|
+------+---+----------+----------+--------------+

How do I get the same behavior in PySpark? I tried the code below, but it doesn't work and throws a Java error.

from pyspark.sql.types import *
from pyspark.sql.functions import lit

tag = ["oracle", "java"]
df2.withColumn("tags", lit(tag)).show()

: java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [oracle, java]

ZygD
stack0114106
  • Does this answer your question? [How to add a constant column in a Spark DataFrame?](https://stackoverflow.com/questions/32788322/how-to-add-a-constant-column-in-a-spark-dataframe) – blackbishop Dec 30 '19 at 14:18
  • @blackbishop no that explains only scala.. not pyspark – stack0114106 Dec 30 '19 at 14:31
  • In pyspark, you should be using `tag = [lit("oracle"), lit("java")] df2.withColumn("tags", array(*tag)).show()` as explained in the accepted answer – blackbishop Dec 30 '19 at 14:41
  • Does this answer your question? [Combine PySpark DataFrame ArrayType fields into single ArrayType field](https://stackoverflow.com/questions/37284077/combine-pyspark-dataframe-arraytype-fields-into-single-arraytype-field) – thebluephantom Dec 30 '19 at 14:48

4 Answers

5

You can import `array` from the `functions` module:

>>> from pyspark.sql.functions import array, lit

>>> tag = array(lit("oracle"), lit("java"))
>>> df2.withColumn("tags", tag).show()

Tested below:

>>> from pyspark.sql.functions import array, lit

>>> tag = array(lit("oracle"), lit("java"))
>>>
>>> ranked.withColumn("tag", tag).show()
+------+--------------+----------+-----+----+----+--------------+               
|gender|    ethinicity|first_name|count|rank|year|           tag|
+------+--------------+----------+-----+----+----+--------------+
|  MALE|      HISPANIC|    JAYDEN|  364|   1|2012|[oracle, java]|
|  MALE|WHITE NON HISP|    JOSEPH|  300|   2|2012|[oracle, java]|
|  MALE|WHITE NON HISP|    JOSEPH|  300|   2|2012|[oracle, java]|
|  MALE|      HISPANIC|     JACOB|  293|   4|2012|[oracle, java]|
|  MALE|      HISPANIC|     JACOB|  293|   4|2012|[oracle, java]|
|  MALE|WHITE NON HISP|     DAVID|  289|   6|2012|[oracle, java]|
|  MALE|WHITE NON HISP|     DAVID|  289|   6|2012|[oracle, java]|
|  MALE|      HISPANIC|   MATTHEW|  279|   8|2012|[oracle, java]|
|  MALE|      HISPANIC|   MATTHEW|  279|   8|2012|[oracle, java]|
|  MALE|      HISPANIC|     ETHAN|  254|  10|2012|[oracle, java]|
|  MALE|      HISPANIC|     ETHAN|  254|  10|2012|[oracle, java]|
|  MALE|WHITE NON HISP|   MICHAEL|  245|  12|2012|[oracle, java]|
|  MALE|WHITE NON HISP|   MICHAEL|  245|  12|2012|[oracle, java]|
|  MALE|WHITE NON HISP|     JACOB|  242|  14|2012|[oracle, java]|
|  MALE|WHITE NON HISP|     JACOB|  242|  14|2012|[oracle, java]|
|  MALE|WHITE NON HISP|     MOSHE|  238|  16|2012|[oracle, java]|
|  MALE|WHITE NON HISP|     MOSHE|  238|  16|2012|[oracle, java]|
|  MALE|      HISPANIC|     ANGEL|  236|  18|2012|[oracle, java]|
|  MALE|      HISPANIC|     AIDEN|  235|  19|2012|[oracle, java]|
|  MALE|WHITE NON HISP|    DANIEL|  232|  20|2012|[oracle, java]|
+------+--------------+----------+-----+----+----+--------------+
only showing top 20 rows
Strick
3

I found the list comprehension below to work:

>>> from pyspark.sql.functions import array, lit

>>> arr = ["oracle", "java"]
>>> mp = [lit(x) for x in arr]
>>> df.withColumn("mk", array(mp)).show()
+------+---+----------+----------+--------------+
|  name|age|      role|experience|            mk|
+------+---+----------+----------+--------------+
|  John| 25| Developer|      2.56|[oracle, java]|
| Scott| 30|    Tester|       5.2|[oracle, java]|
|   Jim| 28|       DBA|       3.0|[oracle, java]|
|  Mike| 35|Consultant|      10.0|[oracle, java]|
|Daniel| 26| Developer|       3.2|[oracle, java]|
|  Paul| 29|    Tester|       3.6|[oracle, java]|
| Peter| 30| Developer|       6.5|[oracle, java]|
+------+---+----------+----------+--------------+

stack0114106
0

There is a difference between the `ar` declared in Scala and the `tag` declared in Python: `ar` is an `Array`, while `tag` is a Python `list`, and `lit` does not accept a `list`, which is why it throws the error.

You need to install NumPy to declare an array, like below:

import numpy as np
tag = np.array(("oracle","java"))

Just for reference, if you use a `List` in Scala, it will also give an error:

scala> val ar = List("oracle","java")
ar: List[String] = List(oracle, java)

scala> df.withColumn("newcol", lit(ar)).printSchema
java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.$colon$colon List(oracle, java)
  at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
  at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:164)
  at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:164)
  at scala.util.Try.getOrElse(Try.scala:79)
  at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:163)
  at org.apache.spark.sql.functions$.typedLit(functions.scala:127)
  at org.apache.spark.sql.functions$.lit(functions.scala:110)
Nikhil Suthar
  • no, it is still giving an error. `tag = ["oracle","java"]; tag2 = np.array(tag)` works, but `df.withColumn("tag", lit(tag2))` again throws an error – stack0114106 Dec 30 '19 at 14:30
  • Why are you using `tag2 = np.array(tag)`? You should use `tag = np.array(("oracle","java"))`, as I mentioned. – Nikhil Suthar Dec 31 '19 at 05:52
0

Spark 3.4+

F.lit(["oracle", "java"])

Full example:

from pyspark.sql import functions as F

df = spark.range(5)
df = df.withColumn("tags", F.lit(["oracle", "java"]))

df.show()
# +---+--------------+
# | id|          tags|
# +---+--------------+
# |  0|[oracle, java]|
# |  1|[oracle, java]|
# |  2|[oracle, java]|
# |  3|[oracle, java]|
# |  4|[oracle, java]|
# +---+--------------+
ZygD