How to remove square braces from PySpark DataFrame column values

Question

I am creating a PySpark DataFrame by selecting a column from another DataFrame, converting it to an RDD, zipping it with an index, and converting the result back to a DataFrame:

df_tmp=o[1].select("value").rdd.zipWithIndex().toDF()

o[1] is a DataFrame. The value column in o[1] looks like this:

+-----+
|value|
+-----+
|    0|
|    0|
|    0|
+-----+

o[1].printSchema():
root
 |-- value: integer (nullable = true)

In this process, the entries of "value" pick up extra square braces, as shown below:

+---+---+
| _1| _2|
+---+---+
|[0]|  0|
|[0]|  1|
+---+---+

df_tmp.printSchema():
root
 |-- _1: struct (nullable = true)
 |    |-- value: long (nullable = true)
 |-- _2: long (nullable = true)

When I write this to a Hive table with saveAsTable(), it causes problems because the values are written as {"value":0}. I just want the plain value: 0.

How can I get rid of the extra braces in this DataFrame so that normal integer values are written to the Hive table?

it's throwing me error: u"Field name should be String Literal, but it's 0;" — muni, Aug 07 '18 at 10:39
What about `df_tmp.withColumn("_1new", df_tmp._1.getItem(0))`? Sorry, it's quite hard to reproduce your code without knowing what `o` exactly is... Or something like `df_tmp.withColumn("_1new", df_tmp._1.value)` — dorvak, Aug 07 '18 at 10:45
See https://stackoverflow.com/questions/48062171/extracting-values-from-a-spark-column-containing-nested-values?rq=1 for a similar example — dorvak, Aug 07 '18 at 10:49
yea, this worked: df_tmp.withColumn("_1new", df_tmp._1.value) — muni, Aug 07 '18 at 11:13

score 0 · Answer 1 · answered Aug 07 '18 at 11:34

(Writing this as an answer instead of a comment):

df_tmp.withColumn("_1new", df_tmp._1.value)

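This creates a new column named "_1new" that holds the "value" field of the struct column "_1". The square braces are only how show() renders a struct: zipWithIndex() pairs each Row with its index, so toDF() turns the Row into a struct column instead of a plain integer column.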
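
As a rough sketch of two ways to end up with flat columns (not necessarily what the asker finally used): the first unpacks each (Row, index) pair before calling toDF(), so no struct is created in the first place; the second keeps the existing df_tmp and selects the nested field by its dotted path. The function names flatten_pair, zip_with_index_flat and unwrap_struct, and the output column name "index", are made up for illustration. Both DataFrame helpers assume an active Spark session and a single selected column.

```python
def flatten_pair(pair):
    """Turn one (Row, index) pair from zipWithIndex() into a flat tuple.

    A Row behaves like a tuple, so row[0] is the single selected value.
    """
    row, index = pair
    return (row[0], index)


def zip_with_index_flat(df, value_col="value", index_col="index"):
    """Hypothetical helper: select one column and attach a row index.

    Unpacking the Row before toDF() keeps the value a plain column
    instead of a struct.
    """
    return (
        df.select(value_col)
        .rdd.zipWithIndex()
        .map(flatten_pair)
        .toDF([value_col, index_col])
    )


def unwrap_struct(df_tmp):
    """Hypothetical helper: flatten an existing zipWithIndex().toDF() result.

    Assumes the columns are named "_1" (struct with a "value" field)
    and "_2" (index), as in the schema shown in the question.
    """
    return df_tmp.select("_1.value", "_2").toDF("value", "index")
```

With the question's data, zip_with_index_flat(o[1]) or unwrap_struct(df_tmp) would be the calls to try before saveAsTable().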

answered Aug 07 '18 at 11:34

dorvak
