I am creating a PySpark DataFrame by selecting a column from another DataFrame, converting it to an RDD, zipping it with an index, and converting the result back to a DataFrame:

df_tmp=o[1].select("value").rdd.zipWithIndex().toDF()

o[1] is a DataFrame. Its "value" column looks like this:

+-----+
|value|
+-----+
|    0|
|    0|
|    0|
+-----+
o[1].printSchema()
root
 |-- value: integer (nullable = true)

In the process, the values in "value" pick up extra square brackets:

+---+---+
| _1| _2|
+---+---+
|[0]|  0|
|[0]|  1|
+---+---+

df_tmp.printSchema():
root
 |-- _1: struct (nullable = true)
 |    |-- value: long (nullable = true)
 |-- _2: long (nullable = true)

This causes problems when I write to a Hive table with saveAsTable(): the values are written as {"value":0}, but I just want 0.

How can I get rid of the extra brackets in this DataFrame so that plain integer values are written to the Hive table?

muni
  • It throws this error: u"Field name should be String Literal, but it's 0;" – muni Aug 07 '18 at 10:39
  • What about `df_tmp.withColumn("_1new", df_tmp._1.getItem(0))`? Sorry, it's quite hard to reproduce your code without knowing exactly what `o` is... Or something like `df_tmp.withColumn("_1new", df_tmp._1.value)` – dorvak Aug 07 '18 at 10:45
  • Same error. o[1] is a DataFrame. – muni Aug 07 '18 at 10:48
  • See https://stackoverflow.com/questions/48062171/extracting-values-from-a-spark-column-containing-nested-values?rq=1 for a similar example – dorvak Aug 07 '18 at 10:49
  • Yes, this worked: df_tmp.withColumn("_1new", df_tmp._1.value) – muni Aug 07 '18 at 11:13
  • Quick question: is 'value' a column name or a keyword? – muni Aug 07 '18 at 11:15

1 Answer

(Writing this as an answer instead of a comment):

df_tmp.withColumn("_1new", df_tmp._1.value)

This creates a new column named "_1new" that holds the "value" field of the "_1" struct.

dorvak