1

I have a Dataframe that I am trying to flatten. As part of the process, I want to explode it, so if I have a column of arrays, each value of the array will be used to create a separate row.I know i can use explode function. But, i have a problem, the column contains null value and i use spark 1.6. Here is a example of the type of data and what i want :
My data :

id | ListOfRficAction| RficActionAttachment
_______________________________
1  | Luke            | [baseball, soccer]
2  | Lucy            | null

and i want

id | ListOfRficAction| RficActionAttachment
_______________________________
1  | Luke            | baseball
1  | Luke            | soccer
2  | Lucy            | null

I am using Spark 1.6 (so i can't use explode_outer function) and i try to use explode but i have the following error :

 scala.MatchError: [null] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)

i also try :

df.withColumn("likes", explode(
  when(col("likes").isNotNull, col("likes"))
    // If null explode an array<string> with a single null
    .otherwise(array(lit(null).cast("string")))))

but my DataFrame schema is a lot complex (i have string and long), so cast function doesn't work. Here is a part of my schema and the error i have :

 |-- RficActionAttachment: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- ActivityFileAutoUpdFlg: string (nullable = true)
 |    |    |-- ActivityFileDate: string (nullable = true)
 |    |    |-- ActivityFileDeferFlg: string (nullable = true)
 |    |    |-- ActivityFileDockReqFlg: string (nullable = true)
 |    |    |-- ActivityFileDockStatFlg: string (nullable = true)
 |    |    |-- ActivityFileExt: string (nullable = true)
 |    |    |-- ActivityFileName: string (nullable = true)
 |    |    |-- ActivityFileRev: string (nullable = true)
 |    |    |-- ActivityFileSize: long (nullable = true)
 |    |    |-- ActivityFileSrcPath: string (nullable = true)
 |    |    |-- ActivityFileSrcType: string (nullable = true)
 |    |    |-- ActivityId: string (nullable = true)
 |    |    |-- AttachmentId: string (nullable = true)
 |    |    |-- Comment: string (nullable = true)

User class threw exception:

org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN isnotnull(ListOfRficAction.RficAction.ListOfRficActionAttachment.RficActionAttachment) THEN ListOfRficAction.RficAction.ListOfRficActionAttachment.RficActionAttachment ELSE array(ListOfRficAction.RficAction.ListOfRficActionAttachment.RficActionAttachment)' 

due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;

Have idea of what i can do ?

mrsrinivas
  • 34,112
  • 13
  • 125
  • 125
Mbula Guy Marcel
  • 81
  • 1
  • 4
  • 11

1 Answers1

2

First replace all null values in the column will array(null), then use explode. Using the example dataframe in the question:

val df = Seq((1, "Luke", Array("baseball", "soccer")), (2, "Lucy", null))
  .toDF("id", "ListOfRficAction", "RficActionAttachment")

df.withColumn("RficActionAttachment", 
    when($"RficActionAttachment".isNull, array(lit(null)))
    .otherwise($"RficActionAttachment"))
  .withColumn("RficActionAttachment", explode($"RficActionAttachment"))

This will give the requested result:

+---+----------------+--------------------+
| id|ListOfRficAction|RficActionAttachment|
+---+----------------+--------------------+
|  1|            Luke|            baseball|
|  1|            Luke|              soccer|
|  2|            Lucy|                null|
+---+----------------+--------------------+
Shaido
  • 27,497
  • 23
  • 70
  • 73
  • thak you @Shaido for answer but as i said i try this and i still have the same error : cannot resolve 'CASE WHEN isnull(ListOfRficAction.RficAction.ListOfRficActionAttachment.RficActionAttachment) THEN array(null) ELSE ListOfRficAction.RficAction.ListOfRficActionAttachment.RficActionAttachment' Maybe due to my dataframe schema – Mbula Guy Marcel Oct 23 '17 at 09:14
  • @MbulaGuyMarcel the dataframe schema shouldn't matter, the above should work if you have an array. Made a small update to the code, can you try again? – Shaido Oct 23 '17 at 09:43
  • am sorry , it doesn't work – Mbula Guy Marcel Oct 23 '17 at 12:16
  • here is my DF : *********************************isAttachment +--------+--------------------+--------------------------+ | AssetId| ListOfRficAction|ListOfRficActionAttachment| +--------+--------------------+--------------------------+ |1-KTOUUY|[WrappedArray([2,...| null| |1-KTOUUY|[WrappedArray([2,...| [WrappedArray([Y,...| +--------+--------------------+--------------------------+ and i try to flat ListOfRficActionAttachment – Mbula Guy Marcel Oct 23 '17 at 12:21
  • 1
    @MbulaGuyMarcel from your error message, it looks like the `.otherwise($"RficActionAttachment")` part in the code above does not give an array. It's is necessary to do it in two stages as above code. You could also try using ` df.withColumn("RficActionAttachment", explode_outer($"RficActionAttachment"))` which should achieve the same thing. – Shaido Oct 23 '17 at 13:48