
Given a deeply nested Parquet struct like so:

|-- bet: struct (nullable = true)
|    |-- sides: array (nullable = true)
|    |    |-- element: struct (containsNull = true)
|    |    |    |-- side: string (nullable = true)
|    |    |    |-- betID: string (nullable = true)
|    |    |    |-- secondarybetID: string (nullable = true)
|    |    |    |-- parties: struct (nullable = true)
|    |    |    |    |-- partyIDs: array (nullable = true)
|    |    |    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |    |    |-- partyID: string (nullable = true)
|    |    |    |    |    |    |-- partyRole: integer (nullable = true)
|    |    |    |    |    |    |-- partySubGrp: struct (nullable = true)
|    |    |    |    |    |    |    |-- partySubIDs: array (nullable = true)
|    |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |    |    |    |    |    |-- partySubID: string (nullable = true)
|    |    |    |    |    |    |    |    |    |-- partySubIDType: integer (nullable = true)

A bet has several sides, but we are interested only in the first side in the sides array. How can I find the parties in that side whose partyRole is 10?

In Presto SQL I can do something like:

 SELECT
     filter(bet.sides[1].parties.partyIDs, x -> x.partyRole = 10)[1] AS party10
 FROM parquetbets
 WHERE cardinality(filter(bet.sides[1].parties.partyIDs, x -> x.partyRole = 10)) > 0

How do I do the same in Spark 2 SQL?

 SELECT bet.sides[1] from parquetbets 

In Spark 2 SQL the above returns an array with no scope for further pruning of the nested structure, i.e.

 SELECT bet.sides[1].parties from parquetbets

returns null. I have tried a few combinations, but the results come back as WrappedArray elements, which do not provide a mechanism to query the nested data. In Presto SQL the results contain the field names, so it is easy to continue and probe deeper into the structure.

Can someone point out how Spark 2 SQL can support this? And if Spark 2 SQL cannot, how could Spark DataFrames do it?

Manfred Moser
sunny

1 Answer


Silly question: have you considered using the Dataset API with encoders? It provides a functional API to reason about your problem (which is much easier to solve functionally).

Otherwise, consider exploding your arrays to reason about flattened data (see org.apache.spark.sql.functions.explode).
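The explode route can be sketched like this (an untested sketch, assuming the same parquetbets data and a SparkSession named spark; note that Spark array indexing is 0-based, unlike Presto's 1-based indexing):

  import org.apache.spark.sql.functions._

  // take the first side, then produce one row per party of that side,
  // keeping only the parties with partyRole == 10
  val party10 = spark.read.load("my-bets.parquet")
    .select(col("bet.sides").getItem(0).alias("firstSide"))
    .select(explode(col("firstSide.parties.partyIDs")).alias("party"))
    .filter(col("party.partyRole") === 10)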

Example in Scala:

  case class PartyId(partyID: String, partyRole: Int)
  case class Party(partyIDs: Seq[PartyId])
  case class Side(side: String, betID: String, parties: Party)
  case class Bet(sides: Seq[Side])

  import org.apache.spark.sql.Dataset
  import spark.implicits._

  // select bet.* so the case class Bet maps onto the fields nested
  // under the top-level bet column
  val ds = spark.read.load("my-bets.parquet").select("bet.*").as[Bet]

  val firstSidesDS = ds.flatMap(_.sides.headOption) // take the first side, if it exists

  // keep first sides that have at least one party with partyRole == 10
  val result: Dataset[Side] = firstSidesDS.filter(_.parties.partyIDs.exists(_.partyRole == 10))
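For completeness: if you can move to Spark 2.4 or later, higher-order functions were added to Spark SQL there, and the Presto query then translates almost directly — filter keeps its name, cardinality becomes size, and array indexing is 0-based rather than Presto's 1-based (untested sketch):

 SELECT
     filter(bet.sides[0].parties.partyIDs, x -> x.partyRole = 10)[0] AS party10
 FROM parquetbets
 WHERE size(filter(bet.sides[0].parties.partyIDs, x -> x.partyRole = 10)) > 0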
baitmbarek
  • Can you give an example on how to do that ? – sunny Nov 16 '19 at 17:19
  • are you using Scala, Java or Python ? – baitmbarek Nov 16 '19 at 17:31
  • Just added an example with Scala and Dataset API – baitmbarek Nov 16 '19 at 17:40
  • Using spark sql and Java. I wanted to demonstrate sparksqls capability to query parquet data by just writing sql. In this case it seems it may have fallen short. Prestosql worked as expected. I am kind of disappointed that sparksql returns wrapped arrays which don't help in further 'discovering' data. Thanks will try out what you have suggested though – sunny Nov 16 '19 at 18:12
  • this would cause me to describe the schema again using the encoders which seems counter-intuitive to having the schema already ? – sunny Nov 16 '19 at 20:51
  • Plan B (non trivial one): define an udf with a recursive function with some Pattern Matching (based on the fact a struct is a Row). Can you share some mocked data ? – baitmbarek Nov 16 '19 at 22:59