0

I'm trying to transform a Spark dataframe into a Scalar map and additionally a list of values.

It is best illustrated as follows:

val df = sqlContext.read.json("examples/src/main/resources/people.json")
df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
|  21|Michael|
+----+-------+

To a Scala collection (Map of Maps(List(values))) represented like this:

Map(
  (0 -> List(Map("age" -> null, "name" -> "Michael"), Map("age" -> 21, "name" -> "Michael"))),
  (1 -> Map("age" -> 30, "name" -> "Andy")),
  (2 -> Map("age" -> 19, "name" -> "Justin"))
)

As I don't know much about Scala, I wonder if this method is possible. It doesn't matter if it's not necessarily a List.

Adrian Mole
  • 49,934
  • 160
  • 51
  • 83
  • You can certainly call collect on the DataFrame--although that's going to give a Collection of Row objects (which is probably not what you want). But the Row object itself is sort of a map. So if you wanted the "age" column you could call `row.getAs[Int]("age")`. – Jared Jul 08 '22 at 07:23
  • Take a look to this answer: https://stackoverflow.com/a/69583725/6802156 – Emiliano Martinez Jul 08 '22 at 07:30

1 Answers1

0

The data structure you want is actually useless. Let me explain what I mean by asking 2 questions:

    1. What is the purpose of the integers of the outside map? are those indices? What is the logic of those indices? If those are indices, why not just use Array?
    1. Why to use Map[String, Any] and do unsafe element accessing, while you can model into case classes?

So I think the best thing you can do would be this:

case class Person(name: String, age: Option[Int])
val persons = df.as[Person].collect
val personsByName: Map[String, Array[Person]] = persons.groupBy(_.name)

Result would be:

Map(
  Michael -> Array(Person(Michael, None), Person(Michael, Some(21)),
  Andy -> Array(Person(Andy, Some(30))),
  Justin -> Array(Person(Justin, Some(19)))
)

But still, if you insist on the data structure, this is the code you need to use:

val result: Map[Int, List[Map[String, Any]]] =
  persons.groupBy(_.name)       // grouping persons by name
  .zipWithIndex                 // coupling index with values of array
  .map { 
    case ((name, persons), index) => 
      // put index as key, map each person to the desired map
      index -> persons.map(p => Map("age" -> p.age, "name" -> p.name)).toList 
    }
AminMal
  • 3,070
  • 2
  • 6
  • 15