What I'm trying to achieve is, for the following DataFrame:
-------------------------
| FOO | BAR | BAZ |
| lorem | ipsum | dolor |
| sit | amet | dolor |
| lorem | lorem | dolor |
-------------------------
Generate the following output:
Map(
FOO -> List("lorem", "sit"),
BAR -> List("ipsum", "amet", "lorem"),
BAZ -> List("dolor")
)
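To make the target output concrete, here is the same transformation written against plain Scala collections (no Spark involved), using the sample rows above — for each column, collect its distinct values in row order:

```scala
// Sample data from the table above, modeled as plain collections.
val columns = Seq("FOO", "BAR", "BAZ")
val rows = Seq(
  Seq("lorem", "ipsum", "dolor"),
  Seq("sit",   "amet",  "dolor"),
  Seq("lorem", "lorem", "dolor")
)

// For each column index, gather that column's values and de-duplicate them,
// keeping first-seen order.
val expected: Map[String, List[String]] =
  columns.zipWithIndex.map { case (name, i) =>
    name -> rows.map(_(i)).distinct.toList
  }.toMap
// expected == Map(FOO -> List(lorem, sit), BAR -> List(ipsum, amet, lorem), BAZ -> List(dolor))
```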
This is the Scala code that I've come up with:

import org.apache.spark.sql.functions.col

val df = data.distinct

df.columns.map { key =>
  val distinctValues = df
    .select(col(key))            // one Spark job per column
    .collect                     // pulls every value to the driver
    .map(row => row.getString(0))
    .toList
    .distinct
  (key, distinctValues)
}.toMap
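One alternative I've been considering (a sketch only — it assumes no column contains nulls, I haven't benchmarked it, and collect_set does not guarantee the ordering shown in the example output) is to compute all the distinct sets in a single aggregation instead of one job per column:

```scala
import org.apache.spark.sql.functions.{col, collect_set}

// Aggregate every column in one pass: one collect_set per column,
// all evaluated in a single Spark job.
val aggRow = df
  .select(df.columns.map(c => collect_set(col(c)).as(c)): _*)
  .first()

// Unpack the single result row into the desired Map.
val result: Map[String, List[String]] =
  df.columns.zipWithIndex.map { case (name, i) =>
    name -> aggRow.getSeq[String](i).toList
  }.toMap
```

This trades the N collect-per-column round trips for one aggregation, at the cost of losing any deterministic ordering of the value lists.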
I have also tried a close alternative to this code using RDDs, which was about 30% faster, but the core problem remains: all of this is extraordinarily inefficient.
I am running Spark locally against a local Cassandra instance hosting a sample dataset of only 1,000 rows, yet these operations generate an enormous amount of log output and take more than 7 seconds to complete.
Am I doing something wrong, or is there a better way of doing this?