I'm using Spark SQL to pull rows from a table. Some of this data is recurring, and I'm trying to count the number of occurrences. In essence, I'm trying to perform the basic "word count" example, but instead of my data being of the form (Word : String, Count : Int), each "word" is an entire row of data.
More specifically, my data looks like RDD[((row), count)], where row is pulled from the SQL table and contains strings, doubles, ints, etc.
It's in RDD form because I want to use reduceByKey (see: Avoid groupByKey). Each element is a (Key, Value) pair with a very long key (some row from the SQL database) and its value being the "word count".
My app is doing this:
myDataframe
// Convert to an RDD so we can use the reduceByKey method
.rdd
// Append a 1 to each row
.map(row => (row, 1))
// Add up the 1's corresponding to matching keys
.reduceByKey(_ + _)
// Filter for rows that show up more than 100 times
.filter(_._2 > 100)
...
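For context, a stripped-down but self-contained version of the job looks roughly like this (the table, column, and app names are placeholders I've made up for this post):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("row-count").getOrCreate()

// Hypothetical table with a string, a double, and an int column
val myDataframe = spark.sql("SELECT name, score, qty FROM some_table")

val counted = myDataframe
  .rdd                   // drop to an RDD so reduceByKey is available
  .map(row => (row, 1))  // pair each row with a 1
  .reduceByKey(_ + _)    // sum the 1's for identical rows
  .filter(_._2 > 100)    // keep rows that appear more than 100 times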
Now let's say my row data contains (string, double, int). This is where I want to unpack my data from RDD[((string, double, int), count)] to RDD[(string, double, int, count)] so that I can eventually save it to another SQL table.
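Once I have it in that flat shape, the save step will look something like this (the JDBC URL, table name, and column names are made up, and unpacked stands for the RDD[(String, Double, Int, Int)] I'm trying to produce):

import java.util.Properties
import spark.implicits._  // needed for .toDF on an RDD of tuples

// unpacked: RDD[(String, Double, Int, Int)] -- the flat shape I'm after
unpacked
  .toDF("name", "score", "qty", "count")
  .write
  .mode("append")
  .jdbc("jdbc:postgresql://host:5432/mydb", "row_counts", new Properties())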
Is there some method that allows me to unpack the contents of this ... nested tuple ... sort of thing?
My solution has been to "unpack" the elements of the RDD like so:
.map(row => (row._1._1, row._1._2, row._1._3, row._2))
But there must be a better way! If I decide to grab more elements from the row, I'd have to modify this .map() call.
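The only alternative I've come up with so far is writing the same thing as a pattern match, which reads a little better but still has to be edited by hand whenever the row gains a column:

// same as the ._1._1 version above, just spelled as a pattern match
.map { case ((s, d, i), count) => (s, d, i, count) }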
Thanks!