Array to Tuple in Spark with many input variables

Question

Let's say I am importing a flat file from HDFS into Spark using something like the following:

val data = sc.textFile("hdfs://name_of_file.tsv").map(_.split('\t'))

This will produce an RDD[Array[String]]. If I wanted an RDD of tuples, I could do as referenced in this solution and map the elements of each array to a tuple.

val dataToTuple = data.map{ case Array(x,y) => (x,y) }

But what if my input data has, say, 100 columns? Is there a way in Scala, using some sort of wildcard, to say

val dataToTuple = data.map{ case Array(x,y, ... ) => (x,y, ...) }

without having to write out 100 variables to match on?

I tried doing something like

val dataToTuple = data.map{ case Array(_) => (_) }

but that didn't seem to make much sense.

Why would you want a tuple with 100 elements? Just use the array that `split` produces? — The Archetypal Paul, May 12 '16 at 19:43
If you really need that, you can use the Shapeless library: http://stackoverflow.com/a/19901310/1809978. Be aware, though, that the maximum tuple size in Scala is limited to 22 (last time I checked), and I believe you still have to specify the type per column. Besides, it might not be what you actually need — dk14, May 12 '16 at 19:48

score 1 · Accepted Answer · edited May 23 '17 at 11:59

If your data columns are homogeneous (like an Array of Strings), a tuple may not be the best way to improve type safety. All you can do is fix the size of your array using a sized list from the Shapeless library:

How to require typesafe constant-size array in scala?

This is the right approach if your columns are unnamed. For instance, your row might represent a vector in Euclidean space.
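As a rough illustration of the fixed-size idea without pulling in Shapeless, here is a minimal sketch in plain Scala. Note that this swaps the compile-time guarantee of a Shapeless sized collection for a runtime length check, so it is a weaker substitute rather than the same technique. The names `Row`, `Width` and `fromColumns` are hypothetical, and the width of 100 is taken from the question.

```scala
// Hypothetical wrapper: a row that is only constructible with exactly Width columns.
// The check happens at runtime, unlike Shapeless' Sized, which checks at compile time.
final class Row private (val values: Vector[String]) {
  def apply(i: Int): String = values(i)
}

object Row {
  // Assumed column count, taken from the 100-column example in the question.
  val Width = 100

  // Left carries a description of the problem, Right carries the validated row.
  def fromColumns(cols: Array[String]): Either[String, Row] =
    if (cols.length == Width) Right(new Row(cols.toVector))
    else Left(s"expected $Width columns, got ${cols.length}")
}
```

Since `fromColumns` is an ordinary function, it could presumably be passed to `data.map(Row.fromColumns)` on the RDD from the question, leaving malformed lines as `Left` values to filter or log.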

Otherwise (named columns, possibly of different types), it's better to model the row with a case class, but be aware of the size restriction. This might help you quickly map an array (or parts of it) to an ADT: https://stackoverflow.com/a/19901310/1809978
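To make the case-class route concrete, here is a minimal hand-written sketch that avoids matching on all 100 elements by picking the needed columns by index. The `Record` type, its field names, the column positions and the expected width of 100 are all assumptions for illustration, not something taken from your data, and this is plain Scala rather than the Shapeless-based mapping from the linked answer.

```scala
import scala.util.Try

// Hypothetical record: only the columns of interest are modelled.
final case class Record(id: String, name: String, score: Double)

object Record {
  // Assumed number of columns per line, as in the question.
  val ExpectedColumns = 100

  // Returns None if the row has the wrong width or the score column is not a number.
  def fromColumns(cols: Array[String]): Option[Record] =
    if (cols.length != ExpectedColumns) None
    else
      // Column positions 0, 1 and 2 are assumed; adjust to the real layout.
      Try(cols(2).toDouble).toOption.map(s => Record(cols(0), cols(1), s))
}
```

With the RDD from the question, something like `data.flatMap(r => Record.fromColumns(r))` should then give an `RDD[Record]` that silently drops malformed lines; returning an `Either` instead would keep the failures visible.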
