Let's say I am importing a flat file from HDFS into Spark using something like the following:

val data = sc.textFile("hdfs://name_of_file.tsv").map(_.split('\t'))

This will produce an RDD[Array[String]]. If I wanted tuples instead of arrays, I could do as referenced in this solution and map each array to a tuple.

val dataToTuple = data.map{ case Array(x,y) => (x,y) }

But what if my input data has, say, 100 columns? Is there a way in Scala, using some sort of wildcard, to say

val dataToTuple = data.map{ case Array(x,y, ... ) => (x,y, ...) }

without having to write out 100 variables to match on?

I tried doing something like

val dataToTuple = data.map{ case Array(_) => (_) }

but that didn't seem to make much sense.

o-90
  • Why would you want a tuple with 100 elements? Just use the array that `split` produces? – The Archetypal Paul May 12 '16 at 19:43
  • You can create a `Row` instead of a `Tuple` – Alberto Bonsanto May 12 '16 at 19:47
  • 1
    if you really need that - you can use Shapeless library: http://stackoverflow.com/a/19901310/1809978, but be aware that maximum size of tuple is limited in scala to 22 (last time I was checking it) + I believe, you still have to specify type per column. Besides, it might not be what you actually need – dk14 May 12 '16 at 19:48

1 Answer

If your columns are homogeneous (for example, all strings), a tuple may not be the best way to improve type safety. About all you can do is fix the size of the array, using a sized list from the Shapeless library:

How to require typesafe constant-size array in scala?

This is the right approach if your columns are unnamed. For instance, your row might be a representation of a vector in Euclidean space.
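As a rough illustration of the fixed-size idea, here is a minimal sketch in plain Scala. Note that it is a stand-in, not Shapeless: the width is checked at runtime when the row is built, whereas a Shapeless sized collection carries the size in its type. The names `FixedRow` and `parse` are hypothetical, invented for this example.

```scala
// Hypothetical wrapper: a row whose width is validated once, at construction.
// This is a runtime check, not the compile-time guarantee Shapeless gives.
final class FixedRow private (val values: Vector[String]) {
  def apply(i: Int): String = values(i)
}

object FixedRow {
  // Returns None when the line does not have exactly `width` columns.
  def parse(line: String, width: Int): Option[FixedRow] = {
    // A limit of -1 keeps trailing empty columns, which split drops by default.
    val cols = line.split("\t", -1)
    if (cols.length == width) Some(new FixedRow(cols.toVector)) else None
  }
}
```

Rows of the wrong width come back as `None`, so something like `lines.flatMap(line => FixedRow.parse(line, 100))` would keep only well-formed rows and let the rest of the code index columns without re-checking the length.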

Otherwise (named columns, possibly of different types), it's better to model the row with a case class, but be aware of the size restriction. This might help you quickly map an array (or parts of it) to an ADT: https://stackoverflow.com/a/19901310/1809978
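A minimal sketch of the case-class route, assuming a made-up three-column schema (the `Reading` class, its field names, and `fromColumns` are all hypothetical). It also shows the closest thing Scala has to the wildcard asked about: `_*` in an `Array` pattern matches any remaining elements, so only the columns you care about need names.

```scala
import scala.util.Try

// Hypothetical schema: the field names and types are assumptions for illustration.
final case class Reading(id: String, sensor: String, value: Double)

object Reading {
  // Builds a Reading from the first three columns and ignores the rest.
  def fromColumns(cols: Array[String]): Option[Reading] = cols match {
    // `_*` matches zero or more trailing columns without naming them.
    case Array(id, sensor, value, _*) =>
      // A non-numeric value yields None instead of throwing.
      Try(value.toDouble).toOption.map(v => Reading(id, sensor, v))
    case _ =>
      None // fewer than three columns
  }
}
```

With the RDD from the question, this could probably be applied as `data.flatMap(cols => Reading.fromColumns(cols))`, which drops rows that are too short or fail to parse rather than raising a `MatchError`.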

dk14