
Following up on the question that I posted here:

Spark Mllib - Scala

I have another doubt... Is it possible to transform a dataset like this:

2,1,3
1
3,6,8

Into this:

2,1
2,3
1,3
1
3,6
3,8
6,8

Basically I want to discover all the relationships between the movies. Is it possible to do this?

My current code is:

val input = sc.textFile("PATH")
val raw = input.lines.map(_.split(",")).toArray
val twoElementArrays = raw.flatMap(_.combinations(2))
val result = twoElementArrays ++ raw.filter(_.length == 1)

1 Answer


Given that input is a multi-line string:

scala> val raw = input.lines.map(_.split(",")).toArray
raw: Array[Array[String]] = Array(Array(2, 1, 3), Array(1), Array(3, 6, 8))

The following approach discards one-element arrays (the line 1 in your example).

scala> val twoElementArrays = raw.flatMap(_.combinations(2))
twoElementArrays: Array[Array[String]] = Array(Array(2, 1), Array(2, 3), Array(1, 3), Array(3, 6), Array(3, 8), Array(6, 8))

That can be fixed by appending the filtered raw collection:

scala> val result = twoElementArrays ++ raw.filter(_.length == 1)
result: Array[Array[String]] = Array(Array(2, 1), Array(2, 3), Array(1, 3), Array(3, 6), Array(3, 8), Array(6, 8), Array(1))

I believe the order of elements within each combination is not relevant here.


Update: SparkContext.textFile returns an RDD of lines, so this could be plugged in as:
val raw = rdd.map(_.split(","))
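For completeness, here is a self-contained sketch of the whole transformation as a single function. The helper name pairsOf and the paths in the comments are my own, hypothetical choices; the pairing logic itself is plain Scala, so it can be checked without a cluster:

```scala
// Turn one CSV line into all 2-element combinations (kept as CSV lines);
// a single-element line is passed through unchanged, matching the
// filter(_.length == 1) step from the answer.
def pairsOf(line: String): Seq[String] = {
  val items = line.split(",")
  if (items.length == 1) Seq(line)
  else items.combinations(2).map(_.mkString(",")).toSeq
}

// With Spark (not run here), this would plug in as:
// val result = sc.textFile("PATH").flatMap(pairsOf)
// result.saveAsTextFile("OUT_PATH")

// Local check against the example in the question:
val output = Seq("2,1,3", "1", "3,6,8").flatMap(pairsOf)
// output: Seq("2,1", "2,3", "1,3", "1", "3,6", "3,8", "6,8")
```

Because each line is handled independently, this maps cleanly onto a single flatMap over the RDD of lines.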
  • Hi Tomasz Błachut, many thanks!!! I'm getting an error when I submit the raw variable... I updated the code that I'm using. The error is: error: value lines is not a member of org.apache.spark.rdd.RDD[String] – SaCvP Sep 05 '16 at 13:35
  • @PedroRodgers Well yes, I've written that input is a multi-line string, not an RDD of lines. I'll update the answer with a solution coded by hand, but I don't have Spark on this machine to test it. – Tomasz Błachut Sep 05 '16 at 13:40
  • 1
    Don't worry I just remove the .lines and it works ;) – SaCvP Sep 05 '16 at 13:41
  • @PedroRodgers Good :) If you want you can also fiddle with Set.subsets, either nullary overload or unary one, take a look at http://www.scala-lang.org/api/current/#scala.collection.Set and http://stackoverflow.com/a/13116344/1879175 – Tomasz Błachut Sep 05 '16 at 13:54
  • Only one more thing: when I try to save the result to HDFS it gives me a file with [Ljava.lang.String;@44cc2153... do you know what this is? – SaCvP Sep 05 '16 at 14:18
  • @PedroRodgers I remember some things only vaguely... What is the type of the value you are trying to save? Maybe run a toString or mkString on it, possibly as an argument to .map. – Tomasz Błachut Sep 05 '16 at 14:55
  • I'm trying with this: result.mkString(",").saveAsTextFile("/user/cloudera/Output/Combinations") But I'm getting this error: error: value mkString is not a member of org.apache.spark.rdd.RDD[Array[String]] – SaCvP Sep 05 '16 at 16:43
  • @PedroRodgers maybe `result.map(_.mkString(","))` ? – Tomasz Błachut Sep 05 '16 at 16:47
  • No problem. So a takeaway from this problem is that the array underlying Scala's `Array`, i.e. Java's `[]` array, doesn't work well with the `toString` method – Tomasz Błachut Sep 05 '16 at 17:00
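That last point is worth a small illustration. Java arrays inherit toString from Object, so writing an RDD[Array[String]] directly produces the [Ljava.lang.String;@... form; mapping each array through mkString first yields readable lines. A plain-Scala sketch (the saveAsTextFile call is only indicated in a comment, and the output path is hypothetical):

```scala
val pair = Array("2", "1")

// Array inherits Object.toString, so this is "[Ljava.lang.String;@<hashcode>",
// which is exactly what ended up in the HDFS file.
val bad = pair.toString

// mkString joins the elements into the intended CSV line instead.
val good = pair.mkString(",")
// good == "2,1"

// So before saving an RDD[Array[String]], convert each array to a line:
// result.map(_.mkString(",")).saveAsTextFile("/user/cloudera/Output/Combinations")
```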