
Following up on the question that I posted here:

Spark Mllib - Scala

I have another doubt... Is it possible to transform a dataset like this:

2,1,3
1
3,6,8

Into this:

2,1
2,3
1,3
1
3,6
3,8
6,8

Basically I want to discover all the relationships between the movies. Is it possible to do this?

My current code is:

val input = sc.textFile("PATH")
val raw = input.lines.map(_.split(",")).toArray
val twoElementArrays = raw.flatMap(_.combinations(2))
val result = twoElementArrays ++ raw.filter(_.length == 1)

1 Answer


Given that input is a multi-line string:

scala> val raw = input.lines.map(_.split(",")).toArray
raw: Array[Array[String]] = Array(Array(2, 1, 3), Array(1), Array(3, 6, 8))

The following approach discards one-element arrays (the line 1 in your example).

scala> val twoElementArrays = raw.flatMap(_.combinations(2))
twoElementArrays: Array[Array[String]] = Array(Array(2, 1), Array(2, 3), Array(1, 3), Array(3, 6), Array(3, 8), Array(6, 8))

That can be fixed by appending the filtered raw collection:

scala> val result = twoElementArrays ++ raw.filter(_.length == 1)
result: Array[Array[String]] = Array(Array(2, 1), Array(2, 3), Array(1, 3), Array(3, 6), Array(3, 8), Array(6, 8), Array(1))

I believe the order of elements within each combination is not relevant here.


Update: SparkContext.textFile returns an RDD of lines, so this could be plugged in as:
val raw = rdd.map(_.split(","))
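For completeness, here is a self-contained sketch of the whole transformation as a single function. The helper name pairsOf and the paths in the comments are my own, hypothetical choices; the pairing logic itself is plain Scala, so it can be checked without a cluster:

```scala
// Turn one CSV line into all 2-element combinations (kept as CSV lines);
// a single-element line is passed through unchanged, matching the
// filter(_.length == 1) step from the answer.
def pairsOf(line: String): Seq[String] = {
  val items = line.split(",")
  if (items.length == 1) Seq(line)
  else items.combinations(2).map(_.mkString(",")).toSeq
}

// With Spark (not run here), this would plug in as:
// val result = sc.textFile("PATH").flatMap(pairsOf)
// result.saveAsTextFile("OUT_PATH")

// Local check against the example in the question:
val output = Seq("2,1,3", "1", "3,6,8").flatMap(pairsOf)
// output: Seq("2,1", "2,3", "1,3", "1", "3,6", "3,8", "6,8")
```

Because each line is handled independently, this maps cleanly onto a single flatMap over the RDD of lines.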
  • Hi Tomasz Błachut, many thanks!!! I'm getting an error when I submit the raw variable... I updated the code that I'm using. The error is: error: value lines is not a member of org.apache.spark.rdd.RDD[String] – SaCvP Sep 05 '16 at 13:35
  • @PedroRodgers Well yes, I've written that input is a multi-line string, not an RDD of lines. I'll update the answer with a solution coded by hand, but I don't have Spark on this machine to test it. – Tomasz Błachut Sep 05 '16 at 13:40
  • 1
    Don't worry I just remove the .lines and it works ;) – SaCvP Sep 05 '16 at 13:41
  • @PedroRodgers Good :) If you want you can also fiddle with Set.subsets, either nullary overload or unary one, take a look at http://www.scala-lang.org/api/current/#scala.collection.Set and http://stackoverflow.com/a/13116344/1879175 – Tomasz Błachut Sep 05 '16 at 13:54
  • Only one more thing: when I try to save the result to HDFS it gives me a file with [Ljava.lang.String;@44cc2153... do you know what this is? – SaCvP Sep 05 '16 at 14:18
  • @PedroRodgers I remember some things only vaguely... What is the type of the value you are trying to save? Maybe run a toString or mkString on it, possibly as an argument to .map. – Tomasz Błachut Sep 05 '16 at 14:55
  • I'm trying with this: result.mkString(",").saveAsTextFile("/user/cloudera/Output/Combinations") But I'm getting this error: error: value mkString is not a member of org.apache.spark.rdd.RDD[Array[String]] – SaCvP Sep 05 '16 at 16:43
  • @PedroRodgers maybe `result.map(_.mkString(","))` ? – Tomasz Błachut Sep 05 '16 at 16:47
  • No problem. So a takeaway from this problem is that the array underlying Scala's `Array`, i.e. Java's `[]` array, doesn't work well with the `toString` method – Tomasz Błachut Sep 05 '16 at 17:00
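That last point is worth a small illustration. Java arrays inherit toString from Object, so writing an RDD[Array[String]] directly produces the [Ljava.lang.String;@... form; mapping each array through mkString first yields readable lines. A plain-Scala sketch (the saveAsTextFile call is only indicated in a comment, and the output path is hypothetical):

```scala
val pair = Array("2", "1")

// Array inherits Object.toString, so this is "[Ljava.lang.String;@<hashcode>",
// which is exactly what ended up in the HDFS file.
val bad = pair.toString

// mkString joins the elements into the intended CSV line instead.
val good = pair.mkString(",")
// good == "2,1"

// So before saving an RDD[Array[String]], convert each array to a line:
// result.map(_.mkString(",")).saveAsTextFile("/user/cloudera/Output/Combinations")
```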