I'm trying to use Spark's PrefixSpan algorithm but it is comically difficult to get the data in the right shape to feed to the algo. It feels like a Monty Python skit where the API is actively working to confuse the programmer.
My data is a list of rows, each of which contains a list of text items.
a b c c c d
b c d e
a b
...
I have made this data available two ways: as a SQL table in Hive (where each row has an array of items) and as text files where each line contains the items above.
The official example creates a Seq of Array(Array).
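For reference, here is a minimal sketch of what that shape might look like for the sample rows above. It assumes each item becomes its own single-element itemset; the names `SequenceShape`, `rows` and `sequences` are made up for illustration and are not part of Spark.

```scala
object SequenceShape {
  // The three sample rows from the question, as plain Scala collections.
  val rows: Seq[Seq[String]] = Seq(
    Seq("a", "b", "c", "c", "c", "d"),
    Seq("b", "c", "d", "e"),
    Seq("a", "b")
  )

  // Assumption: every item is wrapped in its own one-element itemset, so a
  // row of strings becomes an Array[Array[String]] (one sequence).
  val sequences: Seq[Array[Array[String]]] =
    rows.map(row => row.map(item => Array(item)).toArray)
}
```

This only shows the nesting; the data would still have to end up in an RDD before `run` accepts it.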
If I use sql, I get the following type back:
org.apache.spark.sql.DataFrame = [seq: array<string>]
If I read in text, I get this type:
org.apache.spark.sql.Dataset[Array[String]] = [value: array<string>]
Here is an example of an error I get (if I feed it data from sql):
error: overloaded method value run with alternatives:
[Item, Itemset <: Iterable[Item], Sequence <: Iterable[Itemset]](data: org.apache.spark.api.java.JavaRDD[Sequence])org.apache.spark.mllib.fpm.PrefixSpanModel[Item] <and>
[Item](data: org.apache.spark.rdd.RDD[Array[Array[Item]]])(implicit evidence$1: scala.reflect.ClassTag[Item])org.apache.spark.mllib.fpm.PrefixSpanModel[Item]
cannot be applied to (org.apache.spark.sql.DataFrame)
new PrefixSpan().setMinSupport(0.5).setMaxPatternLength(5).run( sql("select seq from sequences limit 1000") )
^
Here is an example if I feed it text files:
error: overloaded method value run with alternatives:
[Item, Itemset <: Iterable[Item], Sequence <: Iterable[Itemset]](data: org.apache.spark.api.java.JavaRDD[Sequence])org.apache.spark.mllib.fpm.PrefixSpanModel[Item] <and>
[Item](data: org.apache.spark.rdd.RDD[Array[Array[Item]]])(implicit evidence$1: scala.reflect.ClassTag[Item])org.apache.spark.mllib.fpm.PrefixSpanModel[Item]
cannot be applied to (org.apache.spark.sql.Dataset[Array[String]])
new PrefixSpan().setMinSupport(0.5).setMaxPatternLength(5).run(textfiles.map( x => x.split("\u0002")).limit(3))
^
I've tried to mold the data by using casting and other unnecessarily complicated logic.
This can't be so hard. Given a list of items (in the very reasonable format described above), how the heck do I feed it to PrefixSpan?
Edit: I'm on Spark 2.2.1.
Resolved: A column in the table I was querying had collections in each cell. This was causing the returned result to be inside a WrappedArray. I changed my query so the result column only contained a string (by concat_ws). This made it MUCH easier to deal with the type error.
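A rough sketch of the kind of conversion this amounts to, not necessarily the exact code I ended up with. It assumes Spark 2.x, a space as the `concat_ws` separator, and one item per itemset. The helpers `toSequence` and `toPrefixSpanInput` are hypothetical names, and the small in-memory DataFrame stands in for the Hive table. The key point, taken from the error message above, is that `run` wants an `RDD[Array[Array[Item]]]`, not a DataFrame or Dataset.

```scala
import org.apache.spark.mllib.fpm.PrefixSpan
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat_ws}

object PrefixSpanSketch {
  // Hypothetical helper: split one delimited line and wrap each item in its
  // own single-element itemset.
  def toSequence(line: String, sep: String): Array[Array[String]] =
    line.split(sep).map(item => Array(item))

  // Hypothetical helper: flatten the array column to a plain string with
  // concat_ws, drop down to the RDD, and rebuild the nested arrays there.
  // Assumption: no item contains the separator itself.
  def toPrefixSpanInput(df: DataFrame, column: String, sep: String = " "): RDD[Array[Array[String]]] =
    df.select(concat_ws(sep, col(column)))
      .rdd
      .map(row => toSequence(row.getString(0), sep))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("prefixspan-sketch")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for: sql("select seq from sequences limit 1000")
    val df = Seq(
      Seq("a", "b", "c", "c", "c", "d"),
      Seq("b", "c", "d", "e"),
      Seq("a", "b")
    ).toDF("seq")

    val sequences = toPrefixSpanInput(df, "seq").cache()

    val model = new PrefixSpan()
      .setMinSupport(0.5)
      .setMaxPatternLength(5)
      .run(sequences)

    // Print each frequent sequence followed by its frequency.
    model.freqSequences.collect().foreach { fs =>
      println(fs.sequence.map(_.mkString("[", ", ", "]")).mkString("[", ", ", "]") + ", " + fs.freq)
    }

    spark.stop()
  }
}
```

For the text-file route the same `toSequence` helper should apply with `"\u0002"` as the separator, as long as the lines are mapped on an RDD (for example via `.rdd` on the Dataset) rather than passed to `run` as a Dataset.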