
I'm trying to use Spark's PrefixSpan algorithm but it is comically difficult to get the data in the right shape to feed to the algo. It feels like a Monty Python skit where the API is actively working to confuse the programmer.

My data is a list of rows, each of which contains a list of text items.

a b c c c d 
b c d e
a b
...

I have made this data available in two ways: a SQL table in Hive (where each row has an array of items) and text files where each line contains the items shown above.

The official example creates a Seq of Array(Array).
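For reference, the example in the MLlib docs builds its input roughly like this (integers rather than strings, but the shape is what matters):

import org.apache.spark.mllib.fpm.PrefixSpan

// Each sequence is an Array of itemsets; each itemset is itself an Array.
val sequences = sc.parallelize(Seq(
  Array(Array(1, 2), Array(3)),
  Array(Array(1), Array(3, 2), Array(1, 2)),
  Array(Array(1, 2), Array(5)),
  Array(Array(6))
), 2).cache()

val model = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
  .run(sequences)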

If I use sql, I get the following type back:

org.apache.spark.sql.DataFrame = [seq: array<string>]

If I read in text, I get this type:

org.apache.spark.sql.Dataset[Array[String]] = [value: array<string>]

Here is an example of an error I get (if I feed it data from sql):

error: overloaded method value run with alternatives:
  [Item, Itemset <: Iterable[Item], Sequence <: Iterable[Itemset]](data: org.apache.spark.api.java.JavaRDD[Sequence])org.apache.spark.mllib.fpm.PrefixSpanModel[Item] <and>
  [Item](data: org.apache.spark.rdd.RDD[Array[Array[Item]]])(implicit evidence$1: scala.reflect.ClassTag[Item])org.apache.spark.mllib.fpm.PrefixSpanModel[Item]
 cannot be applied to (org.apache.spark.sql.DataFrame)
       new PrefixSpan().setMinSupport(0.5).setMaxPatternLength(5).run( sql("select seq from sequences limit 1000") )
                                                                  ^

Here is the error I get if I feed it the text files:

error: overloaded method value run with alternatives:
  [Item, Itemset <: Iterable[Item], Sequence <: Iterable[Itemset]](data: org.apache.spark.api.java.JavaRDD[Sequence])org.apache.spark.mllib.fpm.PrefixSpanModel[Item] <and>
  [Item](data: org.apache.spark.rdd.RDD[Array[Array[Item]]])(implicit evidence$1: scala.reflect.ClassTag[Item])org.apache.spark.mllib.fpm.PrefixSpanModel[Item]
 cannot be applied to (org.apache.spark.sql.Dataset[Array[String]])
       new PrefixSpan().setMinSupport(0.5).setMaxPatternLength(5).run(textfiles.map( x => x.split("\u0002")).limit(3))
                                                                  ^

I've tried to mold the data with casts and other unnecessarily complicated logic, so far without success.

This can't be so hard. Given a list of items (in the very reasonable format described above), how the heck do I feed it to PrefixSpan?
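For concreteness, the second run overload above wants an RDD[Array[Array[String]]], so I assume the conversion has to look roughly like the sketch below (column name from my table; I'm not at all sure this is the intended way):

import org.apache.spark.mllib.fpm.PrefixSpan
import org.apache.spark.rdd.RDD

// Pull the array<string> column out of each Row and wrap every item in its
// own single-element itemset, giving one Array[Array[String]] per row.
val df = sql("select seq from sequences limit 1000")

val seqs: RDD[Array[Array[String]]] =
  df.rdd.map(row => row.getSeq[String](0).map(item => Array(item)).toArray)

val model = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
  .run(seqs)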

Edit: I'm on Spark 2.2.1.

Resolved: a column in the table I was querying held a collection in each cell, which caused the returned result to come back wrapped in a WrappedArray. I changed my query so the result column contained only a string (using concat_ws), which made the type error MUCH easier to deal with.
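Roughly what that resolved path looks like, as a sketch (separator and names simplified from my actual setup):

import org.apache.spark.mllib.fpm.PrefixSpan
import org.apache.spark.rdd.RDD

// concat_ws collapses the array column into a plain string on the SQL side;
// the string is then split back into items, and each item is wrapped in its
// own itemset, giving the RDD[Array[Array[String]]] that run() accepts.
val rows = sql("select concat_ws(' ', seq) as seq from sequences")

val seqs: RDD[Array[Array[String]]] =
  rows.rdd.map(_.getString(0).split(" ").map(item => Array(item)))

val model = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
  .run(seqs)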

Shahbaz
  • You are trying to mix the old (`RDD`) and new (`Dataset`) APIs. With `Dataset` you should use the ML API ([What's the difference between Spark ML and MLLIB packages](https://stackoverflow.com/q/38835829/10465355)). Additionally, the input should be `Array[Array[_]]`, not `Array[_]` - see the [ML docs](https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html#prefixspan) for example data. The [MLlib docs](https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html#prefixspan) explain the meaning of this representation. – 10465355 Jan 11 '19 at 17:28
  • Possible duplicate of [PrefixSpan sequence extraction misunderstanding](https://stackoverflow.com/questions/40593218/prefixspan-sequence-extraction-misunderstanding) – 10465355 Jan 11 '19 at 17:30
  • @user10465355 That "extraction misunderstanding" comment is for Python. The problem I'm having is getting the Scala types to match what PrefixSpan expects. – Shahbaz Jan 11 '19 at 18:29
