There are many ways to create a DataFrame once you have an RDD. One of them is the .toDF function, which requires sqlContext.implicits._ to be imported:
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder().appName("udf testings")
  .master("local")
  .config("", "")
  .getOrCreate()
val sc = sparkSession.sparkContext
val sqlContext = sparkSession.sqlContext
import sqlContext.implicits._
After that, read the FP-growth text file and convert it into an RDD:
import org.apache.spark.rdd.RDD

val data = sc.textFile("path to sample_fpgrowth.txt that you have used")
val transactions: RDD[Array[String]] = data.map(s => s.trim.split(' '))
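To make the parsing step concrete, here is a minimal plain-Scala sketch of what the map function does to a single line. The object name SplitSketch, the helper toTransaction, and the input line are hypothetical and used only for illustration.

```scala
object SplitSketch {
  // Turn one line of the input file into a transaction (an array of items),
  // using the same expression as the map above.
  def toTransaction(line: String): Array[String] = line.trim.split(' ')

  def main(args: Array[String]): Unit = {
    // Hypothetical input line, for illustration only
    val items = toTransaction(" r z h k p ")
    println(items.mkString(","))  // prints r,z,h,k,p
  }
}
```

Note that splitting on a single space keeps an empty string between two consecutive spaces, so if your file may contain repeated spaces, splitting on the regex "\\s+" is probably safer.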
I have used the code from the Frequent Pattern Mining - RDD-based API guide that is referenced in the question:
import org.apache.spark.mllib.fpm.FPGrowth

val fpg = new FPGrowth()
  .setMinSupport(0.2)
  .setNumPartitions(10)
val model = fpg.run(transactions)
The next step is to call .toDF. For the first DataFrame:
model.freqItemsets
  .map(itemset => (itemset.items.mkString("[", ",", "]"), itemset.freq))
  .toDF("items", "freq").show(false)
This results in:
+---------+----+
|items    |freq|
+---------+----+
|[z]      |5   |
|[x]      |4   |
|[x,z]    |3   |
|[y]      |3   |
|[y,x]    |3   |
|[y,x,z]  |3   |
|[y,z]    |3   |
|[r]      |3   |
|[r,x]    |2   |
|[r,z]    |2   |
|[s]      |3   |
|[s,y]    |2   |
|[s,y,x]  |2   |
|[s,y,x,z]|2   |
|[s,y,z]  |2   |
|[s,x]    |3   |
|[s,x,z]  |2   |
|[s,z]    |2   |
|[t]      |3   |
|[t,y]    |3   |
+---------+----+
only showing top 20 rows
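The bracketed values in the items column come from mkString("[", ",", "]"), which joins the items with commas and wraps them in brackets. A minimal plain-Scala sketch (the object name MkStringSketch and the helper render are hypothetical):

```scala
object MkStringSketch {
  // Same formatting call as in the map above: prefix, separator, suffix.
  def render(items: Array[String]): String = items.mkString("[", ",", "]")

  def main(args: Array[String]): Unit = {
    println(render(Array("y", "x", "z")))  // prints [y,x,z]
  }
}
```

Because of this call the items column is a plain string. If you would rather have an array column, mapping to (itemset.items, itemset.freq) without mkString should work with the same implicits import, although I have not shown that output here.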
For the second DataFrame:
val minConfidence = 0.8
model.generateAssociationRules(minConfidence)
  .map(rule => (rule.antecedent.mkString("[", ",", "]"), rule.consequent.mkString("[", ",", "]"), rule.confidence))
  .toDF("antecedent", "consequent", "confidence").show(false)
which results in:
+----------+----------+----------+
|antecedent|consequent|confidence|
+----------+----------+----------+
|[t,s,y]   |[x]       |1.0       |
|[t,s,y]   |[z]       |1.0       |
|[y,x,z]   |[t]       |1.0       |
|[y]       |[x]       |1.0       |
|[y]       |[z]       |1.0       |
|[y]       |[t]       |1.0       |
|[p]       |[r]       |1.0       |
|[p]       |[z]       |1.0       |
|[q,t,z]   |[y]       |1.0       |
|[q,t,z]   |[x]       |1.0       |
|[q,y]     |[x]       |1.0       |
|[q,y]     |[z]       |1.0       |
|[q,y]     |[t]       |1.0       |
|[t,s,x]   |[y]       |1.0       |
|[t,s,x]   |[z]       |1.0       |
|[q,t,y,z] |[x]       |1.0       |
|[q,t,x,z] |[y]       |1.0       |
|[q,x]     |[y]       |1.0       |
|[q,x]     |[t]       |1.0       |
|[q,x]     |[z]       |1.0       |
+----------+----------+----------+
only showing top 20 rows
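As one of the other ways to go from an RDD to a DataFrame, you can build Row objects and pass an explicit schema to createDataFrame instead of relying on .toDF. The following is only a sketch under assumptions: the object name, the helper freqItemsetsDF, the in-memory transactions, and the minimum support of 0.5 are all hypothetical and are not taken from the question.

```scala
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{ArrayType, LongType, StringType, StructField, StructType}

object FreqItemsetsWithSchema {
  // Explicit schema: items stays an array column instead of a formatted string.
  val schema: StructType = StructType(Seq(
    StructField("items", ArrayType(StringType, containsNull = false), nullable = false),
    StructField("freq", LongType, nullable = false)
  ))

  // Hypothetical helper: mine frequent itemsets and return them as a DataFrame.
  def freqItemsetsDF(spark: SparkSession,
                     transactions: RDD[Array[String]],
                     minSupport: Double): DataFrame = {
    val model = new FPGrowth().setMinSupport(minSupport).run(transactions)
    // Convert each FreqItemset into a Row that matches the schema above.
    val rows = model.freqItemsets.map(itemset => Row(itemset.items.toSeq, itemset.freq))
    spark.createDataFrame(rows, schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("freq itemsets with schema")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical in-memory transactions standing in for the text file
    val transactions = spark.sparkContext.parallelize(Seq(
      Array("a", "b", "c"),
      Array("a", "b"),
      Array("a", "c"),
      Array("a")
    ))

    val df = freqItemsetsDF(spark, transactions, minSupport = 0.5)
    df.printSchema()
    df.show(false)

    spark.stop()
  }
}
```

This variant does not need the implicits import, and it lets you control column types and nullability directly, at the cost of being more verbose than .toDF.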
I hope this is what you need.