Convert Scala FP-growth RDD output to a DataFrame

Question

https://spark.apache.org/docs/2.1.0/mllib-frequent-pattern-mining.html#fp-growth

sample_fpgrowth.txt can be found here, https://github.com/apache/spark/blob/master/data/mllib/sample_fpgrowth.txt

I ran the FP-growth example from the link above in Scala and it works fine, but I need to convert the results, which are RDDs, to DataFrames. Both of these are RDDs:

 model.freqItemsets and
 model.generateAssociationRules(minConfidence)

Please explain this in detail using the example given in my question.

Possible duplicate of [How to convert rdd object to dataframe in spark](https://stackoverflow.com/questions/29383578/how-to-convert-rdd-object-to-dataframe-in-spark) — stefanobaghino, May 30 '17 at 12:57
I tried that and got an error, possibly because I am new to Scala. Can you explain it in detail with the example given in my question? — Arun Gunalan, May 30 '17 at 13:14
@zero323 Can you help me by explaining with the example given in my question? — Arun Gunalan, May 31 '17 at 03:53
@ArunGunalan are you sure the link you provided has the example you want to be explained? — Ramesh Maharjan, May 31 '17 at 23:40
@Ramesh Maharjan, sorry, I had given the wrong link. I have edited it to the correct link, thanks. — Arun Gunalan, Jun 01 '17 at 04:29

score 3 · Accepted Answer · answered Jun 01 '17 at 12:21

There are many ways to create a DataFrame once you have an RDD. One of them is the .toDF function, which requires sqlContext.implicits to be imported, as follows:

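import org.apache.spark.sql.SparkSession

// Local session for testing; adjust appName and master for your environment.
val sparkSession = SparkSession.builder().appName("udf testings")
  .master("local")
  .getOrCreate()
val sc = sparkSession.sparkContext
val sqlContext = sparkSession.sqlContext
// Brings .toDF into scope for RDDs of tuples and case classes.
import sqlContext.implicits._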
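To illustrate what the implicits import provides, here is a minimal sketch of .toDF on a tiny RDD. It assumes Spark 2.x, where the same implicits are also available directly on the session object. The name `spark`, the app name, and the sample pairs are made up for illustration and are not part of the question's data.

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session; the app name is a placeholder.
val spark = SparkSession.builder()
  .appName("toDF sketch")
  .master("local[*]")
  .getOrCreate()

// In Spark 2.x the implicits can be imported from the session itself.
// Without this import, .toDF is not available on an RDD.
import spark.implicits._

// A made-up RDD of (String, Long) pairs, only to show the conversion.
val pairs = spark.sparkContext.parallelize(Seq(("a", 1L), ("b", 2L)))

// Column names are passed positionally to toDF.
val pairsDF = pairs.toDF("item", "count")

pairsDF.printSchema()
```

The conversion works here because tuples of supported types have an encoder. The FP-growth result classes are not tuples or case classes, which is why the answer maps them to tuples before calling .toDF.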

After that, read the FP-growth text file and convert it into an RDD of transactions:

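    import org.apache.spark.rdd.RDD

    // Each line of the file is one transaction; items are separated by spaces.
    val data = sc.textFile("path to sample_fpgrowth.txt that you have used")
    val transactions: RDD[Array[String]] = data.map(s => s.trim.split(' '))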

I have used the code from the Frequent Pattern Mining - RDD-based API page linked in the question to build the model:

import org.apache.spark.mllib.fpm.FPGrowth

val fpg = new FPGrowth()
  .setMinSupport(0.2)
  .setNumPartitions(10)
val model = fpg.run(transactions)

The next step is to map each result to a tuple and call .toDF on it.

For the first DataFrame:

model.freqItemsets
  .map(itemset => (itemset.items.mkString("[", ",", "]"), itemset.freq))
  .toDF("items", "freq").show(false)

This results in:

+---------+----+
|items    |freq|
+---------+----+
|[z]      |5   |
|[x]      |4   |
|[x,z]    |3   |
|[y]      |3   |
|[y,x]    |3   |
|[y,x,z]  |3   |
|[y,z]    |3   |
|[r]      |3   |
|[r,x]    |2   |
|[r,z]    |2   |
|[s]      |3   |
|[s,y]    |2   |
|[s,y,x]  |2   |
|[s,y,x,z]|2   |
|[s,y,z]  |2   |
|[s,x]    |3   |
|[s,x,z]  |2   |
|[s,z]    |2   |
|[t]      |3   |
|[t,y]    |3   |
+---------+----+
only showing top 20 rows
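The conversion above turns each itemset into a single string. If the items need to stay queryable, one alternative is to keep them as an array column. The following is a sketch under stated assumptions: the transactions are made-up stand-ins for sample_fpgrowth.txt, and the names `spark` and `itemsetsDF` are hypothetical.

```scala
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, size}

// Assumed local session; the app name is a placeholder.
val spark = SparkSession.builder()
  .appName("freqItemsets array column sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Made-up transactions standing in for sample_fpgrowth.txt.
val transactions: RDD[Array[String]] = spark.sparkContext.parallelize(Seq(
  Array("a", "b", "c"),
  Array("a", "b"),
  Array("a", "c"),
  Array("b", "c")
))

val model = new FPGrowth().setMinSupport(0.5).run(transactions)

// Map each FreqItemset to an (Array[String], Long) tuple.
// The items field then becomes an array<string> column instead of a string.
val itemsetsDF = model.freqItemsets
  .map(itemset => (itemset.items, itemset.freq))
  .toDF("items", "freq")

itemsetsDF.printSchema()

// Array columns work with built-in functions, e.g. keep only itemsets of size 2.
itemsetsDF.filter(size(col("items")) === 2).show(false)
```

Whether a string or an array column is preferable depends on what happens next: the string form is convenient for display, while the array form allows filtering and exploding on individual items.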

For the second DataFrame:

val minConfidence = 0.8
model.generateAssociationRules(minConfidence)
  .map(rule => (rule.antecedent.mkString("[", ",", "]"), rule.consequent.mkString("[", ",", "]"), rule.confidence))
  .toDF("antecedent", "consequent", "confidence").show(false)

This results in:

+----------+----------+----------+
|antecedent|consequent|confidence|
+----------+----------+----------+
|[t,s,y]   |[x]       |1.0       |
|[t,s,y]   |[z]       |1.0       |
|[y,x,z]   |[t]       |1.0       |
|[y]       |[x]       |1.0       |
|[y]       |[z]       |1.0       |
|[y]       |[t]       |1.0       |
|[p]       |[r]       |1.0       |
|[p]       |[z]       |1.0       |
|[q,t,z]   |[y]       |1.0       |
|[q,t,z]   |[x]       |1.0       |
|[q,y]     |[x]       |1.0       |
|[q,y]     |[z]       |1.0       |
|[q,y]     |[t]       |1.0       |
|[t,s,x]   |[y]       |1.0       |
|[t,s,x]   |[z]       |1.0       |
|[q,t,y,z] |[x]       |1.0       |
|[q,t,x,z] |[y]       |1.0       |
|[q,x]     |[y]       |1.0       |
|[q,x]     |[t]       |1.0       |
|[q,x]     |[z]       |1.0       |
+----------+----------+----------+
only showing top 20 rows
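As a variation on the tuple mapping, the rules can also be mapped to a case class, so the column names come from the field names rather than from the .toDF arguments. This is only a sketch: `RuleRow`, `spark`, and `rulesDF` are hypothetical names, and the transactions are made-up stand-ins for sample_fpgrowth.txt. In a compiled application (as opposed to spark-shell or a script) the case class should be declared at the top level, outside the method that uses it.

```scala
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical case class; its field names become the DataFrame column names.
case class RuleRow(antecedent: Seq[String], consequent: Seq[String], confidence: Double)

// Assumed local session; the app name is a placeholder.
val spark = SparkSession.builder()
  .appName("association rules case class sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Made-up transactions standing in for sample_fpgrowth.txt.
val transactions: RDD[Array[String]] = spark.sparkContext.parallelize(Seq(
  Array("a", "b", "c"),
  Array("a", "b"),
  Array("a", "c"),
  Array("b", "c")
))

val model = new FPGrowth().setMinSupport(0.5).run(transactions)

// Convert each Rule to a RuleRow; the arrays are kept as array<string> columns.
val rulesDF = model.generateAssociationRules(0.6)
  .map(rule => RuleRow(rule.antecedent.toSeq, rule.consequent.toSeq, rule.confidence))
  .toDF()

// Column references such as $"confidence" also come from the implicits import.
rulesDF.orderBy($"confidence".desc).show(false)
```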

I hope this is what you require.

My pleasure @ArunGunalan :) Glad that the answer helped you — Ramesh Maharjan, Jun 02 '17 at 04:52
