
I am using pyspark.ml.fpm.FPGrowth in Spark 2.4 and I have a question about how precisely transform works on transactions that are new.

My understanding is that model.transform will take each transaction X and find all Y such that Conf(X-->Y) > minConfidence. It will then return the list of such Y ordered by confidence.

However, suppose there is no transaction that contains X, so that Conf(X-->Y) is undefined for all Y. I am unsure how the algorithm will transform such a transaction.

This is a simple set of transactions taken from the docs:

from pyspark.ml.fpm import FPGrowth

DF = spark.createDataFrame([
    (0, [1, 2, 5]),
    (1, [1, 2, 3, 5]),
    (2, [1, 4])
], ["id", "items"])

fpGrowth = FPGrowth(itemsCol="items", minSupport=0, minConfidence=0)
model = fpGrowth.fit(DF)
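
For reference, the rules the model mined can be inspected directly (a quick sketch; the exact rule set and ordering may vary by Spark version):

# Show the association rules mined from DF; transform() applies
# these rules to new transactions.
model.associationRules.show(truncate=False)
# Columns: antecedent, consequent, confidence (plus lift in Spark 2.4)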

Then we supply a simple transaction as test data:

test_DF = spark.createDataFrame([
    (0, [4, 5])
], ["id", "items"])
model.transform(test_DF).show()

+---+------+----------+
| id| items|prediction|
+---+------+----------+
|  0|[4, 5]| [1, 3, 2]|
+---+------+----------+
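
To narrow down which rules could have fired here, the mined rules can be filtered to those whose antecedent mentions item 4 or item 5 (a sketch, assuming the model fitted above):

from pyspark.sql.functions import array_contains

# Keep only rules whose antecedent involves item 4 or item 5,
# since those are the only items in the test basket.
model.associationRules \
    .filter(array_contains("antecedent", 4) | array_contains("antecedent", 5)) \
    .show(truncate=False)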

Does anyone know how the prediction [1,3,2] was generated?

Nick

1 Answer


I think FPGrowthModel.transform applies the rules mined by FPGrowth to the transactions: whenever it finds an itemset X in a transaction and there is a rule (X => Y), it suggests the item Y in the prediction column for that transaction. But here is what I noticed: when a transaction contains both X and Y, it returns [] in the prediction column, unless there is a rule X & Y => Z, in which case it suggests Z instead. In other words, items already present in the transaction are never suggested. That makes it hard to evaluate the model with an accuracy metric :(
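
Here is a plain-Python sketch of that behaviour (my reading of it, not the actual Spark source): collect the mined rules and replay them against the test basket, keeping only consequent items that are not already in the basket.

# Replay the mined rules against the basket [4, 5] by hand.
rules = model.associationRules.collect()
basket = {4, 5}

prediction = set()
for rule in rules:
    # A rule fires when its whole antecedent is contained in the basket.
    if set(rule.antecedent).issubset(basket):
        # Suggest consequent items the basket does not already contain.
        prediction |= set(rule.consequent) - basket

print(prediction)  # {1, 2, 3} for this data, matching [1, 3, 2] above

With the toy data above, the rules 4 => 1, 5 => 1, 5 => 2 and 5 => 3 all fire for the basket [4, 5], which is where [1, 3, 2] comes from (I would not rely on the ordering).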