
Spark FPGrowth works well with millions of transactions (records) as long as the frequent itemsets contain fewer than 25 items. Beyond 25 items it runs into a computational limit (executor computing time keeps growing), and for 40+ items in a frequent itemset the process never returns.

To reproduce, we created a simple data set of 3 transactions with identical items (40 of them) and ran FPGrowth with 0.9 support; the process never completes. The run is in local mode with 4 cores, 32 GB of memory, and a very small input dataset.

Below is the sample data we used to narrow down the problem:

(Image of the sample data: 3 transactions, each containing the same 40 items.)
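
For completeness, here is a minimal Scala sketch of the setup described above; the column name, item names, and local session settings are my own placeholders, not taken from the original data:

```scala
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.SparkSession

object FPGrowthRepro {
  def main(args: Array[String]): Unit = {
    // Local mode with 4 cores, as in the setup described above
    val spark = SparkSession.builder()
      .appName("FPGrowthRepro")
      .master("local[4]")
      .getOrCreate()
    import spark.implicits._

    // 3 transactions, each containing the same 40 items
    val items = (1 to 40).map(i => s"item$i").toArray
    val df = Seq(items, items, items).toDF("items")

    val fpgrowth = new FPGrowth()
      .setItemsCol("items")
      .setMinSupport(0.9)

    // With 40 identical transactions' worth of items, every non-empty subset
    // is frequent at 0.9 support, so this step does not complete in practice
    val model = fpgrowth.fit(df)
    model.freqItemsets.show(false)

    spark.stop()
  }
}
```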

While the number of frequent itemsets grows as 2^n - 1 with the number n of items in the frequent itemset, it surely should be able to handle 40 or more items in a frequent itemset.
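
To make that growth concrete (my own arithmetic, not output from Spark): a frequent itemset of n items yields 2^n - 1 non-empty subsets, all of which are themselves frequent.

```scala
// Count of non-empty subsets of an n-item frequent itemset: 2^n - 1
Seq(25, 40).foreach { n =>
  val count = BigInt(2).pow(n) - 1
  println(s"n = $n -> $count itemsets")
}
// n = 25 -> 33554431 itemsets
// n = 40 -> 1099511627775 itemsets
```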

Is this an FPGrowth implementation limitation, or are there tuning parameters that I am missing? Thank you.
