Generate candidate itemsets based on Apriori algorithm

Question

I'm trying to implement Apriori Algorithm. For that, I need to generate itemsets of length k+1 from itemsets of length k (given as a dictionary L). The Apriori principle must be followed in generating the combinations. The principle states: A set of length k+1 can only be generated if ALL its subsets are present in the input, L.

I have a dictionary from which I need to generate itemsets.

My current attempt is this:

import itertools as it
def generateItemsets(Lk,k):

    comb = sum(Lk.keys(), tuple())
    Ck = set(it.combinations(comb, k))
    return Ck

But the function takes forever and get interrupted at the error : IOPub data rate exceeded.

Example-1:

Input (dictionary): {(150,): 2, (160,): 3, (170,): 3, (180,): 3}

Output (set): {(150, 160), (150, 170), (150, 180), (160, 170), (160, 180), (170, 180)}

Update-1

The dataset contains almost 16000 transactions. It looks like this:

[![Dataset][1]][1]

The unique items range from 0-999

As you can see, this function will be given an input L_k and it should output C_k+1. Input L_k is a dictionary like ({(301,350): 46, (966,970): 612, (310,350): 216, (548, 550): 457}) while the output C_k+1 should be a set (for example: {(250,350),(360,370),(380,390),...}

Have you already checked [this](https://github.com/tommyod/Efficient-Apriori) library ? — null, Feb 26 '21 at 14:25
@Ashar I was just adding the explanation you provided in the bounty comment into the question itself. Otherwise, the question could be closed as lacking details and clarity. — Sabito stands with Ukraine, Feb 26 '21 at 14:40
@KacperFloriański Added my current attempt and made it more clear — Ashar, Feb 26 '21 at 17:09
For Apriori Algorithms you need a database with items and a database with transactions. You did not provide any of those. Also your code attempt is fairly incomplete. To get solid answers you should provide more info. — areop-enap, Feb 26 '21 at 17:24
I assume, the input `k` of your function is the current ongoing transaction and looks similar to one of the keys of `Lk` like `(301,350)` ? — areop-enap, Feb 27 '21 at 10:34
I also assume, the key values are the item numbers and the value they are pointing at is the transaction number ? When I pass one of your two example dictionaries to your function I get `TypeError: 'tuple' object cannot be interpreted as an integer` error. Seemingly I can't reproduce the error you get with your code. What Python version are you using ? — areop-enap, Feb 27 '21 at 10:43

score 0 · Answer 1 · answered Mar 07 '21 at 02:57

I am not sure what input you exactly want, because it is somehow unclear how the lists you posted are fitting the definition of input for an Apriori algorithm.
The input should be a list of transactions, an item within these transactions and a number which represents the count of certain items that come together with the specified item within the same transaction.
The output is a list of items which have been sold together with the specified item the wanted number of times.
There is a couple of libraries for this kind of problem. The user null already pointed a good one out: https://github.com/tommyod/Efficient-Apriori. There is also Apyori: https://github.com/ymoch/apyori.
Here is a simple attempt of solving the Apriori algorithm. It can be copied to a file and executed with Python:

# list of transactions
sales = [
  ('eggs', 'bacon', 'soup'),
  ('eggs', 'bacon', 'apple'),
  ('soup', 'bacon', 'banana'),
]

# generate match dictionary of type {item: {count: {item, ...}, ...}, ...}
matches = {
  i: { 
    sum((i in z and j in z) for z in sales): set(
      k for t in sales for k in t
      if i!=k and
      sum((i in z and j in z) for z in sales) == sum((i in z and k in z) for z in sales)
    )
    for t in sales for j in t if i in t and j!=i
  }
  for t in sales for i in t
}

#print ( "match counts: %s\n" % (matches) )

print ( "best match(es) for eggs:", matches['eggs'][len(matches['eggs'])] )
# output: {'bacon'}
print ( "best match(es) for bacon:", matches['bacon'][len(matches['bacon'])] )
# output: {'eggs', 'soup'}

basket = ('soup', 'apple', 'banana') # consumer basket

# calculate a list of best matches for new sales
best = set(sum([ list(matches[i][len(matches[i])]) for i in basket ], [])) - set(basket)

print ( "basket: %s, best matches: %s" % ( basket, best ) )
# output: {'bacon', 'eggs'}

The above code generates a dictionary of items with a list of certain counts for certain items within transactions which contain both items. The generation of this dictionary might be slow for huge transaction lists. But you don't have to calculate this for every new transaction. Instead I would recalculate the match count every now and then per day.
The item names can be replaced with item indices to address item datasets. In this example the strings are more clear than just numbers.
In general turning slow functions to nested dictionaries of datasets is a good idea to speed up code. A slow function of type:

result = function ( parameter, parameter, ... )

Can be turned to a nested dictionary and a function that recalculates the dictionary after a longer period of time:

if time < refresh:
  dictionary = precalc ( )
  refresh = time + rate
...
result = dictionary [ parameter ] [ parameter ] [ ... ]

This solution requires more memory of course.

In order to get solid answers you should not downvote posts but instead provide a bigger chunk of code that can be copied to a file and executes. You also should offer clear input values of your Function. What is Lk and what is k ? Accordingly to your question I assumed the following program which does not output the error you posted:

import itertools as it
def generateItemsets(Lk,k):

    comb = sum(Lk.keys(), tuple())
    Ck = set(it.combinations(comb, k))
    return Ck

# input of apriori algorithm should be a list of transactions, wtf is this ?!
Lk = {(150,): 2, (160,): 3, (170,): 3, (180,): 3}
missing_input_value = 1234567890

print ( generateItemsets ( Lk, missing_input_value ) )
# output: set()

for i in range(0,999999):
  generateItemsets ( Lk, i )   # does not error out

So you either messed up your Python version or I misunderstood your question or the input you provided does not cover the faulty situation of your program.
I would recommend you update your question with a bigger piece of code and not just three lines of function without any working input.
When you are using a Jupyter notebook, the error you get might have to do something with your output data rate. Try executing jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10 in console, which is from this post: How to solve "IOPub data rate exceeded." in Jupyter Notebook
or this video: https://www.youtube.com/watch?v=B_YlLf6fa5A

Generate candidate itemsets based on Apriori algorithm

1 Answers1