Count frequency of itemsets in the given data frame

Question

I have the following data frame:

import pandas as pd

data = pd.read_csv('sample.csv', sep=',')

I need to find the frequency of each itemset in a given set of itemsets. For example:

itemsets = {(143, 157), (143, 166), (175, 178), (175, 190)}

For each tuple, I need to count how often it occurs in the data frame (I'm trying to implement the Apriori algorithm). I'm particularly having trouble addressing the tuples individually and searching the data for a whole tuple rather than for individual entries.

Update-1

For example, the data frame looks like this:

39, 120, 124, 205, 401, 581, 704, 814, 825, 834
35, 39, 205, 712, 733, 759, 854, 950
39, 422, 449, 704, 825, 857, 895, 937, 954, 964

Update-2

The function should increment the count for a tuple only if all the values in that tuple are present in a particular row. For example, if I search for (39, 205), it should return a frequency of 2, because two of the rows (the first and second) include both 39 and 205.
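
To make the rule concrete, here is a minimal sketch on the three sample rows. It uses plain Python lists rather than the real data frame, and `row_contains` is a hypothetical helper name, not something from my actual code:

```python
# The three sample rows from Update-1, as plain lists.
rows = [
    [39, 120, 124, 205, 401, 581, 704, 814, 825, 834],
    [35, 39, 205, 712, 733, 759, 854, 950],
    [39, 422, 449, 704, 825, 857, 895, 937, 954, 964],
]


def row_contains(row, itemset):
    """Return True only if every value of the itemset occurs in the row."""
    return set(itemset) <= set(row)  # subset test


matches = [row_contains(row, (39, 205)) for row in rows]
print(matches)       # [True, True, False]
print(sum(matches))  # 2
```

Only the first two rows contain both 39 and 205, so the expected frequency is 2.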

No it is a simple data frame. I added an example to make it clear. Kindly have a look. — Ashar, Feb 20 '21 at 18:56
Is it like searching the occurrence of the two items of the tuple in the given data frame? — Comsavvy, Feb 20 '21 at 19:00
@Comsavvy Yes I want to search and return the total occurrences of all tuples present in the set in the given data frame. — Ashar, Feb 20 '21 at 19:02
Let me work on a function for that, I will get back to you. I formatted the question now, kindly accept it. — Comsavvy, Feb 20 '21 at 19:04
I can't test right now so I'm gonna leave this as a comment. `{items: sum(1 for row in map(set, df.itertuples(name=None)) if all(val in row for val in items)) for items in itemsets}` — Roy Cohen, Feb 20 '21 at 19:19
@Ashar I just answered the question a minute ago using a `count()` function. Check it out, and if it works, give me an upvote and ✅ — Comsavvy, Feb 20 '21 at 19:44
@Ashar how is the solution? Does it answer your question? There seems to be a lot of misunderstanding about it. Check the edit I made, and let me know what you think. — Comsavvy, Feb 20 '21 at 21:03

Comsavvy · Accepted Answer · 2021-02-22T14:37:11.573

1

This function returns a dictionary that maps each tuple to the number of rows of the data frame in which all of its items occur.

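from collections import defaultdict

def count(df, sequence):
    # Maps each itemset (tuple) to the number of rows containing all of its items.
    dict_data = defaultdict(int)
    shape = df.shape[0]  # number of rows
    for items in sequence:
        for row in range(shape):
            # all(...) is True only if every item occurs in this row; True adds 1, False adds 0.
            dict_data[items] += all([item in df.iloc[row, :].values for item in items])
    return dict_data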

Pass the data frame and the set of itemsets to count(), and it returns the number of rows that contain each tuple. For example, with itemsets = {(39, 205)}:

>>> count(data, itemsets)
defaultdict(<class 'int'>, {(39, 205): 2})

You can convert the defaultdict to a plain dictionary by calling dict():

>>> dict(count(data, itemsets))
{(39, 205): 2}

Both objects hold the same counts; the only difference is that the defaultdict returns 0 for a missing key instead of raising a KeyError.
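
As a quick check, here is a self-contained sketch that runs the function on the three sample rows from the question. The data frame is built from a list of lists instead of `sample.csv` (an assumption made so the example runs on its own), which pads the shorter row with NaN:

```python
from collections import defaultdict

import pandas as pd


def count(df, sequence):
    dict_data = defaultdict(int)
    for items in sequence:
        for row in range(df.shape[0]):
            dict_data[items] += all([item in df.iloc[row, :].values for item in items])
    return dict_data


# The three sample rows from the question; the shorter row is padded with NaN.
data = pd.DataFrame([
    [39, 120, 124, 205, 401, 581, 704, 814, 825, 834],
    [35, 39, 205, 712, 733, 759, 854, 950],
    [39, 422, 449, 704, 825, 857, 895, 937, 954, 964],
])

result = dict(count(data, {(39, 205), (39,), (143, 157)}))
print(result[(39, 205)])   # 2
print(result[(39,)])       # 3
print(result[(143, 157)])  # 0
```

An itemset that never matches still gets a key with the value 0. A one-item itemset must be written with a trailing comma, `(39,)`; `(39)` is just the integer 39 and cannot be iterated.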

edited Feb 22 '21 at 14:37

answered Feb 20 '21 at 19:28


When running this with the supplied sample input, the frequency came out as `5` instead of `2`. – Roy Cohen Feb 20 '21 at 19:49
According to the question, we are counting the occurrences of the two items in the data frame. Check the data carefully: 39 appears 3 times while 205 appears 2 times. – Comsavvy Feb 20 '21 at 19:59
@RoyCohen You can check out my discussion with Ashar in the question. – Comsavvy Feb 20 '21 at 20:02
The question clearly states *"If I search for `(39, 205)`, it should return the frequency of 2."*. Your answer returns 5. – Roy Cohen Feb 20 '21 at 20:04
I assume it to be a mistake by @Ashar – Comsavvy Feb 20 '21 at 20:06
I think there is some misunderstanding about what the question is about, I'll edit my answer to include what I think the question is. – Roy Cohen Feb 20 '21 at 20:08
@Comsavvy Thanks, but as pointed out by Roy Cohen, there is no mistake in the question. A search for (39, 205) should return 2 because the increment should be done only if both items are present in a row. – Ashar Feb 21 '21 at 08:03
You just clarify the question now! @Ashar – Comsavvy Feb 21 '21 at 08:07
@Comsavvy Sure. I'll wait. – Ashar Feb 21 '21 at 08:09
@Ashar Done! Changes have been made to the function. Check it out! And don't forget our earlier discussion in the question section. – Comsavvy Feb 21 '21 at 11:07
Did you test this with more than one item set? I think it's only going to work for the first item set because the line `count = 0` is outside the for loop. If I'm correct, and that line should be moved inside the for loop, it'll be even better to replace the while loop with a for loop `for count in range(shape):`. Additional note: the name `count` doesn't really make sense, I think it should be renamed `row_index`, and overall there is a lack of explanation in your answer. P.S. I liked your use of `defaultdict`. – Roy Cohen Feb 21 '21 at 14:05
The `while loop` is the best for this case; one will not always be using a `for loop`. What I want you to know is that not everybody likes to read a whole bunch of explanation without implementation. I will say experience and practice are the best for me. Did you check the condition in the while loop? As soon as I increment `count` by 1, the row will also be incremented by one. My convention for naming it is `count`. And you can test the function; if there is any problem, please let me know. – Comsavvy Feb 21 '21 at 15:10
Thanks for all the suggestion! @RoyCohen – Comsavvy Feb 21 '21 at 15:42
So, correct me if I'm wrong, the values in the `count` variable will be `0`, `1`, ..., `shape-1`. That's exactly what `range` is for, and `for count in range(shape)` will be instantly recognizable by any python programmer to mean "`count`'s values will be `0`, `1`, ..., `shape-1`". But in the while version this logic is split across several lines and it's less obvious. – Roy Cohen Feb 21 '21 at 23:59
For that case, you are iterating a sequence. But I don't need to do that to get the work done. It's not a must I use `for loop` always, both of them are useful for different purposes. For this case I will always use `while loop`, you should also value it. Check the time of their execution and compare both you will see the difference. – Comsavvy Feb 22 '21 at 00:44
I'm not saying you should never use a while loop, but in this specific case a for loop is better. In general, a while loop is more useful when you don't know in advance the number of times you need to iterate. Using a while loop splits the logic across several lines, and it's less common in python, so python programmers won't recognize it immediately. Regarding execution time, a simple test reveals that using a for loop is faster. I tested this using `timeit('for i in range(100): pass')` which returned 10.11, compared to `timeit('i = 0\nwhile i < 100: i += 1')` which returned 35.89. – Roy Cohen Feb 22 '21 at 01:34
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/229043/discussion-between-comsavvy-and-roy-cohen). – Comsavvy Feb 22 '21 at 14:16
@Comsavvy The code works fine if length of tuple is greater than 1. Is it possible to make it execute when tuple contains just a single element? – Ashar Feb 22 '21 at 14:35
@Ashar If the tuple contains only one element, it has to end with a comma (,), i.e. `(49, )` – Comsavvy Feb 22 '21 at 14:40
@Comsavvy Got it. Thanks – Ashar Feb 22 '21 at 14:42
Give me an upvote, you've forgotten that. @Ashar – Comsavvy Feb 22 '21 at 14:46
@Comsavvy I accepted your solution. Anyways, I upvoted it as well. – Ashar Feb 22 '21 at 14:51
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/229045/discussion-between-comsavvy-and-ashar). – Comsavvy Feb 22 '21 at 14:55

Alex Metsai · Answer 2 · 2021-02-20T23:59:15.540

0

import pandas as pd

itemsets = {(39, 205), (39, 205, 401), (143, 157), (143, 166), (175, 178), (175, 190)}

x = [[39,120,124,205,401,581,704,814,825,834],
[35,39,205,712,733,759,854,950],
[39,422,449,704,825,857,895,937,954,964]]

data = pd.DataFrame(x)

for itemset in itemsets:
    print(itemset)
    count = 0
    for i in range(len(data)):
        flag = True
        for item in itemset:
            if item not in data.loc[i].value_counts():
                flag = False
        if flag:
            count += 1
    print(count)

Edited to handle itemsets of arbitrary length, as suggested in the comments (many thanks for the useful insights).
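
A note on the membership test used above: `in` applied to a pandas Series checks the index labels, not the values, which is why the code goes through `value_counts()` (it moves the values into the index). A minimal sketch, using a small made-up Series for illustration:

```python
import pandas as pd

# A small illustrative row; its index labels are 0, 1, 2.
row = pd.Series([39, 120, 205])

print(39 in row)                 # False: 39 is not an index label
print(1 in row)                  # True: 1 is an index label
print(39 in row.value_counts())  # True: value_counts() puts the values in the index
print(39 in row.values)          # True: membership on the underlying array
```

Testing against `row.values` (or a `set` built from the row) gives the same answer without computing the counts.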

edited Feb 20 '21 at 23:59

answered Feb 20 '21 at 19:20


This only works if the length of each item set is 2, and making it work with, for example, 10 will be very difficult. Also, instead of `item0, item1 = item[0], item[1]`, as long as `len(item) == 2`, you can use `item0, item1 = item`. – Roy Cohen Feb 20 '21 at 19:52
From the question I took as granted that the length of each item is always 2. In any case, thanks for the suggestion. I can edit my code to make it work for an abstract length of itemsets (and maybe I will, else I will delete it) but I think that your answer is more complete, I will upvote it. – Alex Metsai Feb 20 '21 at 19:59
I updated the code, thanks for your suggestion! Take a look if you don't mind! Also, I up voted your answer, I do think it's better. ^^ – Alex Metsai Feb 20 '21 at 20:12
I have another suggestion (sorry for being a nag), after setting `flag` to `False` there is no way it can return to `True` again, so you can break out of the loop. In fact, you don't need the `flag` at all! You can use an else clause on a for loop to mean "no break", since a lot of python programmers are not aware of this, be sure to leave a comment explaining what is going on. – Roy Cohen Feb 20 '21 at 20:19
Additional information about `for ... else` can be found in [this answer](https://stackoverflow.com/a/23748240/14160477) – Roy Cohen Feb 20 '21 at 20:23
You're not being a nag, this is really constructive feedback. The concept of "no break" in a loop is really interesting, but I'm a little bit confused regarding how I'm going to use it in my case, I haven't figured it out yet. – Alex Metsai Feb 20 '21 at 20:34

Roy Cohen · Answer 3 · 2021-03-03T05:00:31.640

First of all, since there's some misunderstanding about what the question is, this answer answers the question "How to count the number of rows in which every item in the item set appears at least once?".

For each row in the data frame, we can decide whether it counts toward the frequency using

all(item in row for item in items)

where items is an item set, for example, (39, 205).

We can iterate over all the rows using DataFrame.itertuples, so for every item set items, its frequency is

sum(1 for row in map(set, df.itertuples(index=False, name=None)) if all(item in row for item in items))

(We use map(set, ...) to turn the tuples into sets; this is not required, but it makes each membership test faster. Passing index=False keeps the row label out of the tuple, so it cannot be mistaken for an item.)

Finally, we iterate over all the item sets in itemsets and store the result in a dictionary where the keys are the item sets and the values are the frequencies:

{items: sum(1 for row in map(set, df.itertuples(index=False, name=None)) if all(item in row for item in items)) for items in itemsets}

For the case you supplied, the output is `{(39, 205): 2}`.

If you don't like the one-line version, you can expand the algorithm into several lines like so:

d = {}  # output dictionary
for items in itemsets:
    frequency = 0
    for row in df.itertuples(index=False, name=None):  # index=False: skip the row label
        row = set(row)  # done for efficiency
        for item in items:
            if item not in row:
                break
        else:  # no break: every item was found in this row
            frequency += 1
    d[items] = frequency

Additional information about for ... else can be found in [this answer](https://stackoverflow.com/a/23748240/14160477).
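
For completeness, here is a self-contained sketch of the approach on the three sample rows. The helper name `itemset_frequencies` is hypothetical, the data frame is built from a list of lists rather than read from a file, and the row sets are built once up front instead of once per item set:

```python
import pandas as pd

# The three sample rows from the question; the shorter row is padded with NaN.
df = pd.DataFrame([
    [39, 120, 124, 205, 401, 581, 704, 814, 825, 834],
    [35, 39, 205, 712, 733, 759, 854, 950],
    [39, 422, 449, 704, 825, 857, 895, 937, 954, 964],
])


def itemset_frequencies(df, itemsets):
    """Count, for each item set, the rows that contain all of its items."""
    # index=False keeps the row labels (0, 1, 2) out of the row sets.
    row_sets = [set(row) for row in df.itertuples(index=False, name=None)]
    return {
        items: sum(1 for row in row_sets if all(item in row for item in items))
        for items in itemsets
    }


frequencies = itemset_frequencies(df, {(39, 205), (39,), (143, 157)})
print(frequencies[(39, 205)])   # 2
print(frequencies[(39,)])       # 3
print(frequencies[(143, 157)])  # 0
```

If the index were left in the row tuples, an item set such as `(0, 39)` would match the first row through its label 0, even though 0 is not one of its items.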

I will appreciate it if you can check the final modification I made to the solution given by me. — Comsavvy, Feb 21 '21 at 11:34
