The very fast way to find repeating combinations in Python using pandas?

Question

I have this "DrawsDB.csv" sample file as input:

Day,Hour,N1,N2,N3,N4,N5,N6,N7,N8,N9,N10,N11,N12,N13,N14,N15,N16,N17,N18,N19,N20
1996-03-18,15:00,4,9,10,16,21,22,23,26,27,34,35,41,42,48,62,66,68,73,76,78
1996-03-19,15:00,6,12,15,19,28,33,35,39,44,48,49,59,62,63,64,67,69,71,75,77
1996-03-21,15:00,2,4,6,7,15,16,17,19,20,26,28,45,48,52,54,69,72,73,75,77
1996-03-22,15:00,3,8,15,17,19,25,30,33,34,35,36,38,44,49,60,61,64,67,68,75
1996-03-25,15:00,2,10,11,14,18,22,26,27,29,30,42,44,45,55,60,61,66,67,75,79
2022-01-01,15:00,1,9,12,17,33,34,36,37,38,44,45,46,53,56,58,60,62,63,70,72
2022-01-01,22:50,1,3,4,14,19,22,24,27,32,33,35,36,44,48,53,55,69,70,76,78
2022-01-02,15:00,13,15,16,19,22,24,31,37,38,43,47,58,64,66,70,72,73,75,76,78
2022-01-02,22:50,5,10,11,14,16,28,29,36,41,53,54,56,58,59,61,67,68,71,73,77
2022-01-03,15:00,8,9,10,11,15,20,21,22,26,30,35,36,39,42,52,58,63,64,73,80
2022-01-03,22:50,4,9,17,21,22,32,33,34,36,37,38,41,48,49,50,60,64,69,70,75
2022-01-04,15:00,4,5,7,9,11,16,17,21,22,25,30,37,38,39,44,49,52,60,65,78
2022-01-04,22:50,17,18,22,26,27,30,31,40,43,49,55,62,63,64,65,71,72,73,76,80
2022-01-05,15:00,1,5,8,14,15,20,23,25,26,33,34,35,37,47,54,59,67,70,72,76
2022-01-05,22:50,6,7,14,15,16,18,26,37,39,41,45,51,52,54,55,59,61,70,71,80
2022-01-06,15:00,9,10,11,17,28,30,32,41,42,44,45,49,50,51,55,65,67,72,76,78
2022-01-06,22:50,1,2,6,9,11,15,21,26,31,37,40,43,47,51,52,54,67,68,73,75

This is just a sample. The real csv file is more than 50.000 rows in total. N1 to N20 columns contains random values, non repeating across the same row, which means they are not duplicate. And they are sorted from smallest one (N1) to the biggest one (N20).

I want to get repeating combos (e.g. of 5 numbers let's say) across all rows from the DataFrame from columns N1 to N20.

So, for the entire .csv file posted above the output should be:

(6, 15, 26, 52, 54) 3
(17, 33, 34, 36, 38) 3
(17, 33, 34, 36, 60) 3
(17, 33, 34, 38, 60) 3
(17, 33, 36, 38, 60) 3
(17, 34, 36, 38, 60) 3
(33, 34, 36, 38, 60) 3
...

This is the full ouput which I'm not posting here because of text size limitations:

https://pastebin.com/4EVXXSn1

Please check it out.

Sorry for making such long output, I tried to create a shorter one but didn't succeed in getting representative combos for it.

This is the Python code I wrote to accomplish what I need: (please read its commented lines too)

import pandas as pd
from itertools import combinations
from collections import Counter

df = pd.read_csv("DrawsDB.csv")
# looping through db using method found here:
# https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas
df = df.reset_index()  # make sure indexes pair with number of rows
draws = []
# please read this: https://stackoverflow.com/a/55557758/7710871   (Conclusion:iter is very slow)
for index, row in df.iterrows():
    draws.append(
        [row['N1'], row['N2'], row['N3'], row['N4'], row['N5'], row['N6'], row['N7'], row['N8'], row['N9'], row['N10'],
         row['N11'], row['N12'], row['N13'], row['N14'], row['N15'], row['N16'], row['N17'], row['N18'], row['N19'],
         row['N20']])

# comparing to each other in order to check for repeating combos:
repeating_combos = []
for i in range(len(draws)):
    for j in draws[i + 1:]:
        repeating_combos.append(sorted(list(set(draws[i]).intersection(j))))

# e.g. getting any repeating combo of 5 across all rows:
combos_of_5 = []
for each in repeating_combos:
    if len(each) == 5:
        combos_of_5.append(tuple(each))
        # print(each)
    elif len(each) > 5:
        # e.g. a repeating sequence of 6 numbers means in fact 6 combos taken by 5 numbers in this case.
        # e.g. a repeating sequence of 7 numbers means in fact 21 combos of 5 numbers and so on.
        # Combinations(k, n)
        for cmb in combinations(each, 5):
            combos_of_5.append(tuple(sorted(list(set(cmb)))))

# count how many times each combo appear:
x = Counter(combos_of_5)

sorted_x = dict(sorted(x.items(), key=lambda item: item[1], reverse=True))
for k, v in sorted_x.items():
    print(k, v)

It works very well, as expected but there is one single problem: for a bigger DataFrame it takes a lot of time to do its job done. More than that, if you want to get repeating combinations with more than 5 numbers (let's say with 6, 7, 8 or 9 numbers) it will take for ever to run.

How to do it in full pandas in a very fast and much more smarter way than I did?

Also, please note that it does not generate every combo in the first instance and after that start looking for each of those combos into DataFrame because it will take even longer.

Thank you very much in advance!

P.S. What if the numbers from N1 to N20 were not sorted? Will this make any difference?

I read this topic and many others already but none is asking for the same thing so I think it is not duplicate and this could help many other have the same or very similar problem.

"I want to get repeating combos (e.g. of 5 numbers let's say) across all rows from the DataFrame from columns N1 to N20." I can't understand what you mean by "repeating combos", and I don't see why the expected output you show is correct. Please try to explain the intended logic, step by step, in plain English. — Karl Knechtel, May 30 '22 at 08:51
@KarlKnechtel Sorry for my English, I am not a native speaker. By repeating combos I mean numbers that each row has in common with any other row by taking care only about columns from N1 to N20. e.g. from my output: (6, 15, 26, 52, 54) 3 it meanse 6,15,26,52 and 54 are found exactly in 3 rows in the entire DataFrame. So, if you check it out, you will find that draws from 1996-03-21,15:00, 2022-01-05,22:50 and 2022-01-06,22:50 contain the numbers 6,15,26,52 and 54 and they are exactly 3 draws (rows) in total. That is why the output is: (6, 15, 26, 52, 54) 3. Same for the others from the output. — YoYoYo, May 30 '22 at 09:01
Okay, so. For each row, you are looking for each way to choose 5 of the numbers, and then building a histogram? — Karl Knechtel, May 30 '22 at 09:04
One issue I see, because of how `repeating_combos` is used: let's say there are four rows that have a matching set of five numbers, and everything else has no overlap. We look at each pair of rows to find overlaps... there will be a total of six such pairs. On the other hand, unique combinations never get counted at all. — Karl Knechtel, May 30 '22 at 09:10
@KarlKnechtel Look at this image I posted here: https://i.imgur.com/swyEo8W.png As you can see just three rows are good and counted (3 in total) because they have ALL 5 numbers. More clearly, if you do in python list(set(row1).intersection(row2)) you will get a list with the length of exactly 5 numbers. And they are counted because they have 5 numbers in common. Check my python code, if they have more than 5 numbers in common and I am looking for repeating combos of 5 then it means they have more combination of 5 numbers in common to count. — YoYoYo, May 30 '22 at 09:15
If I was looking for repeating combinations of 6 numbers then these rows doesn't count because they have in common only 5 numbers and not 6. And so on. — YoYoYo, May 30 '22 at 09:16
@KarlKnechtel Please read this topic, this is what I am trying to accomplish and I explained that there much better: https://stackoverflow.com/questions/72460392/c-program-needs-to-much-memory-to-run-and-crashes-how-to-avoid-this — YoYoYo, Jun 01 '22 at 13:40

Drakax · Answer 1 · 2022-05-30T14:50:24.510

2

Proof of work:

Given this part of your dataframe:

index	Day	Hour	N1	N2	N3	N4	N5	N6	N7	N8	N9	N10	N11	N12	N13	N14	N15	N16	N17	N18
0	1996-03-18	15:00	4	9	10	16	21	22	23	26	27	34	35	41	42	48	62	66	68	73
1	1996-03-19	15:00	6	12	15	19	28	33	35	39	44	48	49	59	62	63	64	67	69	71
2	1996-03-21	15:00	2	4	6	7	15	16	17	19	20	26	28	45	48	52	54	69	72	73
3	1996-03-22	15:00	3	8	15	17	19	25	30	33	34	35	36	38	44	49	60	61	64	67

You can update your code with something similar to the one below:

check = [6,15]
df['check'] = df.iloc[:,2:].apply(lambda r: all(s in r.values for s in check), axis=1)
true_count = df.check.sum()
print(f'The following numbers {check} appear {true_count} time(s) in the dataframe.')

Result:

The following numbers [6, 15] appear 2 time(s) in the dataframe.

edited May 30 '22 at 14:50

answered May 30 '22 at 14:24

Drakax

1,305
3
9

1

This is just for searching for a given combination of numbers but how to get repeating numbers by comparing rows against each other in order to get combos and then counting for them? – YoYoYo May 30 '22 at 19:30
1

Please read this topic, this is what I need to do. Maybe it is possible to do it using pandas? https://stackoverflow.com/questions/72460392/c-program-needs-to-much-memory-to-run-and-crashes-how-to-avoid-this – YoYoYo Jun 01 '22 at 13:41
1

I still don't understand, given your desired output in your question and your results following this link: https://pastebin.com/4EVXXSn1 Those numbers aren't consecutive numbers then what are you looking for? – Drakax Jun 01 '22 at 14:03
1

It is not looking for consecutive numbers. It is looking for numbers that repeat from a draw to another. After that you choose from all that numbers only numbers groups that are the same given length and then count each of them. Just this. – YoYoYo Jun 01 '22 at 16:51
1

Look for example this image: https://i.stack.imgur.com/OtirY.png. As you can see those highlighted numbers are count because that group of numbers is the only one that repeat from row to another. No matter if they are consecutive or not. Finally it output: 23, 24, 25, 26, 26 and next to it showing 4 because all of them have been found together 4 times. And it just found them by comparing rows to each other without the need to input or to generate them some way. – YoYoYo Jun 01 '22 at 16:56

The very fast way to find repeating combinations in Python using pandas?

1 Answers1

Linked