I have this "DrawsDB.csv" sample file as input:
Day,Hour,N1,N2,N3,N4,N5,N6,N7,N8,N9,N10,N11,N12,N13,N14,N15,N16,N17,N18,N19,N20
1996-03-18,15:00,4,9,10,16,21,22,23,26,27,34,35,41,42,48,62,66,68,73,76,78
1996-03-19,15:00,6,12,15,19,28,33,35,39,44,48,49,59,62,63,64,67,69,71,75,77
1996-03-21,15:00,2,4,6,7,15,16,17,19,20,26,28,45,48,52,54,69,72,73,75,77
1996-03-22,15:00,3,8,15,17,19,25,30,33,34,35,36,38,44,49,60,61,64,67,68,75
1996-03-25,15:00,2,10,11,14,18,22,26,27,29,30,42,44,45,55,60,61,66,67,75,79
2022-01-01,15:00,1,9,12,17,33,34,36,37,38,44,45,46,53,56,58,60,62,63,70,72
2022-01-01,22:50,1,3,4,14,19,22,24,27,32,33,35,36,44,48,53,55,69,70,76,78
2022-01-02,15:00,13,15,16,19,22,24,31,37,38,43,47,58,64,66,70,72,73,75,76,78
2022-01-02,22:50,5,10,11,14,16,28,29,36,41,53,54,56,58,59,61,67,68,71,73,77
2022-01-03,15:00,8,9,10,11,15,20,21,22,26,30,35,36,39,42,52,58,63,64,73,80
2022-01-03,22:50,4,9,17,21,22,32,33,34,36,37,38,41,48,49,50,60,64,69,70,75
2022-01-04,15:00,4,5,7,9,11,16,17,21,22,25,30,37,38,39,44,49,52,60,65,78
2022-01-04,22:50,17,18,22,26,27,30,31,40,43,49,55,62,63,64,65,71,72,73,76,80
2022-01-05,15:00,1,5,8,14,15,20,23,25,26,33,34,35,37,47,54,59,67,70,72,76
2022-01-05,22:50,6,7,14,15,16,18,26,37,39,41,45,51,52,54,55,59,61,70,71,80
2022-01-06,15:00,9,10,11,17,28,30,32,41,42,44,45,49,50,51,55,65,67,72,76,78
2022-01-06,22:50,1,2,6,9,11,15,21,26,31,37,40,43,47,51,52,54,67,68,73,75
This is just a sample. The real csv file is more than 50.000 rows in total. N1 to N20 columns contains random values, non repeating across the same row, which means they are not duplicate. And they are sorted from smallest one (N1) to the biggest one (N20).
I want to get repeating combos (e.g. of 5 numbers let's say) across all rows from the DataFrame from columns N1 to N20.
So, for the entire .csv file posted above the output should be:
(6, 15, 26, 52, 54) 3
(17, 33, 34, 36, 38) 3
(17, 33, 34, 36, 60) 3
(17, 33, 34, 38, 60) 3
(17, 33, 36, 38, 60) 3
(17, 34, 36, 38, 60) 3
(33, 34, 36, 38, 60) 3
...
This is the full ouput which I'm not posting here because of text size limitations:
Please check it out.
Sorry for making such long output, I tried to create a shorter one but didn't succeed in getting representative combos for it.
This is the Python code I wrote to accomplish what I need: (please read its commented lines too)
import pandas as pd
from itertools import combinations
from collections import Counter
df = pd.read_csv("DrawsDB.csv")
# looping through db using method found here:
# https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas
df = df.reset_index() # make sure indexes pair with number of rows
draws = []
# please read this: https://stackoverflow.com/a/55557758/7710871 (Conclusion:iter is very slow)
for index, row in df.iterrows():
draws.append(
[row['N1'], row['N2'], row['N3'], row['N4'], row['N5'], row['N6'], row['N7'], row['N8'], row['N9'], row['N10'],
row['N11'], row['N12'], row['N13'], row['N14'], row['N15'], row['N16'], row['N17'], row['N18'], row['N19'],
row['N20']])
# comparing to each other in order to check for repeating combos:
repeating_combos = []
for i in range(len(draws)):
for j in draws[i + 1:]:
repeating_combos.append(sorted(list(set(draws[i]).intersection(j))))
# e.g. getting any repeating combo of 5 across all rows:
combos_of_5 = []
for each in repeating_combos:
if len(each) == 5:
combos_of_5.append(tuple(each))
# print(each)
elif len(each) > 5:
# e.g. a repeating sequence of 6 numbers means in fact 6 combos taken by 5 numbers in this case.
# e.g. a repeating sequence of 7 numbers means in fact 21 combos of 5 numbers and so on.
# Combinations(k, n)
for cmb in combinations(each, 5):
combos_of_5.append(tuple(sorted(list(set(cmb)))))
# count how many times each combo appear:
x = Counter(combos_of_5)
sorted_x = dict(sorted(x.items(), key=lambda item: item[1], reverse=True))
for k, v in sorted_x.items():
print(k, v)
It works very well, as expected but there is one single problem: for a bigger DataFrame it takes a lot of time to do its job done. More than that, if you want to get repeating combinations with more than 5 numbers (let's say with 6, 7, 8 or 9 numbers) it will take for ever to run.
How to do it in full pandas in a very fast and much more smarter way than I did?
Also, please note that it does not generate every combo in the first instance and after that start looking for each of those combos into DataFrame because it will take even longer.
Thank you very much in advance!
P.S. What if the numbers from N1 to N20 were not sorted? Will this make any difference?
I read this topic and many others already but none is asking for the same thing so I think it is not duplicate and this could help many other have the same or very similar problem.