
Started from here: unique combinations of values in selected columns in pandas data frame and count

I have found the most to least occurring combinations of 3 columns with this code:

def common_cols(df, n):
    '''Return the n most frequent combinations of columns A, B and C.'''
    # count occurrences of each unique (A, B, C) row
    df = df.groupby(['A', 'B', 'C']).size().reset_index().rename(columns={0: 'count'})
    # sort most-to-least frequent and keep the top n
    df = df.sort_values(by='count', ascending=False).reset_index(drop=True).head(n)
    return df

common_data = common_cols(df, 10)

Output of common_data (top 10 results shown):

      A     B       C      count
0    0.00  0.00    0.00     96
1    0.00  1.00    0.00     25
2    0.14  0.86    0.00     19
3    0.13  0.87    0.00     17
4    0.00  0.72    0.28     17
5    0.00  0.89    0.11     16
6    0.01  0.84    0.15     16
7    0.03  0.97    0.00     15
8    0.35  0.65    0.00     15
9    0.13  0.79    0.08     14 
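As an aside, the same top-n count can also be written with DataFrame.value_counts (available since pandas 1.1); a minimal sketch on a tiny made-up frame, not my real data:

```python
import pandas as pd

# hypothetical data, just to show the shape of the result
df = pd.DataFrame({'A': [0.00, 0.00, 0.00, 0.14],
                   'B': [0.00, 0.00, 1.00, 0.86],
                   'C': [0.00, 0.00, 0.00, 0.00]})

# value_counts on a subset of columns sorts by count, descending, by default
common = (df.value_counts(subset=['A', 'B', 'C'])
            .reset_index(name='count')
            .head(10))
print(common)
```

This produces the same A/B/C/count layout as the groupby version above.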

Now, I would like to find combinations of consecutive A, B, C rows, and count how many times they occurred.

For example, let's say in the BASE df, rows 0 to 4:

the first set of combinations of the 3 columns (taken from the dataframe df BEFORE applying the common_cols function) is

# each of these rows are their own combination of values
       A    B     C
0    0.67  0.16  0.17
1    0.06  0.73  0.20
2    0.19  0.48  0.33
3    0.07  0.87  0.06
4    0.07  0.60  0.33

The above 5 rows (in order) would be counted as one pattern of combinations. A pattern could also be 2 rows, 3 rows, 4 rows or more (if that is easy enough to do, that is!)

If this pattern was found once (across the entire dataframe), its count would be 1; if it was found 10 times, the count would be 10.

Any ideas on how I can count the combinations found between consecutive rows? Something like the common_cols function, but for 'combinations of combinations'?

The rows have to be in order for it to be a pattern. Any help is massively appreciated!

  • I am not sure if this is at all the right approach but maybe it helps: You could (iteratively) add columns to each row that hold lists of the values of the next N rows. For example, column N_5 for row 0 holds [row0, row1, row2, row3, row4]. Then, for each of these columns, you can concatenate all these columns with their respective count columns and do the groupby. Does that make sense? – JarroVGIT Feb 07 '21 at 12:59
  • are you counting frequencies of the a,b,c values? if so a group and size should work. what else do you need? – Golden Lion Feb 07 '21 at 22:23
  • @JarroVGIT I am not sure, especially since this is my first time using groupby. Thanks for the help and idea :). Rick M's answer has what seems to be a working solution – helloworldnoob Feb 10 '21 at 13:18
  • @GoldenLion The question was about counting the amount of times a consecutive pattern was found. For example, if in the data set the numbers (in order): 0.5, 0.25, 0.25 were found with 0.2, 0.8, 0.0 multiple times, I was looking for a pattern like that – helloworldnoob Feb 10 '21 at 13:21
  • use value_counts or Counter to get the unique counts for the combinations, or groupby with size – Golden Lion Feb 10 '21 at 13:54
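The Counter route suggested in the comments can be sketched like this on hypothetical data (this counts whole rows, not yet consecutive patterns):

```python
from collections import Counter
import pandas as pd

# made-up values matching the example in the comments above
df = pd.DataFrame({'A': [0.5, 0.2, 0.5, 0.2],
                   'B': [0.25, 0.8, 0.25, 0.8],
                   'C': [0.25, 0.0, 0.25, 0.0]})

# count each (A, B, C) row as a hashable tuple
row_counts = Counter(zip(df['A'], df['B'], df['C']))
print(row_counts.most_common(2))
```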

1 Answer

I used integers for this test dataframe, but if your groupby above is working, this should also work for your data:

import numpy as np
import pandas as pd

df_size = 1000000
# draw df_size random integers in [0, 20) per column, vectorised via size=
df = pd.DataFrame({'A': np.random.randint(20, size=df_size),
                   'B': np.random.randint(20, size=df_size),
                   'C': np.random.randint(20, size=df_size)})

print(df.head())
    A   B   C
0  12  12   5
1  19  12  12
2  14  11  15
3  11  14   8
4  13  16   2

The code below makes a list called source of the triplets (A, B, C) using zip. The tmp variable (a generator) is effectively a list that holds successively "shifted" copies of the source list, like [source[0:], source[1:], source[2:]...]

Finally, the zip interleaves the values from the lists in tmp, e.g., for n=2 it would generate a list of [(source[0], source[1]), (source[1], source[2]), ... ]

source = list(zip(df['A'], df['B'], df['C']))
n_consecutive = 3

# shifted copies of source; zip(*tmp) stops at the shortest copy, so each
# element of output is a window of n_consecutive consecutive (A, B, C) rows
tmp = ( source[i:] for i in range(n_consecutive) )
output = pd.Series(list(zip(*tmp)))

For this example, value_counts on output gives a Series containing the count of each window of n_consecutive (A, B, C) triplets:

print(output.value_counts().head())
((6, 19, 14), (19, 12, 6), (13, 7, 10))    2
((2, 18, 12), (17, 2, 19), (7, 19, 19))    1
((10, 2, 3), (1, 18, 8), (3, 6, 19))       1
((16, 15, 14), (11, 2, 9), (14, 14, 8))    1
((3, 3, 7), (13, 9, 3), (18, 15, 6))       1
dtype: int64
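To check the window logic by hand, a minimal sketch on a tiny hypothetical frame with n_consecutive = 2:

```python
import pandas as pd

# 5 rows that alternate between two distinct (A, B, C) triplets
df = pd.DataFrame({'A': [1, 2, 1, 2, 1],
                   'B': [3, 4, 3, 4, 3],
                   'C': [5, 6, 5, 6, 5]})

source = list(zip(df['A'], df['B'], df['C']))
n_consecutive = 2

tmp = (source[i:] for i in range(n_consecutive))
output = pd.Series(list(zip(*tmp)))

# 4 overlapping windows: two distinct patterns, each appearing twice
print(output.value_counts())
```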

Note that this can double-count windows depending on what you are looking for, because consecutive windows overlap. For example, if the base df has three identical records in a row, and you're looking for patterns of 2 consecutive rows:

(1, 3, 4)
(1, 3, 4)
(1, 3, 4)

In that case it will find the window ((1, 3, 4), (1, 3, 4)) twice.
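If overlapping matches are unwanted, one possible variant (an assumption about intent, not part of the original answer) is to step through source in strides of n_consecutive, so each row lands in at most one window:

```python
import pandas as pd

# the three-identical-rows example from above
source = [(1, 3, 4), (1, 3, 4), (1, 3, 4), (1, 3, 4)]
n_consecutive = 2

# non-overlapping windows: start indices 0, n, 2n, ...
windows = [tuple(source[i:i + n_consecutive])
           for i in range(0, len(source) - n_consecutive + 1, n_consecutive)]
counts = pd.Series(windows).value_counts()
print(counts)
```

With four identical rows this counts the pattern twice, where the overlapping version would count it three times.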

Rick M
  • Awesome man thank you very much. I am not sure about the double count but it seems to be working as it should, compared to the different code I was running to get the same result (my method was super clunky, yours is a godsend). Appreciated :) – helloworldnoob Feb 10 '21 at 13:16
  • Would you know how to implement a np.isclose() solution inside your code? Or is that something I would have to first do? – helloworldnoob Feb 10 '21 at 13:33
  • You're welcome! You could truncate the values to a desired level of precision before grouping and that would probably work, or maybe pre-binning the data with `np.histogramdd`... you might want to put up a separate question if that doesn't work. If you're good to go on this one, please mark it as 'accepted' by clicking the green check mark. This helps keep the focus on older SO questions which still don't have answers. Thanks, and take care. – Rick M Feb 10 '21 at 13:43
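A minimal sketch of the truncation idea from that last comment: round the frame before building source, so nearly-equal floats collapse into the same pattern (the number of decimals here is an assumption):

```python
import pandas as pd

# two rows that differ only by float noise in column A
df = pd.DataFrame({'A': [0.670001, 0.669999],
                   'B': [0.16, 0.16],
                   'C': [0.17, 0.17]})

# rounding to a fixed number of decimals makes nearly-equal
# floats compare equal when the rows are turned into tuples
rounded = df.round(2)
source = list(zip(rounded['A'], rounded['B'], rounded['C']))
counts = pd.Series(source).value_counts()
print(counts)
```

Without the round, the two rows would count as two different patterns.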