I successfully used pandas to create a Pearson correlation matrix. From this matrix, I extracted many pairs of genes whose correlation exceeded a certain threshold (0.75 in this instance) and saved each pair as a tuple in one large list.
How do I now go through this list to generate every possible group of correlated genes that is larger than a pair? For example:
Let's say there are 4 genes: A, B, C, and D. Each gene is correlated with every other gene (and with itself, obviously). This means that somewhere in the big list there are the following 10 tuple pairs: [(A,A), (B,B), (C,C), (D,D), (A,B), (A,C), (A,D), (B,C), (B,D), (C,D)]. From these you could also create the tuples (A, B, C), (A, B, D), (A, C, D), (B, C, D), and (A, B, C, D), since all of their members correlate with each other. How do I go about writing a function that creates every one of these new groups using just the pairs I currently have saved in my list?
By the way, if gene A correlated with gene B, and gene B correlated with gene C, but gene A did not correlate with gene C, they would NOT form the group (A, B, C). Everything has to correlate with each other to form a group.
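To make that condition concrete, here is a minimal sketch of the check I have in mind, assuming the gene names are hashable values such as strings. The names `build_neighbors` and `is_group` are hypothetical helpers made up for this illustration, not part of pandas or any library.

```python
from itertools import combinations

def build_neighbors(pairs):
    """Map each gene to the set of genes it appears with in a pair."""
    neighbors = {}
    for a, b in pairs:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return neighbors

def is_group(genes, neighbors):
    """Return True only if every two distinct genes in `genes` are paired."""
    return all(b in neighbors.get(a, ()) for a, b in combinations(genes, 2))

# A correlates with B, and B with C, but A does not correlate with C.
pairs = [("A", "B"), ("B", "C")]
neighbors = build_neighbors(pairs)
print(is_group(("A", "B"), neighbors))       # True
print(is_group(("A", "B", "C"), neighbors))  # False, because (A, C) is missing
```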
The number of tuple pairs is in the millions, so finding an efficient way to get this done would be ideal (but not essential).
I am honestly not sure where to begin. It was suggested that I use a function that returns all subsets of an array, but that alone doesn't solve the overall problem. Here are some possible steps I thought of that would technically get me what I want but would be extremely inefficient:
- I make a simple list of every single gene there is from all the pairs.
- I run the command to generate every possible subset of this list of genes.
- I then comb through every generated subset and, using the list of pairs, check that everything within the subset correlates with everything else. If it doesn't, I toss that subset out.
- The subsets that remain are my answers.
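The steps above could be sketched roughly as follows. This is only the brute-force idea written out, assuming gene names are sortable strings; `brute_force_groups` is a hypothetical name, and the function returns only the new groups of three or more genes, to be appended to the existing pairs. With n genes it looks at on the order of 2^n subsets, so it is not usable at the scale of my real data.

```python
from itertools import combinations

def brute_force_groups(pairs):
    """Return every group of 3+ genes in which all members are pairwise correlated."""
    # Step 1: collect every gene that appears in any pair.
    genes = sorted({gene for pair in pairs for gene in pair})
    # Store pairs as unordered sets so (A, B) and (B, A) look the same.
    correlated = {frozenset(pair) for pair in pairs}

    groups = []
    # Step 2: generate every subset with at least three genes.
    for size in range(3, len(genes) + 1):
        for subset in combinations(genes, size):
            # Step 3: keep the subset only if every pair inside it is in the list.
            if all(frozenset(p) in correlated for p in combinations(subset, 2)):
                groups.append(subset)
    # Step 4: whatever survived is the answer.
    return groups

# The sample input from this question, with genes written as strings.
sample = [("A", "A"), ("B", "B"), ("C", "C"), ("D", "D"), ("E", "E"),
          ("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"),
          ("C", "D"), ("C", "E")]
groups = brute_force_groups(sample)
print(groups)
# [('A', 'B', 'C'), ('A', 'B', 'D'), ('A', 'C', 'D'), ('B', 'C', 'D'), ('A', 'B', 'C', 'D')]
```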
Sample Input: [(A,A), (B,B), (C,C), (D,D), (E,E), (A,B), (A,C), (A,D), (B,C), (B,D), (C,D), (C,E)]
Sample Output: [(A,A), (B,B), (C,C), (D,D), (E,E), (A,B), (A,C), (A,D), (B,C), (B,D), (C,D), (C,E), (A,B,C,D), (A,B,C), (B,C,D), (A,C,D), (A,B,D)]
Note how E isn't found in any of the new combinations: it only correlates with itself and C, so it can't be part of any larger group.