I successfully used pandas to create a Pearson correlation matrix. From this matrix, I extracted every pair of genes whose correlation exceeded a certain threshold (0.75 in this instance) and saved each pair as a tuple in a large list.

How do I now go through this list to generate every possible correlated gene combination that's larger than a pair? For example:

Let's say there are 4 genes: A, B, C, and D. Each gene is correlated with every other (and itself, obviously). This means that somewhere in the big list are the following 10 separate tuple pairs: [(A,A), (B,B), (C,C), (D,D), (A,B), (A,C), (A,D), (B,C), (B,D), (C,D)]. However, from these you could also now create the tuples (A, B, C), (A, B, C, D), (B, C, D), (A, C, D), (A, B, D), and so on, since their members all correlate with each other. How do I write a function that creates every such larger group using just the pairs I currently have saved in my list?

By the way, if gene A correlated with gene B, and gene B correlated with gene C, but gene A did not correlate with gene C, they would NOT form the group (A, B, C). Everything has to correlate with everything else to form a group.

The number of tuple pairs is in the millions, so finding an efficient way to get this done would be ideal (but not essential).

I am honestly not sure where to begin. I was recommended a function that would give me all subsets of an array, but that doesn't really help with the overall issue. Here are some possible steps I thought of that would technically get me what I want but would be extremely inefficient:

  1. I make a list of every gene found across all the pairs.
  2. I generate every possible subset of this gene list.
  3. I comb through every generated subset and, using the pair list, check that everything within the subset correlates with everything else. If it doesn't, I toss that subset out.
  4. The remaining subsets are my answers.
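
For what it's worth, those four steps translate almost directly into Python. This is only a brute-force sketch, not tuned for millions of pairs; `correlated_groups` is a hypothetical name:

```python
from itertools import combinations

def correlated_groups(pairs):
    """Brute force: test every subset of genes against the pair list."""
    pair_set = {tuple(sorted(p)) for p in pairs}   # O(1) membership tests
    genes = sorted({g for p in pairs for g in p})  # step 1: collect every gene
    groups = list(pair_set)                        # keep the original pairs
    for size in range(3, len(genes) + 1):          # step 2: every larger subset
        for subset in combinations(genes, size):
            # step 3: keep the subset only if every pair inside it correlates
            if all(p in pair_set for p in combinations(subset, 2)):
                groups.append(subset)
    return groups                                  # step 4: the survivors
```

Step 2 is the expensive part: it still enumerates 2^n subsets before filtering, which is why this is only workable for small gene counts.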

Sample Input: [(A,A), (B,B), (C,C), (D,D), (E,E), (A,B), (A,C), (A,D), (B,C), (B,D), (C,D), (C,E)]

Sample Output: [(A,A), (B,B), (C,C), (D,D), (E,E), (A,B), (A,C), (A,D), (B,C), (B,D), (C,D), (C,E), (A,B,C,D), (A,B,C), (B,C,D), (A,C,D), (A,B,D)]

Note how E isn't found in any of the new combinations, since it only correlates with itself and C, so it can't be included in any larger group.

  • please provide a small example of exact input and output you expect – anon01 Jan 02 '23 at 19:01
  • use python's `itertools` and `combinations()` to create combinations. – Shmack Jan 02 '23 at 19:03
  • 1
    You will want to consider [itertools.combinations](https://docs.python.org/3/library/itertools.html#itertools.combinations), I suspect. – theherk Jan 02 '23 at 19:04
  • 1
    Also, adding sample input and output would be helpful rather than strictly prose. – theherk Jan 02 '23 at 19:06
  • 1
    it sounds like you are trying to create something like a `power set`? https://docs.python.org/3/library/itertools.html#itertools-recipes – anon01 Jan 02 '23 at 19:06
  • 2
    Power set is the way to go -> https://stackoverflow.com/a/18035641/3155240 – Shmack Jan 02 '23 at 19:09
  • Does this answer your question? [Powersets in Python using itertools](https://stackoverflow.com/questions/18035595/powersets-in-python-using-itertools) – Pranav Hosangadi Jan 02 '23 at 19:18
  • 1
    If you look at the sample input and output I just provided, you will see the powerset tool is not sufficient. Sure, you are generating combinations, but it doesn't utilize the fact of whether or not everything in the combination is correlated with each other or not. In the example Input and Output, you may notice that E wasn't used in any of the new larger combos since E only correlates with itself and C and nothing else. – Sanjiv Prasad Jan 02 '23 at 19:22

2 Answers


First, this is actually an interesting question once I understood what you were after. This can be done, but my intuition tells me there is probably a more efficient way than what I have here. This uses only `itertools.combinations`, comprehensions, and a set (just to prevent duplicates).

from itertools import combinations

genes = ["A", "B", "C", "D", "E"]  # must include every gene that appears in pairs

# We assume each pair is lexicographical.
# If not, they must be sorted.
pairs = [
    ("A", "A"),
    ("B", "B"),
    ("C", "C"),
    ("D", "D"),
    ("E", "E"),
    ("A", "B"),
    ("A", "C"),
    ("A", "D"),
    ("B", "C"),
    ("B", "D"),
    ("C", "D"),
    ("C", "E"),
]

# Lengths of all combinations longer than a pair.
# So, if you have 6 unique genes [A,B,C,D,E,F], this will be [3, 4, 5, 6].
lengths = range(3, len(genes) + 1)

# Combinations that can be built from the pairs.
new_combs = {c for l in lengths for c in combinations(genes, l)}

# This is a list of those combinations, where each pair built for a given
# combination is found in the original pairs list.
# For example, if a new combination is [A,B,C], it is only included in
# correlations if AB, BC, and AC are all found in pairs.
# A set makes the many membership tests O(1) instead of O(len(pairs)).
pair_set = set(pairs)
correlations = {x for x in new_combs if all(c in pair_set for c in combinations(x, 2))}
print(list(correlations))

Output

ᐅ ./main.py
[('A', 'B', 'C'), ('A', 'B', 'C', 'D'), ('A', 'B', 'D'), ('B', 'C', 'D'), ('A', 'C', 'D')]
theherk
  • Thank you for your help! Are there a couple of lines of code you can add somewhere that would let someone maybe see a progress bar so we have some idea of how much longer the program will take to run? When working with millions of data points it would be very helpful! – Sanjiv Prasad Jan 03 '23 at 03:17
  • Probably, but it isn't so straightforward, and I imagine doing so may end up slowing the process down if not implemented very precisely. Your best bet would be to profile duration for given sample sizes, then use maths to calculate likely run times. – theherk Jan 03 '23 at 08:21
  • First of all, thank you for the answer. This method, combined with the clique method from another poster, has helped me develop my program. I am now, however, running into issues from the sheer amount of data I am processing. While the `lengths` and `correlations` lines use next to no RAM, the `new_combs` line uses gigabytes, if not terabytes, of RAM, which makes even compute clusters error out depending on the amount of input. Do you know a way to rewrite that line so that it iterates directly instead of saving every combo to RAM? – Sanjiv Prasad Feb 07 '23 at 20:47
  • Well that set is being generated completely in memory, so the way not to do that is to use a [generator](https://wiki.python.org/moin/Generators). But how to implement that in this case is probably beyond the scope of this question. I would suggest making a minimally reproducible example and posting another question. – theherk Feb 09 '23 at 07:41
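
Following up on the generator suggestion in the comment above, a minimal sketch of how the combination step could be streamed instead of materialised. It assumes `pair_set` has been built once with `set(pairs)`:

```python
from itertools import combinations

def iter_correlations(genes, pair_set):
    """Yield valid groups one at a time instead of building every candidate
    combination in memory first; only the current candidate is held in RAM."""
    for size in range(3, len(genes) + 1):
        for combo in combinations(genes, size):  # lazy: nothing is stored
            if all(p in pair_set for p in combinations(combo, 2)):
                yield combo

# Usage sketch: write results to disk as they are found rather than
# collecting them in a set.
# with open("correlations.txt", "w") as fh:
#     for group in iter_correlations(genes, pair_set):
#         fh.write(",".join(group) + "\n")
```

This removes the memory blow-up of the candidate set, but note it still spends the same amount of CPU time enumerating candidates.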

This falls into the class of problems known as clique problems.

Specifically, you want to list all maximal cliques of the graph whose vertices are your genes and whose edges are your correlated pairs.

According to Wikipedia, the most efficient algorithm in practice is the Bron–Kerbosch algorithm (or one of its variations) from 1973.
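
For reference, the basic Bron–Kerbosch recursion (without the pivoting or vertex-ordering refinements the variations add) is short enough to sketch directly. Here `adj` is assumed to be a dict mapping each gene to the set of its correlated neighbours:

```python
def bron_kerbosch(R, P, X, adj, out):
    """Collect every maximal clique: R is the clique being grown, P the
    candidates that extend it, X the vertices already covered elsewhere."""
    if not P and not X:
        out.append(set(R))  # R is maximal: nothing can extend it
        return
    for v in list(P):
        bron_kerbosch(R | {v}, P & adj[v], X & adj[v], adj, out)
        P.remove(v)  # v's cliques are done; exclude it from later branches
        X.add(v)

# Adjacency for the sample input (self-loops omitted).
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D", "E"},
    "D": {"A", "B", "C"},
    "E": {"C"},
}
cliques = []
bron_kerbosch(set(), set(adj), set(), adj, cliques)
# cliques now holds the maximal cliques, e.g. {A, B, C, D} and {C, E}
```

In practice the networkx implementation below is the sensible choice; this is only to show what the algorithm does.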


Here is an implementation using `find_cliques` from the `networkx` package. It uses the Bron–Kerbosch algorithm.

Editor's note: this doesn't return the expected output, but it is a fascinating bit of information. I don't know enough about graph theory to validate whether this should work or what needs to change to make it work. If the original poster could fix this or provide guidance, that would be superb.

import networkx as nx
from networkx.algorithms.clique import find_cliques as maximal_cliques

genes = ["A", "B", "C", "D", "E"]
pairs = [
    ("A", "A"),
    ("B", "B"),
    ("C", "C"),
    ("D", "D"),
    ("E", "E"),
    ("A", "B"),
    ("A", "C"),
    ("A", "D"),
    ("B", "C"),
    ("B", "D"),
    ("C", "D"),
    ("C", "E"),
]

G = nx.Graph()
G.add_nodes_from(genes)
G.add_edges_from(pairs)

print(list(maximal_cliques(G)))

This yields:

[['C', 'D', 'B', 'A'], ['C', 'E']]

This package is mentioned in this related answer.


I was slightly off in my original answer: the desired result, as specified by the OP, is the set of all complete subgraphs (i.e. cliques), not just the maximal ones. To go from the maximal cliques to every clique, one simply has to add all possible subsets of each maximal clique, since every subgraph of a maximal complete subgraph of G is itself a complete subgraph of G.
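
That subset expansion can be sketched without networkx; the set comprehension also deduplicates cliques shared between overlapping maximal cliques (`all_cliques` here is a hypothetical helper name):

```python
from itertools import combinations

def all_cliques(maximal_cliques):
    """Every non-empty subset of a maximal clique is itself a clique, so
    expanding each maximal clique yields the full clique list."""
    return {
        tuple(sorted(sub))
        for clique in maximal_cliques
        for size in range(1, len(clique) + 1)
        for sub in combinations(clique, size)
    }

# e.g. with the maximal cliques found for the sample graph:
# all_cliques([["A", "B", "C", "D"], ["C", "E"]])
```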

Or one can just use enumerate_all_cliques from the networkx package.

import networkx as nx
from networkx.algorithms.clique import enumerate_all_cliques as all_cliques

pairs = [
    # the self-referencing edges (e.g. [A, A]) are not needed
    ("A", "B"),
    ("A", "C"),
    ("A", "D"),
    ("B", "C"),
    ("B", "D"),
    ("C", "D"),
    ("C", "E"),
]

G = nx.Graph()
# one does not have to explicitly add the vertices as long as they appear in an edge
G.add_edges_from(pairs)

print(list(all_cliques(G)))

This yields:

[['A'], ['B'], ['C'], ['D'], ['E'], ['A', 'B'], ['A', 'C'], ['A', 'D'], ['B', 'C'], ['B', 'D'], ['C', 'D'], ['C', 'E'], ['A', 'B', 'C'], ['A', 'B', 'D'], ['A', 'C', 'D'], ['B', 'C', 'D'], ['A', 'B', 'C', 'D']]

Strictly speaking, this is still not quite what was requested, as the self-correlation entries are represented as [A] instead of [A, A], but that can easily be fixed with some postprocessing.
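
That postprocessing is a one-liner: expand each singleton into a self-pair and leave the larger cliques alone.

```python
cliques = [["A"], ["B"], ["A", "B"]]  # shape of enumerate_all_cliques output

# Singletons become (gene, gene); everything else is just converted to a tuple.
result = [tuple(c) if len(c) > 1 else (c[0], c[0]) for c in cliques]
# result == [("A", "A"), ("B", "B"), ("A", "B")]
```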

PeterE