1

I am a Bioinformatician and recently started picking up python and i am interested in plotting a Graph.I have a set of nodes and edges.

nodes

set(['ACG', 'ATC', 'GAT', 'CGG', 'CGT', 'AAT', 'ATT', 'GGA', 'TTC', 'CCG', 'TCC', 'GAA'])

edges

[('ACG', 'CGG'), ('CGG', 'GGA'), ('GGA', 'GAA'), ('GAA', 'AAT'), ('AAT', 'ATC'), ('GAT', 'ATT'), ('ATT', 'TTC'), ('TTC', 'TCC'), ('TCC', 'CCG'), ('CCG', 'CGT')]

When i construct the normal graph using above information i get 12 nodes and 10 edges i.e. two disconnected graphs using below function.

def visualize_de_bruijn():
    """ Visualize a directed multigraph using graphviz """
    nodes = set(['ACG', 'ATC', 'GAT', 'CGG', 'CGT', 'AAT', 'ATT', 'GGA',      'TTC', 'CCG', 'TCC', 'GAA'])
    edges = [('ACG', 'CGG'), ('CGG', 'GGA'), ('GGA', 'GAA'), ('GAA', 'AAT'), ('AAT', 'ATC'), ('GAT', 'ATT'), ('ATT', 'TTC'), ('TTC', 'TCC'), ('TCC', 'CCG'), ('CCG', 'CGT')]
    dot_str = 'digraph "DeBruijn graph" {\n'
    for node in nodes:
        dot_str += '  %s [label="%s"] ;\n' % (node, node)
    for src, dst in edges:
        dot_str += '  %s -> %s ;\n' % (src, dst)
    return dot_str + '}\n'

In Biology we have this concept of complementary base pairing where A=T ,T=A ,G=C, and C=G. So complimentary of 'ACG' is 'TGC' and reverse complimentary of 'ACG' = 'CGT' i.e we reverse the complement.

In the nodes list we see 12 nodes in which 6 nodes are reverse complement of each other i.e.

ReverseComplement('ACG') = CGT 
ReverseComplement('CGG') = CCG 
ReverseComplement('GGA') = TCC   
ReverseComplement('GAA') = TTC 
ReverseComplement('AAT') = ATT 
ReverseComplement('ATC') = GAT

Now i would like to construct a graph where there are six nodes, a node should have its own value and its reverse complement value and total 10 edges i.e. graph should not be disconnected. How to visualize this graph using graphviz in python.? If there is anything else other than graphviz which can help me visualize this type of graph please let me know .?

BioGeek
  • 21,897
  • 23
  • 83
  • 145
piyush
  • 71
  • 2
  • 7

1 Answers1

2

I'm not really sure what you're trying to accomplish here (so be aware that you may be having an XY problem) but let's take your question and see where it gets us.

a node should have its own value and its reverse complement value

So we need some object to store a sequence and the reverse complement of that sequence.

There are different ways of making the reverse complement of a sequence. As a bioinformatician, you should really be working with a library suited for bioinformatics, namely BioPython.

Then making a reverse complement looks like this:

from Bio.Seq import Seq

seq = 'ATCG'
print str(Seq(seq).reverse_complement()) # CGAT

But generating the Seq object might be overkill for this problem, so I just use a standard dictionary below. We also will want to compare the Node objects with each other, so we need to override __eq__. And because we will want to make a set of unique Node objects, we need to implement __hash__ as well. For pretty printing, we also implement __str__.

class Node(object):
    def __init__(self, seq):
        self.seq = seq
        self.revcompl = self.revcompl()

    def revcompl(self):
        complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
        return "".join(complement.get(base, base) for base in reversed(self.seq))

    def __eq__(self, other):
        return self.seq == other.seq or self.revcompl == other.seq

    def __hash__(self):
        return hash(self.seq) ^ hash(self.revcompl)

    def __str__(self):
        return '({}, {})'.format(self.seq, self.revcompl)

So now we can take our set or original nodes and turn them into our list of new nodes with their reverse complement.

nodes = set(['ACG', 'ATC', 'GAT', 'CGG', 'CGT', 'AAT', 'ATT', 'GGA', 'TTC', 'CCG', 'TCC', 'GAA'])
newnodes = set(Node(seq) for seq in nodes)
assert len(newnodes) == 6

Now we need to connect the nodes. You didn't really state in your question how you generated your list with edges. The visualization of what you posted looks indeed like you described: two disconnected graphs.

two disconnected graphs

However, when I would create a DeBruijn graph, I would pairwise compare all the sequences, see if they have any overlap between them, create an adjacency list and from that generate the DOT code for graphviz.

from itertools import product

def suffix(seq, overlap):
    return seq[-overlap:]

def prefix(seq, overlap):
    return seq[:overlap]

def has_overlap_seq(seq1, seq2, overlap=2):
    if seq1 == seq2:
        return False
    return suffix(seq1, overlap) == prefix(seq2, overlap)

def get_adjacency_list_seqs(sequences, overlap=2):
    for seq1, seq2 in product(sequences, repeat=2):
        if has_overlap_seq(seq1, seq2, overlap):
            yield seq1, seq2

def make_dot_plot(adjacency_list):
    """Creates a DOT file for a directed graph."""
    template = """digraph "DeBruijn graph"{{
{}
}}""".format
    edges = '\n'.join('"{}" -> "{}"'.format(*edge) for edge in adjacency_list)
    return template(edges)

If I do that for your original nodes,

seq_adjacency_list = get_adjacency_list_seqs(nodes)
print make_dot_plot(seq_adjacency_list)

I get a single connected graph:

single connected graph

So I'm not sure if there was an error in your original implementation of generating the edges list, or if you were trying to do something completely else.

Now moving forward, we can adapt the previous code for sequence strings, to also work with our Node objects we created earlier.

def has_overlap_node(node1, node2, overlap=2):
    if node1 == node2:
        return False
    return suffix(node1.seq, overlap) == prefix(node2.seq, overlap) or \
           suffix(node1.seq, overlap) == prefix(node2.revcompl, overlap) or \
           suffix(node1.revcompl, overlap) == prefix(node2.seq, overlap) or \
           suffix(node1.revcompl, overlap) == prefix(node2.revcompl, overlap)

def get_adjacency_list_nodes(nodes, overlap=2):
    for node1, node2 in product(nodes, repeat=2):
        if has_overlap_node(node1, node2, overlap):
            yield node1, node2

Applying this

nodes_adjacency_list = get_adjacency_list_nodes(newnodes)
print make_dot_plot(nodes_adjacency_list)

generates

nodes plot

which indeed has 6 nodes but 12 instead of the requested 10 edges.

Community
  • 1
  • 1
BioGeek
  • 21,897
  • 23
  • 83
  • 145
  • My main interest is to do de novo assembly from the reads . For that purpose only i created a simple setup i.e a genome 'ACGGAATC' and its reverse complement i.e. 'GATTCCGT'. I took all kmers of length 4 which will result in 10 kmers . As for the edges i added a edge from left k-1 mer to right k-1 mer i.e total 10edges . There is no need to look for overlaps. Now my question is given the nodes and the edges how to create a Debruijn graph like the last graph you made in the solution above and how to do eulerian walk which should give me back the resultant genome and its reverse complement. – piyush Oct 18 '15 at 13:29