Hi Stack Overflow community. I think I am trying to code the impossible with matplotlib, so if a different Python library would suit me better, please let me know!

I have the entire amino acid sequence of a protein (protein x), represented as capital letters in the image. This will be my x-axis.

I have two Excel columns: Disease and Control. Each row in these columns contains a part of protein x's full amino acid sequence. Sometimes there are multiple hits, where the Disease or Control column contains the same section of protein x twice. I want these to stack on top of each other so that one can see how many hits Disease and Control have on protein x.

Confusing? Sorry. Here's a sample of what I was able to come up with using PowerPoint.

[Image: Amino Acid Comparison]

The black text is the reference sequence. Purple is control. Pink is disease. Make sense now?

I need to do this with a HUGE dataset, so no, I do not want to "just use PowerPoint for hours". I also want to be able to do it with any reference sequence of my choosing.

I'm not asking someone to do my job for me. I need someone to point me in the right direction. Is there a special library? Should I be converting everything into numbers and then relabeling as text?

Thanks and I appreciate any advice.

eyllanesc
Alex Nesta
  • Hi Alex. You may try the Biopython bioinformatics package (http://biopython.org/wiki/AlignIO) for sequence alignment if you need to perform the sequence alignment analysis over a large data set. For the visualization, I'm not sure that Python has a tool designed to meet your specific needs. Matplotlib is an incredibly flexible tool and could be used to reproduce your sample plot, but it would be helpful to have sample data and the code you've tried so far. – Brian May 19 '17 at 03:16
  • What precisely do your Disease and Control columns contain? Nothing you said makes sense unless you are leaving out a gigantic alignment detail, or the columns are the same size as X's length and correspond one-to-one to each amino acid in it. @Brian, isn't he explaining that the columns don't correspond one-to-one to X's length and instead hold individual sequences of their own? – Krupip May 19 '17 at 03:17
  • Thanks for your help and suggestions so far. Yes, I will also need to figure out how to align each row of the Disease and Control columns to protein x's sequence. The rows in those columns contain sequences of 5-15 letters that should align perfectly onto a section of protein x (this is shown in the image as purple and pink). I think I will be able to do that with Biopython, but it's hard to imagine how I would visualize the aligned data. – Alex Nesta May 19 '17 at 11:01
  • For customized graphics, I sometimes use Python to programmatically generate TikZ code. This may result in a large PDF file, though. – bli May 20 '17 at 04:29

2 Answers

Create an SVG image, which is just XML text, using a script. I will demonstrate with a simpler example!

Suppose your target is this:

[Image: overall image]

Begin by breaking the big string at each place where a column of string fragments will start; in this case, at 'EF' and 'IJKL'. You can position the pieces of the big string using features of the SVG XML (more on this shortly). Since you know the starting positions of the fragments and the height of the characters, you can position the layers in each column.

This is the kind of thingy you would have to build.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- Created with Inkscape (http://www.inkscape.org/) -->

<svg
   xmlns:dc="http://purl.org/dc/elements/1.1/"
   xmlns:cc="http://creativecommons.org/ns#"
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:svg="http://www.w3.org/2000/svg"
   xmlns="http://www.w3.org/2000/svg"
   xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
   xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
   width="210mm"
   height="297mm"
   viewBox="0 0 210 297"
   version="1.1"
   id="svg8"
   inkscape:version="0.92.0 r15299"
   sodipodi:docname="genes.svg">
  <defs
     id="defs2" />
  <sodipodi:namedview
     id="base"
     pagecolor="#ffffff"
     bordercolor="#666666"
     borderopacity="1.0"
     inkscape:pageopacity="0.0"
     inkscape:pageshadow="2"
     inkscape:zoom="1.4"
     inkscape:cx="170.60599"
     inkscape:cy="341.08014"
     inkscape:document-units="mm"
     inkscape:current-layer="layer1"
     showgrid="false"
     inkscape:window-width="1095"
     inkscape:window-height="676"
     inkscape:window-x="145"
     inkscape:window-y="122"
     inkscape:window-maximized="0" />
  <metadata
     id="metadata5">
    <rdf:RDF>
      <cc:Work
         rdf:about="">
        <dc:format>image/svg+xml</dc:format>
        <dc:type
           rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
        <dc:title></dc:title>
      </cc:Work>
    </rdf:RDF>
  </metadata>
  <g
     inkscape:label="Layer 1"
     inkscape:groupmode="layer"
     id="layer1">
    <text
       xml:space="preserve"
       style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:9.8777771px;line-height:6.61458302px;font-family:Courier;-inkscape-font-specification:Courier;font-variant-ligatures:normal;font-variant-caps:normal;font-variant-numeric:normal;font-feature-settings:normal;text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.26458332;"
       x="24.588797"
       y="179.4014"
       id="text12"><tspan
         sodipodi:role="line"
         id="tspan10"
         x="24.588797"
         y="185.32886"
         style="stroke-width:0.26458332;-inkscape-font-specification:Courier;font-family:Courier;font-weight:normal;font-style:normal;font-stretch:normal;font-variant:normal;" /></text>
    <text
       xml:space="preserve"
       style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:9.8777771px;line-height:6.61458302px;font-family:Calibri;-inkscape-font-specification:'Calibri, Normal';font-variant-ligatures:normal;font-variant-caps:normal;font-variant-numeric:normal;font-feature-settings:normal;text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.26458332"
       x="23.8125"
       y="207.41963"
       id="text24"><tspan
         sodipodi:role="line"
         id="tspan22"
         x="23.8125"
         y="207.41963"
         style="stroke-width:0.26458332">ABCD</tspan></text>
    <text
       xml:space="preserve"
       style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:9.8777771px;line-height:6.61458302px;font-family:Calibri;-inkscape-font-specification:'Calibri, Normal';font-variant-ligatures:normal;font-variant-caps:normal;font-variant-numeric:normal;font-feature-settings:normal;text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.26458332"
       x="46.302082"
       y="207.41965"
       id="text28"><tspan
         sodipodi:role="line"
         id="tspan26"
         x="46.302082"
         y="207.41963"
         style="stroke-width:0.26458332">EFGH</tspan></text>
    <text
       xml:space="preserve"
       style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:9.8777771px;line-height:6.61458302px;font-family:Calibri;-inkscape-font-specification:'Calibri, Normal';font-variant-ligatures:normal;font-variant-caps:normal;font-variant-numeric:normal;font-feature-settings:normal;text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.26458332"
       x="67.657738"
       y="207.41963"
       id="text32"><tspan
         sodipodi:role="line"
         id="tspan30"
         x="67.657738"
         y="207.41963"
         style="stroke-width:0.26458332">IJKLMN</tspan></text>
    <text
       xml:space="preserve"
       style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:9.8777771px;line-height:6.61458302px;font-family:Calibri;-inkscape-font-specification:'Calibri, Normal';font-variant-ligatures:normal;font-variant-caps:normal;font-variant-numeric:normal;font-feature-settings:normal;text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.26458332"
       x="46.680061"
       y="199.67113"
       id="text36"><tspan
         sodipodi:role="line"
         id="tspan34"
         x="46.302082"
         y="199.67113"
         style="stroke-width:0.26458332">EF</tspan></text>
    <text
       xml:space="preserve"
       style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:9.8777771px;line-height:6.61458302px;font-family:Calibri;-inkscape-font-specification:'Calibri, Normal';font-variant-ligatures:normal;font-variant-caps:normal;font-variant-numeric:normal;font-feature-settings:normal;text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.26458332"
       x="67.846725"
       y="192.86755"
       id="text40"><tspan
         sodipodi:role="line"
         id="tspan38"
         x="67.657738"
         y="192.86755"
         style="stroke-width:0.26458332">IJKL</tspan></text>
  </g>
</svg>

Obviously I've done it in Inkscape, but you'll get the idea. There's nothing here that can't be done in Python.
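
A minimal Python sketch of how a script might emit a simplified version of such an SVG. Everything in it is an assumption made for illustration: the names `assign_rows`, `text_element`, and `build_svg`, the `CHAR_W` and `LINE_H` constants, the colors, and the choice to locate each fragment with an exact `str.find` match (first occurrence only). It stacks overlapping fragments in rows above the reference and uses the SVG `textLength` attribute to pin each piece of text to a fixed width per character.

```python
from xml.sax.saxutils import escape

CHAR_W = 6.0   # assumed width of one character, in SVG user units
LINE_H = 12.0  # assumed distance between stacked rows


def assign_rows(hits):
    """Greedily give each (start, fragment, group) hit the lowest free row."""
    row_ends = []  # row_ends[r] is the index just past the last hit in row r
    placed = []
    for start, frag, group in sorted(hits):
        for row, end in enumerate(row_ends):
            if start >= end:  # no overlap with anything already in this row
                break
        else:
            row = len(row_ends)  # every existing row is occupied here
            row_ends.append(0)
        row_ends[row] = start + len(frag)
        placed.append((row, start, frag, group))
    return placed


def text_element(x, y, text, fill):
    """One <text> element; textLength forces a fixed width per character."""
    return (f'<text x="{x}" y="{y}" fill="{fill}" '
            f'textLength="{len(text) * CHAR_W}" lengthAdjust="spacing">'
            f'{escape(text)}</text>')


def build_svg(reference, groups, colors):
    """groups maps a group name to its fragments; colors maps it to a fill."""
    hits = []
    for group, fragments in groups.items():
        for frag in fragments:
            start = reference.find(frag)  # exact match, first occurrence only
            if start != -1:
                hits.append((start, frag, group))
    placed = assign_rows(hits)
    n_rows = max((row for row, _, _, _ in placed), default=-1) + 1
    base_y = (n_rows + 1) * LINE_H  # baseline of the reference sequence
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{len(reference) * CHAR_W}" height="{base_y + LINE_H}" '
        f'font-family="monospace" font-size="10">',
        text_element(0, base_y, reference, "black"),
    ]
    for row, start, frag, group in placed:
        y = base_y - (row + 1) * LINE_H  # higher rows sit further up
        parts.append(text_element(start * CHAR_W, y, frag, colors[group]))
    parts.append("</svg>")
    return "\n".join(parts)


svg_text = build_svg(
    "ABCDEFGHIJKLMN",
    {"control": ["EF"], "disease": ["IJKL", "EF"]},
    {"control": "purple", "disease": "deeppink"},
)
print(svg_text)
```

Writing `svg_text` to a file with an `.svg` extension should give something a browser or Inkscape can open. The two 'EF' hits land in different rows because they overlap, while 'IJKL' shares the bottom row with the first 'EF'.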

Bill Bell

I'm not entirely sure what you are trying to do, so I'll repeat what I think you are saying.

You have a string of characters (A -> T?) which represents an arbitrary protein (let's call it X), with each letter corresponding to one of the 20 amino acids.

You also have a table with two columns, Control and Disease. Each element in the columns is in order but must be aligned to X's sequence. You didn't ask about performing the alignment, and alignment is an entirely different question in its own right, so I'm going to focus on the visualization of your data.

You want to take the X-aligned sequences of Control and Disease and visually compare them on top of X.

You really have three choices.

  • Use matplotlib's text functionality: after performing the match, load the text into text objects to be displayed (probably the most difficult of the options I present).

  • Use Python's Qt interface and do the same with text boxes, which give you automatic scroll functionality (you can use Qt Designer to do this easily). Then use setHtml with HTML formatting around your text to get the proper coloration. Alternatively, you could use Tkinter and do a similar thing.

  • The simplest solution: just make a text file with what you want. You forgo the coloration, but you can much more easily create an array of the same length as X's amino acid sequence, set individual characters in it, and write everything to a file. With a fixed-width font, you can then see where the amino acids line up.
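
The plain-text option might look roughly like the following sketch. The function name `text_layout` and the sample data are hypothetical, and it assumes each fragment matches the reference exactly, so a plain `str.find` stands in for a real alignment step. Each row is an array of spaces as long as X, and a fragment goes into the first row that is still empty at its position.

```python
def text_layout(reference, fragments):
    """Return the stacked fragments above the reference as fixed-width text."""
    rows = []  # each row is a list of single characters, as long as reference
    for frag in fragments:
        start = reference.find(frag)  # exact match, first occurrence only
        if start == -1:
            continue  # fragment does not occur in the reference
        end = start + len(frag)
        for row in rows:
            if all(ch == " " for ch in row[start:end]):
                break  # this row is still free where the fragment goes
        else:
            row = [" "] * len(reference)  # no free row, so start a new one
            rows.append(row)
        row[start:end] = frag  # write the fragment's characters in place
    # The first row filled sits directly above the reference.
    lines = ["".join(row).rstrip() for row in reversed(rows)]
    return "\n".join(lines + [reference])


print(text_layout("ABCDEFGHIJKLMN", ["EF", "IJKL", "EF"]))
```

For this sample input the output is three lines: `    EF`, then `    EF  IJKL`, then `ABCDEFGHIJKLMN`. Calling the function once per column would give one block for Disease and one for Control.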

You could also display this in an HTML page, where you can colorize the text, but then you have to do more work to create scrollable areas, and the display side stops being Python at all.
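
As a rough illustration of the HTML route, the page text itself can still be generated from Python. In this sketch the function name `html_rows`, the colors, and the one-hit-per-line layout are all assumptions chosen to keep it short; a `<pre>` element keeps the spacing fixed-width, and `overflow-x:auto` is one way to get horizontal scrolling for long sequences.

```python
from html import escape


def html_rows(reference, hits):
    """hits is a list of (fragment, css_color); returns a <pre> block."""
    lines = []
    for frag, color in hits:
        start = reference.find(frag)  # exact match, first occurrence only
        if start == -1:
            continue  # skip fragments that do not occur in the reference
        # Pad with spaces so the fragment lines up under a fixed-width font.
        span = f'<span style="color:{color}">{escape(frag)}</span>'
        lines.append(" " * start + span)
    lines.append(escape(reference))  # reference sequence goes on the last line
    return '<pre style="overflow-x:auto">\n' + "\n".join(lines) + "\n</pre>"


snippet = html_rows(
    "ABCDEFGHIJKLMN",
    [("EF", "purple"), ("IJKL", "deeppink")],
)
print(snippet)
```

The returned string can be pasted into the body of an HTML page; each hit gets its own line here, so identical hits simply appear as repeated lines.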

Glorfindel
Krupip
  • Thanks for the suggestions. I'm thinking that I may need to use the Qt interface and Tkinter. I will look more into the alignment situation tonight and see how each of these options stacks up. – Alex Nesta May 19 '17 at 11:07