chaos game for DNA sequences

Question

I have tried the mathematica code for making the chaos game for DNA sequences posted in this address: http://facstaff.unca.edu/mcmcclur/blog/GeneCGR.html

which is like this:

genome = Import["c:\data\sequence.fasta", "Sequence"];
genome = StringReplace[ToString[genome], {"{" -> "", "}" -> ""}];
chars = StringCases[genome, "G" | "C" | "T" | "A"];
f[x_, "A"] := x/2;
f[x_, "T"] := x/2 + {1/2, 0};
f[x_, "G"] := x/2 + {1/2, 1/2};
f[x_, "C"] := x/2 + {0, 1/2};
pts = FoldList[f, {0.5, 0.5}, chars];
Graphics[{PointSize[Tiny], Point[pts]}]

the fasta sequence that I have is just a sequence of letters like AACCTTTGATCAAA and the graph to be generated comes like this:

enter image description here

the code works fine with small sequences, but when I want to put a huge sequence, for example almost 40Mb of a chromosome, the program takes a lot of time and only displays a black square so that it is impossible to analyze. Is it possible to improve the aforementioned code, so that the square in which it would be displayed it would be bigger?, by the way the square must be only the square unit. Thanks for your help in advance

Can you post/link to your large sample dataset, so we have something to test with? — Szabolcs, Nov 04 '11 at 12:52
40Mb of ASCII chromosome seems like a lot of bytes for mathematica to chew through sequentially. This problem looks to be straightforward to solve using a map/reduce technique. I'm not familiar with mathematica but [these functions appear to be available](https://github.com/fmeinberg/MapReduce). — Alec Wenzowski, Nov 04 '11 at 12:56
@Alexander Doesn't the nature of the problem force us to use a sequential solution? The coordinates of point `n+1` are calculated from the coordinates of point `n`, hence there's no other way but to do it sequentially. Unless perhaps one precomputes the transformation *functions* (not results) for certain common character sequences, but that sounds like a lot of work, if feasible at all. — Szabolcs, Nov 04 '11 at 13:40
@Szabolcs looks like time isn't an issue in building the dataset. By my reckoning there are two problems occurring simultaneously for each data point: solving for an x in (-0.5, 0.5) and solving for a y in (-0.5, 0.5). — Alec Wenzowski, Nov 04 '11 at 14:16
We had an earier question on the DNA Chaos game: http://stackoverflow.com/q/5610265/615464 — Sjoerd C. de Vries, Nov 04 '11 at 14:25

Szabolcs · Accepted Answer · 2011-11-04T18:46:15.523

Summary of the incremental edits below:

This will give you a considerable speedup in computing the point coordinates by using compiled code (50x excluding computing shifts):

shifts = chars /. {"A" -> {0., 0.}, "T" -> {.5, 0.}, "G" -> {.5, .5}, "C" -> {0, .5}};
fun1d = Compile[{{a, _Real, 1}}, FoldList[#/2 + #2 &, .5, a], CompilationTarget -> "C"]
pts = Transpose[fun1d /@ Transpose[shifts]];

The bottleneck in your code is actually rendering the graphic, we instead of plotting each point, we'll visualize the density of points:

threshold = 1;
With[{size = 300}, 
 Image[1 - UnitStep[BinCounts[pts, 1/size, 1/size] - threshold]]
]

A region will be coloured black if it has at least threshold points. size is the image-dimension. By either choosing a large size or a large threshold you can avoid the "black square problem".

My original answer with more details:

On my rather dated machine, the code is not very slow.

chars = RandomChoice[{"A", "T", "C", "G"}, 800000];

f[x_, "A"] := x/2;
f[x_, "T"] := x/2 + {1/2, 0};
f[x_, "G"] := x/2 + {1/2, 1/2};
f[x_, "C"] := x/2 + {0, 1/2};
Timing[pts = FoldList[f, {0.5, 0.5}, chars];]
Graphics[{PointSize[Tiny], Point[pts]}]

I get a timing of 6.8 seconds, which is usable unless you need to run it lots of times in a loop (if it's not fast enough for your use case and machine, please add a comment, and we'll try to speed it up).

Rendering the graphic unfortunately takes much longer than this (36 seconds), and I don't know if there's anything you can do about it. Disabling antialiasing may help a little bit, depending on your platform, but not much: Style[Graphics[{PointSize[Tiny], Point[pts]}], Antialiasing -> False] (for me it doesn't). This is a long-standing annoyance for many of us.

Regarding the whole graphic being black, you can resize it using your mouse and make it bigger. The next time you evaluate your expression, the output graphic will remember its size. Or just use ImageSize -> 800 as a Graphics option. Considering the pixel density of screens the only other solution that I can think of (that doesn't involve resizing the graphic) would be to represent pixel density using shades of grey, and plot the density.

EDIT:

This is how you can plot the density (this is also much much faster to compute and render than the point-plot!):

With[{resolution = 0.01}, 
 ArrayPlot@BinCounts[pts, resolution, resolution]
]

Play with the resolution to make the plot nice.

For my random-sequence example, this only gives a grey plot. For your genome data it will probably give a more interesting pattern.

EDIT 2:

Here's a simple way to speed up the function using compilation:

First, replace the characters by the shift vectors (has to be done only once for a dataset, then you can save the result):

arr = chars /. {"A" -> {0., 0.}, "T" -> {.5, 0.}, "G" -> {.5, .5}, "C" -> {0, .5}};

Then let's compile our function:

fun = Compile[{{a, _Real, 2}}, FoldList[#/2 + #2 &, {.5, .5}, a], 
 CompilationTarget -> "C"]

Remove CompilationTarget if your version of Mathematica is earlier than 8 or you don't have a C compiler installed.

fun[arr]; // Timing

gives me 0.6 seconds, which is an instant 10x speedup.

EDIT 3:

Another ~5x speedup is possible compared to the above compiled version by avoiding some kernel callbacks in the compiled function (I checked the compilation output using CompilePrint to come up with this version --- otherwise it's not obvious why it's faster):

fun1d = Compile[{{a, _Real, 1}}, FoldList[#/2 + #2 &, .5, a], 
  CompilationTarget -> "C"]

arrt = Transpose[arr];
Timing[result = fun1d /@ arrt;]
pts = Transpose[result];

This runs in 0.11 seconds on my machine. On a more modern machine it should finish in a few seconds even for a 40 MB dataset.

I split off the transpositions into separate inputs because at this point the running time of fun1d starts to get comparable to the running time of Transpose.

Thank you very much Szabolcs, I have tried the versions that used compilation and they work in a great way, by the way could you explain a little bit more why you use Transpose in the last update. — Layla, Nov 04 '11 at 20:42
Also how can I put a different colour to each nucleotide A,C,T,G; that would help me to see the differences between certain areas. I am just starting in Mathematica so maybe this questions would seen very easy for the people here in this forum. — Layla, Nov 04 '11 at 20:44
@Manolo Can you please explain what you mean by "putting a different colour" to each? Do you mean that the points should be coloured according to which nucleotide was used to transform them? That would just colour the 4 quadrants of the plot separately, so perhaps you mean something else. — Szabolcs, Nov 05 '11 at 19:50
@Manolo I used `Transpose` to convert a list of `n` number-pairs to 2 length-`n` vectors (x and y coordinates), then transform back. Because of how `Compile` works, processing the x and y coordinates separately will be compiled to faster code than processing them together. This is because I can't tell `Compile` that the input is a vector of number-pairs. I can merely tell it that the input is a rank 2 tensor (matrix), and the code it generates is optimized for a matrix of generic size, not an `n` by `2` matrix (which we have). — Szabolcs, Nov 05 '11 at 19:53
Thank you for your replies @Szabolcs, well I wanted the graph to look something like this: http://3.bp.blogspot.com/-LwD3K4qfUrA/TjzFa8HBbZI/AAAAAAAAAF0/4zWdU7eh0s8/s1600/163965394_CGR.bmp that would mean for example that when the letter is A it colour it blue, C green, and so on. How I can do that? another thing I will put the final version of the program with the modifications that you told me here, the speed has reduced, for big sequences approx 25 Mb of compressed data is like 30 min, but for longer sequences the kernel of mathematica shuts down, maybe lack of memory. — Layla, Nov 06 '11 at 15:28
@Manolo can you explain in more detail how those points were coloured? Taking your original example, I think that e.g. the function `f[x_, "T"] := x/2 + {1/2, 0}` will *always* take the point in the top-right quadrant, no matter where the original `x` was. This the top-right quadrant would be T-only, the bottom-left one A-only, etc. Where do I go wrong? — Szabolcs, Nov 07 '11 at 15:50
Hi another question @Szabolcs, how much processing time you got for a sequence of approximate 90 Mb, I am using a fasta sequence compressed in gz format, and the waiting time is almost one hour, why do you think that is this bottleneck? thanks — Layla, Nov 11 '11 at 20:10

chaos game for DNA sequences

1 Answers1

Linked