5

I need to optimize calculation of the frequencies of gametes in populations.

I have np populations and Ne individuals in each population. Each individual is formed by two gametes (male and female). Each gamete contains three genes. Each gen may be 0 or 1. So each individual is a 2x3 matrix. Each row of the matrix is a gamete given by one of the parents. The set of individuals in each population may be arbitrary (but always of Ne length). For simplicity initial populations with individuals may be given as:

Ne = 300; np = 3^7;
(*This table may be arbitrary with the same shape*)
ind = Table[{{0, 0, 0}, {1, 1, 1}}, {np}, {Ne}]

Full set of all possible gametes:

allGam = Tuples[{0, 1}, 3]

Each individual can generate a gamete by 8 possible ways with equal probability. These gametes are: Tuples@Transpose@ind[[iPop, iInd]] (where iPop and iInd - indexes of population and of individual in that population). I need to calculate the frequencies of gametes generated by individuals for each population.

At this moment my solution is as follows.

At first, I convert each individual into gametes it can produce:

gamsInPop = Map[Sequence @@ Tuples@Transpose@# &, ind, {2}]

But more efficient way to do this is:

gamsInPop = 
 Table[Join @@ Table[Tuples@Transpose@ind[[i, j]], {j, 1, Ne}], {i, 1, np}]

Secondly, I calculate the frequencies of gametes produced including zero frequencies for gametes that are possible but absent in population:

gamFrq = Table[Count[pop, gam]/(8 Ne), {pop, gamInPop}, {gam, allGam}]

More efficient version of this code:

gamFrq = Total[
   Developer`ToPackedArray[
    gamInPop /. Table[
      allGam[[i]] -> Insert[{0, 0, 0, 0, 0, 0, 0}, 1, i], {i, 1, 
       8}]], {2}]/(8 Ne)

Unfortunately, the code is still too slow. Can anybody help me to speed-up it?

Mr.Wizard
  • 24,179
  • 5
  • 44
  • 125
Alexey Popkov
  • 9,355
  • 4
  • 42
  • 93

1 Answers1

6

This code:

Clear[getFrequencies];
Module[{t = 
   Developer`ToPackedArray[
     Table[FromDigits[#, 2] & /@ 
         Tuples[Transpose[{
            PadLeft[IntegerDigits[i, 2], 3], 
            PadLeft[IntegerDigits[j, 2], 3]}]], 
       {i, 0, 7}, {j, 0, 7}]
    ]},
   getFrequencies[ind_] :=
    With[{extracted = 
       Partition[
          Flatten@Extract[t, Flatten[ind.(2^Range[0, 2]) + 1, 1]], 
          Ne*8]},
        Map[
         Sort@Join[#, Thread[{Complement[Range[0, 7], #[[All, 1]]], 0}]] &@Tally[#] &, 
         extracted
        ][[All, All, 2]]/(Ne*8)
    ]
]

utilizes conversion to decimal numbers and packed arrays, and speeds your code up by a factor of 40 on my machine. The benchmarks:

In[372]:= Ne=300;np=3^7;
(*This table may be arbitrary with the same shape*)
inds=Table[{{0,0,0},{1,1,1}},{np},{Ne}];

In[374]:= 
getFrequencies[inds]//Short//Timing
Out[374]= {0.282,{{1/8,1/8,1/8,1/8,1/8,1/8,1/8,1/8},<<2185>>,
{1/8,1/8,1/8,1/8,1/8,1/8,1/8,1/8}}}

In[375]:= 
Timing[
  gamsInPop=Table[Join@@Table[Tuples@Transpose@inds[[i,j]],{j,1,Ne}],{i,1,np}];
  gamFrq=Total[Developer`ToPackedArray[gamsInPop/.Table[allGam[[i]]->
         Insert[{0,0,0,0,0,0,0},1,i],{i,1,8}]],{2}]/(8 Ne)//Short]

Out[375]= {10.563,{{1/8,1/8,1/8,1/8,1/8,1/8,1/8,1/8},<<2185>>,
  {1/8,1/8,1/8,1/8,1/8,1/8,1/8,1/8}}}

Note that in general (for random populations), the ordering of frequencies in your and my solutions are for some reason different, and

In[393]:= fr[[All,{1,5,3,7,2,6,4,8}]] == gamFrq
Out[393]= True

Now, some explanation: first, we create a table t, which is constructed as follows: each gamete is assigned a number from 0 to 7, which corresponds to the zeros and ones in it treated as binary digits. The table then has the possible gametes produced by an individual, stored in a position {i,j}, where i is a decimal for mother's gamete (say), and j - for fathers's, for that individual (each individual is uniquely identified by a pair {i,j}). The gametes produced by individual are also converted to decimals. Here is how it looks:

In[396]:= t//Short[#,5]&
Out[396]//Short= {{{0,0,0,0,0,0,0,0},{0,1,0,1,0,1,0,1},{0,0,2,2,0,0,2,2},
{0,1,2,3,0,1,2,3},{0,0,0,0,4,4,4,4},{0,1,0,1,4,5,4,5},{0,0,2,2,4,4,6,6},
{0,1,2,3,4,5,6,7}},<<6>>,{{7,6,5,4,3,2,1,0},{7,7,5,5,3,3,1,1},{7,6,7,6,3,2,3,2},
<<2>>,{7,7,5,5,7,7,5,5},{7,6,7,6,7,6,7,6},{7,7,7,7,7,7,7,7}}}

A very important (crucial) step is to convert this table to a packed array.

The line Flatten[ind.(2^Range[0, 2]) + 1, 1]] converts parents' gametes from binary to decimal for all individuals at once, in all populations, and adds 1 so that these become indices at which the list of possible to produce gametes is stored in a table t for a given individual. We then Extract all of them at once, for all populations, and use Flatten and Partition to recover back the population structure. Then, we compute frequencies with Tally, append missing gametes with frequencies zero (done by Join[#, Thread[{Complement[Range[0, 7], #[[All, 1]]], 0}]] line), and Sort each frequency list for a fixed population. Finally, we extract the frequencies and discard the gamete decimal index.

All operations are pretty fast since performed on packed arrays. The speedup is due to the vectorized formulation of the problem and use of packed arrays. It is also much more memory - efficient.

Leonid Shifrin
  • 22,449
  • 4
  • 68
  • 100
  • @Sjoerd This is a bug, remained from testing, I just fixed it. – Leonid Shifrin Aug 25 '11 at 12:13
  • Could you explain why you pack it all in `Module`? Is that just to have a localized t$398339 variable? – Sjoerd C. de Vries Aug 25 '11 at 12:30
  • @Sjoerd Yes, just for that. It does not have to be recomputed every time the function is called, and allowing the function to refer to (implicitly depend on) a global variable is IMO a bad practice - I explained my view on this issue in detail here:http://stackoverflow.com/questions/6236458/plot-using-with-versus-plot-using-block-mathematica/6236808#6236808 – Leonid Shifrin Aug 25 '11 at 12:35
  • @Leonid Does the list of indexes in `fr[[All,{1,5,3,7,2,6,4,8}]]` always give the same ordering of frequencies as originally? The exact correspondence between a gamete and its frequency is necessary. – Alexey Popkov Aug 25 '11 at 14:26
  • @Alexey I did not test extensively, but it should, since the correspondence between a gamete and its index, as well as the sorting procedure, do not depend on the dataset. – Leonid Shifrin Aug 25 '11 at 14:58
  • I can see your point but isn't this t$67639 just as global as any other variable, be it with an unlikelier name? And, as a developer you'll execute various versions of this code numerous times during development, which means filling up name space and memory with various copies of t$... (with `Names["Global`t*"]` I get {"t", "table", "t$430055", "t$430143", "t$430287"}). t$430287 is rather small now, but what if it is an MRI stack? – Sjoerd C. de Vries Aug 25 '11 at 14:59
  • @Sjoerd This is a final version, so it is supposed to be executed only once. During the development, I leave `Module` out, keeping `t` global. But there is no problem in multiple execution - only the last definition is kept. The point is, `t` is not exposed to the top-level, so someone who wants to change it can only do so by going inside that `Module`, but not from anywhere else in the program. This makes it less likely to produce unnoticed changes which leads to bugs. – Leonid Shifrin Aug 25 '11 at 15:03
  • @Leonid I just has figured out how the code works. Very clever, thank you! – Alexey Popkov Aug 26 '11 at 05:13