Generate (not so)random string with particular string occurences

Question

I have a requirement where I have the alphabet 'ACGT' and I need to create a string of around 20,000 characters. This string should contain 100+ occurrences of the pattern "CCGT". Most of the time the generated string contains it around 20-30 instances.

    int N = 20000;
    std::string alphabet("ACGT");
    std::string str;
    str.reserve(N);
    for (int index = 0; index < N; index++)
    {
        str += alphabet[rand() % (alphabet.length())];
    }

How do I tweak the code so that the pattern would appear more often?

Edit - Is there a way of changing the alphabet, i.e - 'A', 'C', 'G', 'T', 'CCGT' as characters of the alphabet?

Thank you.

Every `N / 4 / 100` iteration, insert the special sequence, while skipping four characters in `index`? — Some programmer dude, Apr 23 '15 at 10:37
The problem seems to be `rand`, I'd recommend looking at [these alternatives](http://stackoverflow.com/a/20136256/493122). — Shoe, Apr 23 '15 at 10:37
@Jefffrey I think the OP wants to weigh the randomness to generate a special sequence more often than the other, which is totally different than the duplicate, so I vote to reopen. — Some programmer dude, Apr 23 '15 at 10:40
How can you have more than one probability of a substring? I think you meant "occurrance". And how can it be random if there must be at least 100 — Skizz, Apr 23 '15 at 10:43
Every iteration where index < N - 4 generate rand() % 300 and insert CCGT if it equals 1? — Andy Newman, Apr 23 '15 at 10:46
Actually, how random does this have to be? Do you still want a fair distribution of 'A' characters, because most solutions you get here will result in there being a suspiciously low count of 'A's — Andy Newman, Apr 23 '15 at 10:48
@AndyNewman: The problem is that "rand() % 300" may never be 1 (because it's random) and so the "100+" criteria will fail — Skizz, Apr 23 '15 at 10:49
I was thinking along the lines of changing the alphabet, i.e - 'A', 'C', 'G', 'T', 'CCGT' as characters of the alphabet.. But not sure on how to achieve it and if its the best way. — Madz, Apr 23 '15 at 10:51
How about generate a random list of 100 indexes that are all at least 4 apart from each other. Sort them by increasing size. Then as you loop through where you generate the string, if `index` is equal to your next special index you insert the string and increase `index` by an extra 3 (since the loop already increments it by 1). — Kvothe, Apr 23 '15 at 10:52
Also, how random should the distribution of the other letters be? Do you want equal amounts of 'A', 'C', 'G' and 'T'? Or should their distribution be truly random (or as close as one can get to that when using `rand`)? — Kvothe, Apr 23 '15 at 11:10
@Kvothe I'd like it to be truly random.. there's really no need for equal distribution. — Madz, Apr 23 '15 at 11:12

score 2 · Answer 1 · answered Apr 23 '15 at 11:08

2

Generate an array of ints containing 100 x 0s and 490 1s, 2s, 3s and 4s [000000....111111....2222 etc] making almost 20,000 entries.

Then random shuffle it (std::random_shuffle)

Then write a string where each 0 translates to 'CCGT', each 1 translates to 'A', each 2 .... etc

I think that gives you what you want, and by tweaking the original array of ints you could change the number of 'A' characters in the output too.

Edit: If that isn't random enough, do 100 0s at the start and then random 1-4 for the rest.

answered Apr 23 '15 at 11:08

Andy Newman

1,178
2
8
23

I believe it should be 400 of 1s - 4s, currently it adds up to 2360 characters. – Kvothe Apr 23 '15 at 11:12
well, 100 0s, and (20,000 - (100 * 4)) of the rest in total - I assume magic numbers won't appear in the real code either – Andy Newman Apr 23 '15 at 11:44
This is almost the OPs original suggestion - it just simplifies the problem by treating a 'CCGT' as a single character until it gets to the end. You could as easily generate the string including 100 'X' characters and then substitute afterwards, although I think shuffling is the easiest way to get 100. – Andy Newman Apr 23 '15 at 12:14

score 1 · Answer 2 · answered Apr 23 '15 at 10:53

The only solution I can think of that would meet the "100+" criteria is:

create 20000 character string
number of instances (call it n) = 100 + some random value
for (i = 0 ; i < n ; ++i)
{
   pick random start position
   write CCGT
}

Of course, you'd need to ensure the overwritten characters weren't part of a "CCGT" already.

score 1 · Answer 3 · answered Apr 23 '15 at 11:02

My first thought would be to generate a list of 100 indexes where you will definitely insert the special string. Then as you generate the random string, insert the special string at each of these indexes as you reach them.

I've missed out checking that the intervals are spaced appropriately (cannot be within 4 of another interval) and also sorting them in ascending order - both of which would be necessary for this to work.

int N = 20000;
std::string alphabet("ACGT");
int intervals[100];
for (int index = 0; index < 100; index)
{
    intervals[index] = rand() % 2000;
    // Do some sort of check to make sure each element of intervals is not
    // within 4 of another element and that no elements are repeated
}
// Sort the intervals array in ascending order
int current_interval_index = 0;
std::string str;
str.reserve(N);
for (int index = 0; index < N; index++)
{
    if (index == intervals[current_interval_index])
    {
        str += alphabet;
        current_interval_index++;
        index += 3;
    }
    else
    {
        str += alphabet[rand() % (alphabet.length())];
    }
}

score 1 · Answer 4 · answered Apr 23 '15 at 11:43

The solution I came up with uses a std::vector to contain all the random sets of 4 chars including the 100 special sequence. I then shuffle that vector to distribute the 100 special sequence randomly throughout the string.

To make the distribution of letters even I create an alternative alphabet string called weighted that contains a relative abundance of alphabet characters according to what has already been included from the 100 special sequence.

int main()
{
    std::srand(std::time(0));

    using std::size_t;

    const size_t N = 20000;

    std::string alphabet("ACGT");

    // stuff the ballot
    std::vector<std::string> v(100, "CCGT");

    // build a properly weighted alphabet string
    // to give each letter equal chance of appearing
    // in the final string

    std::string weighted;

    // This could be scaled down to make the weighted string much smaller

    for(size_t i = 0; i < (N - 200) / 4; ++i) // already have 200 Cs
        weighted += "C";

    for(size_t i = 0; i < (N - 100) / 4; ++i) // already have 100 Ns & Gs
        weighted += "GT";

    for(size_t i = 0; i < N / 4; ++i)
        weighted += "A";

    size_t remaining = N - (v.size() * 4);

    // add the remaining characters to the weighted string
    std::string s;
    for(size_t i = 0; i < remaining; ++i)
        s += weighted[std::rand() % weighted.size()];

    // add the random "4 char" sequences to the vector so
    // we can randomly distribute the pre-loaded special "4 char"
    // sequence among them.
    for(std::size_t i = 0; i < s.size(); i += 4)
        v.push_back(s.substr(i, 4));

    // distribute the "4 char" sequences randomly
    std::random_shuffle(v.begin(), v.end());

    // rebuild string s from vector
    s.clear();
    for(auto&& seq: v)
        s += seq;

    std::cout << s.size() << '\n'; // should be N
}

score 1 · Answer 5 · answered Apr 23 '15 at 11:43

I like the answer by @Andy Newman and think that is probably the best way - the code below is a compilable example of what they suggested.

#include <string>
#include <algorithm>
#include <iostream>

int main()
{
    srand(time(0));
    int N = 20000;
    std::string alphabet("ACGT");
    std::string str;
    str.reserve(N);
    int int_array[19700];
    // Populate int array
    for (int index = 0; index < 19700; index++)
        {
        if (index < 100)
        {
            int_array[index] = 0;
        }
        else
        {
            int_array[index] = (rand() % 4) + 1;
        }
    }
    // Want the array in a random order
    std::random_shuffle(&int_array[0], &int_array[19700]);
    // Now populate string from the int array
    for (int index = 0; index < 19700; index++)
    {
        switch(int_array[index])
        {
            case 0:
                str += alphabet;
                break;
            case 1:
                str += 'A';
                break;
            case 2:
                str += 'C';
                break;
            case 3:
                str += 'G';
                break;
            case 4:
                str += 'T';
                break;
            default:
                break;
        }
    }
    // Print out to check what it looks like
    std::cout << str << std::endl;
}

sh1 · Answer 6 · 2015-04-24T01:37:26.497

You should make N larger.

I take this liberty because you say 'create a string of around 20,000 characters'; but there's more to it than that.

If you're only finding around 20-30 instances in a string of 20000 characters then something is wrong. A ballpark estimate is to say that there are 20000 character positions to test, and at each of these there will be a four-letter string from an alphabet of four letters, giving a 1/256 chance of it being a specific string. The average should be (approximately; because I've oversimplified) 20000/256, or 78 hits.

It could be that your string isn't properly randomised (likely due to the use of the modulo idiom), or perhaps you're testing only every fourth character position -- as if the string were a list of non-overlapping four-letter words.

If you can bring your average hit rate back up to 78, then you can reach a little further to your 100 requirement simply by increasing N proportionally.

Generate (not so)random string with particular string occurences

6 Answers6