
For some overall context: for each type of 3-letter substring in my sequence (64 combinations in total), I have a set of probabilities that it changes to another given 3-letter substring. I want to apply these probabilities to every 3-letter substring in my sequence and change each one if the probability indicates that it should.

Essentially, I want to randomly change 3 letter substrings within a very large string to another 3 letter substring based on known probabilities.

For example:

I have a string.

'GACTCGTAGCTAGCTG'

I have some probabilities for the substring 'GAC'

{'GAC>GAT': 0.05, 'GAC>GAG': 0.01, 'GAC>GAA': 0.03}

In this case, each 'GAC' in my string would have a 5% chance of changing to 'GAT', a 1% chance of changing to 'GAG', and a 3% chance of changing to 'GAA'. What is the most efficient way to apply these probabilities to each 3-letter substring in my giant string?

David Chen

1 Answer


Ok, the code below should do the trick. I cleaned up your dictionary to just have the replacement values.

The code finds every place in the long string where "GAC" occurs (re.finditer returns non-overlapping matches) and, for each of those places, randomly chooses what to replace it with according to the weights. That is why I included "GAC" in the dictionary: it replaces "GAC" with "GAC" 91% of the time, which leaves that occurrence unchanged. random_replace then returns the updated string.
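
The 91% for "GAC" is what remains after subtracting the three mutation probabilities from 1. As a rough sketch, a map like this could be derived from the 'GAC>GAT'-style keys in the question. The helper name build_replace_maps and the nested-dictionary layout are assumptions made for illustration, not part of the original code, and the sketch assumes no key maps a substring to itself.

```python
def build_replace_maps(mutation_probs: dict) -> dict:
    """Group 'SRC>DST' probabilities by source and add the 'no change' weight."""
    replace_maps = {}
    for key, prob in mutation_probs.items():
        source, target = key.split('>')
        replace_maps.setdefault(source, {})[target] = prob

    for source, targets in replace_maps.items():
        # Whatever probability is left over means "keep the substring as is".
        stay = 1.0 - sum(targets.values())
        if stay < 0:
            raise ValueError(f"probabilities for {source!r} sum to more than 1")
        targets[source] = stay
    return replace_maps


maps = build_replace_maps({'GAC>GAT': 0.05, 'GAC>GAG': 0.01, 'GAC>GAA': 0.03})
```

Here maps['GAC'] has the same shape as the replace_map used in the answer's code, with the leftover weight stored under 'GAC' itself.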

Note that the str and dict annotations are only there to help you understand what to pass in; they are not necessary if you don't want them.

import re
import random

test_string = 'GAC' * 100

replace_map = {'GAT': 0.05, 'GAG': 0.01, 'GAA': 0.03, 'GAC': 0.91}

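def random_replace(to_replace: str, full_string: str, replace_map: dict) -> str:
    # Start index of every non-overlapping occurrence of to_replace.
    replace_indices = [m.start() for m in re.finditer(re.escape(to_replace), full_string)]
    # Split the map into two parallel sequences: candidates and their weights.
    population, weights = zip(*replace_map.items())

    for i in replace_indices:
        # Draw one weighted replacement and splice it in. Replacements have the
        # same length as to_replace, so the remaining indices stay valid.
        full_string = full_string[:i] + random.choices(population, weights)[0] + full_string[i+len(to_replace):]

    return full_string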

if __name__ == "__main__":
    print(random_replace("GAC", test_string, replace_map))

To learn more about random.choices, reference this SO post.

To learn more about why I use zip to create two lists from the keys and values of the dictionary, look here.
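
For a very long string and many different 3-letter substrings, re-slicing the whole string on every replacement gets expensive, because each slice copies the string. A possible single-pass variant is sketched below. It assumes the sequence is read as consecutive, non-overlapping triplets starting at index 0, which differs from the regex search above (that one matches "GAC" at any offset). The name mutate_triplets and the nested replace_maps layout are hypothetical.

```python
import random


def mutate_triplets(sequence: str, replace_maps: dict, rng=random) -> str:
    """Mutate each non-overlapping triplet according to its own weight map."""
    # Precompute (population, weights) once per triplet type, not per position.
    tables = {
        triplet: (list(options), list(options.values()))
        for triplet, options in replace_maps.items()
    }

    pieces = []
    for i in range(0, len(sequence), 3):
        triplet = sequence[i:i + 3]
        table = tables.get(triplet)
        if table is None:
            # No probabilities known for this triplet (or a short tail): keep it.
            pieces.append(triplet)
        else:
            population, weights = table
            pieces.append(rng.choices(population, weights)[0])
    # Joining once at the end avoids copying the whole string per replacement.
    return ''.join(pieces)


example_maps = {'GAC': {'GAT': 0.05, 'GAG': 0.01, 'GAA': 0.03, 'GAC': 0.91}}
mutated = mutate_triplets('GACTCGTAGCTAGCTG', example_maps)
```

Passing a seeded random.Random instance as rng makes a run reproducible. If every offset should be considered, as in the regex version, the windowing would need to change and overlapping matches would need a rule of their own.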

Jack Moody