I am looking for a way to quantify the repetitiveness of a DNA sequence. My question is: how are the tandem repeats of a single nucleotide distributed within a given DNA sequence? To answer that, I need a simple way to "compress" a sequence by collapsing each run of identical letters into a count followed by the letter.
For instance:
AAAATTCGCATTTTTTAGGTA --> 4A2T1C1G1C1A6T1A2G1T1A
From this I would be able to extract the numbers to study the distribution of the run lengths (my guess is a Poisson distribution), like this:
4A2T1C1G1C1A6T1A2G1T1A --> 4 2 1 1 1 1 6 1 2 1 1
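
To make the target output concrete, here is a minimal Python sketch of the two steps above. It is not the bash/regex solution I am after, only an illustration of the expected behaviour. The function names are my own, hypothetical choices; the only library call is `itertools.groupby` from the standard library, which groups consecutive identical items.

```python
from itertools import groupby


def run_length_encode(seq):
    """Return one (count, letter) pair per run of identical consecutive letters."""
    return [(len(list(group)), letter) for letter, group in groupby(seq)]


def encode_as_string(seq):
    """Format the runs as count+letter, e.g. 'AAAATT' becomes '4A2T'."""
    return "".join(f"{count}{letter}" for count, letter in run_length_encode(seq))


def run_lengths(seq):
    """Keep only the run lengths, which is what the distribution is built from."""
    return [count for count, _ in run_length_encode(seq)]


if __name__ == "__main__":
    dna = "AAAATTCGCATTTTTTAGGTA"
    print(encode_as_string(dna))                   # 4A2T1C1G1C1A6T1A2G1T1A
    print(" ".join(str(n) for n in run_lengths(dna)))  # 4 2 1 1 1 1 6 1 2 1 1
```

The sketch is case-sensitive and treats every character as a letter, so line breaks or FASTA headers would have to be stripped first.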
The limiting step for me is the first one. Some existing threads answer my question, but I am looking for a bash solution using regular expressions:
- how to match dna sequence pattern (solution in C++)
- Analyze tandem repeat motifs in DNA sequences (solution in Python)
- Sequence Compression? (solution in JavaScript)
So if my question inspires some regex kings, it would help me a lot. If there is software that already does this, I would gladly use it as well!
Thanks all, I hope I was clear enough.
Egill