bash - Making repetitions in a sequence explicit: how to turn AACCCC into 2A4C?

Question

I am looking for a way to quantify the repetitiveness of a DNA sequence. My question is: how are the tandem repeats of a single nucleotide distributed within a given DNA sequence? To answer that, I need a simple way to "compress" a sequence in which identical letters are repeated several times.

For instance:

AAAATTCGCATTTTTTAGGTA --> 4A2T1C1G1C1A6T1A2G1T1A

From this I would be able to extract the numbers to study the distribution of the repetitions (probably a Poisson distribution, I would guess), like this:

4A2T1C1G1C1A6T1A2G1T1A --> 4 2 1 1 1 1 6 1 2 1 1

The limiting step for me is the first one. There are some topics that answer my question, but I am looking for a bash solution using regular expressions:

how to match dna sequence pattern (solution in C++)
Analyze tandem repeat motifs in DNA sequences (solution in python)
Sequence Compression? (solution in Javascript)

So if my question inspires some regex kings, it would help me a lot. If there is software that does this, I would gladly take that as well!

Thanks all, I hope I was clear enough

Egill

The "first step" you mention is really two steps: (1) split the input string into an array of runs of identical letters (AAAA TT C G ...), and then, looping over the array, (2) turn each run, such as CCCC, into its compressed form, e.g. 4C. These are clearly two structurally different steps, and I can't tell from your question which of the two you got stuck on. It would help if you posted your own attempt to solve the problem. - user1934428, Oct 20 '21 at 13:10
This is "run-length encoding". Lots of examples at https://rosettacode.org/wiki/Run-length_encoding - glenn jackman, Oct 20 '21 at 13:23

Fonic · Answer 1 · 2021-10-21T10:57:08.690

As others mentioned, Bash might not be ideal for data crunching. That being said, the compression part is not that difficult to implement:

#!/usr/bin/env bash

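# Compress DNA sequence [$1: sequence string, $2: name of output variable]
# Note: 'local -n' (nameref) requires Bash 4.3 or later
function compress_sequence() {
    local input="$1"
    local -n output="$2"; output=""
    local curr_char="" last_char="${input:0:1}" char_count=1 i
    # The loop runs one position past the end on purpose: the empty string
    # read there differs from the last character and flushes the final run
    for ((i=1; i <= ${#input}; i++)); do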
        curr_char="${input:i:1}"
        if [[ "${curr_char}" != "${last_char}" ]]; then
            output+="${char_count}${last_char}"
            last_char="${curr_char}"
            char_count=1
        else
            char_count=$((char_count + 1))
        fi
    done
}

compress_sequence "AAAATTCGCATTTTTTAGGTA" compressed
echo "${compressed}"

This algorithm processes the sequence string character by character, counts identical characters and adds <count><char> to the output whenever characters change. I did not use regular expressions here and I'm pretty sure there wouldn't be any benefits in doing so.
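For comparison, the run detection can also be written with a regular expression by letting grep find the runs. The following is only a minimal sketch and not part of the function above: the name rle_regex is made up for illustration, and it assumes the sequence contains nothing but uppercase A, C, G and T (any other character is silently dropped). It relies on the -o option as found in GNU and BSD grep.

```shell
#!/usr/bin/env bash

# Hypothetical helper: run-length encode $1 using a regular expression.
# Assumption: the sequence contains only uppercase A, C, G and T;
# grep silently skips any other character.
rle_regex() {
    # Nothing to encode for an empty sequence
    [[ -n "$1" ]] || return 0
    printf '%s\n' "$1" |
        grep -oE 'A+|C+|G+|T+' |   # one run of identical letters per line
        awk '{ printf "%d%s", length($0), substr($0, 1, 1) } END { print "" }'
}

rle_regex "AAAATTCGCATTTTTTAGGTA"
```

For the example sequence this should print 4A2T1C1G1C1A6T1A2G1T1A. The regex version is shorter, but it starts three external processes per call, so the pure Bash loop may still be preferable when many short sequences are processed.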

I might as well add the number extracting part as it is trivial:

numbers_string="${compressed//[^0-9]/ }"
numbers_array=(${numbers_string})

This replaces every character that is not a digit with a space. The unquoted expansion then splits on whitespace to fill the array, which is just a suggestion for further processing.
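To look at the distribution of the repetitions, the extracted numbers can be tallied with sort and uniq. This is a hedged sketch rather than a definitive implementation: the function name run_length_histogram and the "length count" output layout are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash

# Hypothetical helper: for a compressed string such as 4A2T1C (passed as $1),
# print one line per distinct run length: "<run length> <number of runs>".
run_length_histogram() {
    # Replace every non-digit with a space, as in the snippet above
    local numbers_string="${1//[^0-9]/ }"
    local -a numbers_array
    # Split on whitespace into an array without relying on unquoted expansion
    read -ra numbers_array <<< "${numbers_string}"
    # Nothing to count if the input contained no digits
    ((${#numbers_array[@]})) || return 0
    printf '%s\n' "${numbers_array[@]}" | sort -n | uniq -c | awk '{ print $2, $1 }'
}

run_length_histogram "4A2T1C1G1C1A6T1A2G1T1A"
```

For the example string, the run lengths are 4 2 1 1 1 1 6 1 2 1 1, so the output should list length 1 seven times, length 2 twice, and lengths 4 and 6 once each.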
