How to count consecutive (repeated) character in string in bash?

Question

I am wondering if there is a simple bash or AWK oneliner to get the number of repeated characters, per repeat.

For example considering this string:

AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA

Is it possible to get the number of Ns in the first repeat, the number of Ns in the second repeat, etc.?

Thanks!

Expected results, the length of each repeat on a new line.

What efforts did you make? Post them even if it did not solve your problem — Inian, Aug 31 '17 at 10:55
At a minimum at least add your expected output - all on one line, spaces or commas between, on separate lines, etc... — Ed Morton, Aug 31 '17 at 12:51
I was satisfied with the first answer from anubhava, see comments under his answer. I added expected results, as you asked for. — benn, Aug 31 '17 at 12:57
We're not looking for a description of the expected results (though it's fine to have that too), we're looking for the actual expected output given the input you posted. This site isn't just for you to get an answer to your question, it's a repository for others to look up their questions to find answers so it's important that a question be a complete one (see [ask]) to help everyone else in future. — Ed Morton, Aug 31 '17 at 13:00

anubhava · Accepted Answer · 2017-08-31T11:48:35.283

6

You can use awk to split fields on each run of characters that are not N, then print each non-empty field and its length:

s='AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA'

awk -F '[^N]+' '{for (i=1; i<=NF; i++) if ($i != "") print $i, length($i)}' <<< "$s"

NNNNN 5
NNNNNNNN 8
NNNNNNN 7

Another option is to use grep + awk:

grep -Eo 'N+' <<< "$s" | awk '{print $1, length($1)}'

And here is a pure Bash solution:

shopt -s extglob
while read -r line; do
    [[ -n $line ]] && echo "$line ${#line}"
done <<< "${s//+([!N])/$'\n'}"

NNNNN 5
NNNNNNNN 8
NNNNNNN 7

BASH solution details:

It uses an extended glob pattern, +([!N]), to match one or more non-N characters and replaces each match with a line break: "${s//+([!N])/$'\n'}"
Using a while loop, we iterate over each resulting line, that is, each substring of N characters (empty lines are skipped).
Inside the loop we print each string and the length of that string.
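
The steps above can be wrapped in a small function. This is only a sketch: the name n_run_lengths is made up for illustration, and it prints just the lengths, which is what the question asks for.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the original answer):
# print the length of each run of N in the first argument, one per line.
shopt -s extglob

n_run_lengths() {
    local s=$1 line
    # Replace every stretch of non-N characters with a newline, so each
    # run of N ends up on its own line (some lines will be empty).
    while read -r line; do
        if [[ -n $line ]]; then
            printf '%s\n' "${#line}"
        fi
    done <<< "${s//+([!N])/$'\n'}"
}

n_run_lengths 'AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA'
# prints 5, 8 and 7 on separate lines
```

Using if instead of && keeps the function's exit status at 0 when the last line read is empty.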

edited Aug 31 '17 at 11:48

answered Aug 31 '17 at 10:57

anubhava


[See working demo](https://ideone.com/ZBsIol) What output are you getting? – anubhava Aug 31 '17 at 11:07
Another option is to use: `grep -Eo 'N{2,}' <<< "$s" | awk '{print $1, length($1)}'` – anubhava Aug 31 '17 at 11:09
This worked: `awk -F '[^N]+' '{for (i=1; i<=NF; i++) if ($i != "") print length($i)}' <<< "$s"` – benn Aug 31 '17 at 11:12
yes that will work but it will show string with single `N`, if that's fine with you I will revert back my change. Also did you check working demo? – anubhava Aug 31 '17 at 11:15
Oh, I see, I was only interested in the length of the repeat. – benn Aug 31 '17 at 11:16
ok I have changed awk command and provided 2 alternative solutions as well in my answer. – anubhava Aug 31 '17 at 11:19
@anubhava can you explain the bash part please? – raam86 Aug 31 '17 at 11:45
1

@raam86: Details added in answer. – anubhava Aug 31 '17 at 11:48
1

didn't realize you are referring to `s` defined earlier, thank you for the detailed answer – raam86 Aug 31 '17 at 12:44

Rahul Verma · Answer 2 · 2017-08-31T13:39:28.137

4

A simple solution:

echo "$string" | grep -oE "N+" | awk '{ print $0, length}'

NNNNN 5
NNNNNNNN 8
NNNNNNN 7

EDIT:
As per suggestion of @Ed-Morton: Changing -P to -E.
Man page of grep says -P is "highly experimental" functionality.
We don't need PCREs to use +, just EREs are sufficient.

edited Aug 31 '17 at 13:39

answered Aug 31 '17 at 12:54

Rahul Verma


2

You don't need PCREs to use `+`, just EREs, so use `-E` instead of `-P` so your grep isn't relying on "highly experimental" (see the man page!) functionality. – Ed Morton Aug 31 '17 at 13:12
1

@EdMorton: Thanks Ed. Yeah I'll take care of that from next time. Let me edit too. And performance wise which is better according to you ? – Rahul Verma Aug 31 '17 at 13:35
1

PCREs use a very different algorithm/regexp engine from BREs and EREs to accommodate look ahead/behind/whatever and that engine is much slower even if you don't use any PCRE-specific features so BRE and ERE are faster than PCRE. See https://swtch.com/~rsc/regexp/regexp1.html for details. – Ed Morton Aug 31 '17 at 13:54
1

Okay. Yeah makes sense. (y). – Rahul Verma Aug 31 '17 at 13:55

score 3 · Answer 3 · answered Aug 31 '17 at 13:07

3

With GNU awk for multi-char RS:

$ awk -v RS='N+' 'RT{print length(RT)}' file
5
8
7

$ awk -v RS='N+' 'RT{print RT, length(RT)}' file
NNNNN 5
NNNNNNNN 8
NNNNNNN 7
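
RT is a GNU awk extension. For other awks, a minimal sketch of the same idea with the POSIX match() function might look like the following; the function name n_runs_posix is only illustrative and is not from the answer above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read lines on stdin and print each run of N
# together with its length, without relying on gawk's RT.
n_runs_posix() {
    awk '{
        s = $0
        # match() sets RSTART and RLENGTH to the leftmost-longest match
        while (match(s, /N+/)) {
            print substr(s, RSTART, RLENGTH), RLENGTH
            # continue searching after the run just reported
            s = substr(s, RSTART + RLENGTH)
        }
    }'
}

printf '%s\n' 'AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA' | n_runs_posix
```

Unlike the RS-based version, this treats each input line separately, so a run of N that spans a line break would be reported as two runs.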

answered Aug 31 '17 at 13:07

Ed Morton


Thanks for your help, but I don't get results from your codes. How to use file? – benn Aug 31 '17 at 13:46
`file` is just a file containing the input string shown in your question. You could use `echo 'AATGATGGAANNN...' | awk -v RS='N+' 'RT{print length(RT)}'` instead. As it says, though, you've got to be using GNU awk. – Ed Morton Aug 31 '17 at 13:59
1

This could be golfed to `$0=length(RT)` – Thor Aug 31 '17 at 15:07

score 2 · Answer 4 · answered Aug 31 '17 at 11:34

Here's a Perl one-liner:

perl -ne 'while (m/(.)(\1*)/g) { printf "%5i %s\n", length($2)+1, $1 }' <<<AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA
2 A
1 T
1 G
1 A
1 T
2 G
2 A
5 N
1 G
1 A
1 T
1 A
1 G
2 A
1 C
1 G
1 A
1 T
8 N
1 G
1 A
1 T
2 A
1 T
1 G
1 A
7 N
1 T
1 A
1 G
1 A
1 C
1 T
1 G
1 A

The m/(.)(\1*)/ successively matches as many identical characters as possible, with the /g causing the matching to pick up again on the next iteration for as long as the string still contains something which we have not yet matched. So we are looping over the string in chunks of identical characters, and on each iteration, printing the first character as well as the length of the entire matched string.

The first pair of parentheses capture a character at the beginning of the (remaining unmatched) line, and \1 says to repeat this character. The * quantifier matches this as many times as possible.

If you are interested in just the N:s, you could change the first parenthesis to (N), or you could add a conditional like printf("%5i %s\n", length($2)+1, $1) if ($1 eq "N"). Note that the string comparison needs eq, because == compares numerically and treats every letter as 0. Similarly, if you want only hits where there are repeats (more than one occurrence), you can say \1+ instead of \1* or add a conditional like ... if length($2) >= 1.
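
For comparison, the same chunk-by-chunk walk can be sketched in plain Bash without a regex engine. This is an illustration only: run_lengths is a hypothetical name, and the loop will be much slower than Perl on long sequences.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print "<count> <char>" for every run of identical
# characters in the first argument, in order of appearance.
run_lengths() {
    local s=$1 i ch prev='' count=0
    for (( i = 0; i < ${#s}; i++ )); do
        ch=${s:i:1}
        if [[ $ch == "$prev" ]]; then
            # still inside the current run
            count=$(( count + 1 ))
        else
            # a new run starts: report the previous one, if any
            if (( count > 0 )); then
                printf '%d %s\n' "$count" "$prev"
            fi
            prev=$ch
            count=1
        fi
    done
    # report the final run
    if (( count > 0 )); then
        printf '%d %s\n' "$count" "$prev"
    fi
}

run_lengths 'AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA'
```

To keep only the N runs, the output can be filtered afterwards, for example with grep ' N$'.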

score 1 · Answer 5 · answered Aug 31 '17 at 11:39

As you asked for a sed solution, you can use this one if your chains of repeated characters are no longer than 9 characters and if your string doesn't contain any semicolons:

sed 's/$/;NNNNNNNNN0123456789/;:a;s/\(N\+\)\([^;]*;\1.\{9\}\)\(.\)\(.*\)/\2\3\4\n\3/;ta;s/[^\n]*\n//'

score 1 · Answer 6 · answered Aug 31 '17 at 12:07

1

try these two:

First one

sed 's/[^N]/ /g' file | awk '{for(i=1;i<=NF;i++){print $i":"length($i)}}'

Second One

cat file | tr -c 'N' ' ' | awk '{for(i=1;i<=NF;i++){print $i":"length($i)}}'

answered Aug 31 '17 at 12:07

Abhinandan prasad


score 0 · Answer 7 · answered Aug 31 '17 at 12:01

0

Short GNU awk approach:

str='AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA'

awk -v FPAT='N+' '{for(i=1;i<=NF;i++) print $i,length($i)}' <<< "$str"

The output:

NNNNN 5
NNNNNNNN 8
NNNNNNN 7

answered Aug 31 '17 at 12:01

RomanPerekhrest


score -1 · Answer 8 · answered Aug 31 '17 at 11:19

-1

You could use a regular expression for this.

This solution comes from the following question:

Count occurrences of a char in a string using Bash

needle=","
var="text,text,text,text"

number_of_occurrences=$(grep -o "$needle" <<< "$var" | wc -l)

As you can see, we get the number of occurrences of "$needle" quite easily by counting the lines that grep -o prints with wc -l.

You can loop over it to fit your needs.

answered Aug 31 '17 at 11:19

HexaCrop


1

@b.nota I guarantee if you had included the expected output in your question then Kevin wouldn't have misunderstood your requirements and wasted his time posting a solution to a different problem than the one you have (and got himself downvoted for his troubles - not by me). – Ed Morton Aug 31 '17 at 13:25
I didn't downvote either, I appreciate all the help here! – benn Aug 31 '17 at 13:43
