1

I am wondering if there is a simple Bash or AWK one-liner to count the repeated characters in each run separately.

For example considering this string:

AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA

Is it possible to get the number of Ns in the first repeat, the number of Ns in the second repeat, etc.?

Thanks!

Expected result: the length of each repeat, one per line.

anubhava
  • 761,203
  • 64
  • 569
  • 643
benn
  • 198
  • 1
  • 11
  • 2
    What efforts did you make? Post them even if it did not solve your problem – Inian Aug 31 '17 at 10:55
  • At a minimum at least add your expected output - all on one line, spaces or commas between, on separate lines, etc... – Ed Morton Aug 31 '17 at 12:51
  • I was satisfied with the first answer from anubhava, see comments under his answer. I added expected results, as you asked for. – benn Aug 31 '17 at 12:57
  • We're not looking for a description of the expected results (though it's fine to have that too), we're looking for the actual expected output given the input you posted. This site isn't just for you to get an answer to your question, it's a repository for others to look up their questions to find answers so it's important that a question be a complete one (see [ask]) to help everyone else in future. – Ed Morton Aug 31 '17 at 13:00

8 Answers

6

You can use awk to split fields on each run of characters that are not N, then print each non-empty field and its length:

s='AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA'

awk -F '[^N]+' '{for (i=1; i<=NF; i++) if ($i != "") print $i, length($i)}' <<< "$s"

NNNNN 5
NNNNNNNN 8
NNNNNNN 7

Another option is to use grep + awk:

grep -Eo 'N+' <<< "$s" | awk '{print $1, length($1)}'

And here is a pure Bash solution:

shopt -s extglob
while read -r line; do
    [[ -n $line ]] && echo "$line ${#line}"
done <<< "${s//+([!N])/$'\n'}"

NNNNN 5
NNNNNNNN 8
NNNNNNN 7

BASH solution details:

  1. It uses the extended glob pattern +([!N]) to match one or more non-N characters and replaces each such run with a line break, via "${s//+([!N])/$'\n'}".
  2. A while loop then iterates over each substring of N characters.
  3. Inside the loop, we print each string and its length.
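
The steps above can be sketched as a small function that prints only the lengths, one per line, which is the output the question asks for. The function name `run_lengths` is hypothetical and not part of the original answer; the sketch assumes Bash with `extglob` available and only looks for runs of N.

```bash
#!/usr/bin/env bash
# Hypothetical helper (name chosen for illustration only).
# Prints the length of each run of N in the string given as $1.
shopt -s extglob

run_lengths() {
    local s=$1 run
    # Replace every run of non-N characters with a newline,
    # then read the remaining N runs line by line.
    while IFS= read -r run; do
        if [[ -n $run ]]; then
            printf '%s\n' "${#run}"
        fi
    done <<< "${s//+([!N])/$'\n'}"
}

run_lengths 'AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA'
```

Using `if` rather than `[[ ... ]] && ...` keeps the function's exit status at zero when the last line read is empty, which matters if the script runs under `set -e`.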
anubhava
  • 761,203
  • 64
  • 569
  • 643
4

A simple solution:

echo "$string" | grep -oE "N+" | awk '{ print $0, length}'

NNNNN 5
NNNNNNNN 8
NNNNNNN 7

EDIT:
As per suggestion of @Ed-Morton: Changing -P to -E.
Man page of grep says -P is "highly experimental" functionality.
We don't need PCREs to use +, just EREs are sufficient.

Rahul Verma
  • 2,946
  • 14
  • 27
  • 2
    You don't need PCREs to use `+`, just EREs, so use `-E` instead of `-P` so your grep isn't relying on "highly experimental" (see the man page!) functionality. – Ed Morton Aug 31 '17 at 13:12
  • 1
    @EdMorton: Thanks Ed. Yeah I'll take care of that from next time. Let me edit too. And performance wise which is better according to you ? – Rahul Verma Aug 31 '17 at 13:35
  • 1
    PCREs use a very different algorithm/regexp engine from BREs and EREs to accommodate look ahead/behind/whatever and that engine is much slower even if you don't use any PCRE-specific features so BRE and ERE are faster than PCRE. See https://swtch.com/~rsc/regexp/regexp1.html for details. – Ed Morton Aug 31 '17 at 13:54
  • 1
    Okay. Yeah makes sense. (y). – Rahul Verma Aug 31 '17 at 13:55
3

With GNU awk for multi-char RS:

$ awk -v RS='N+' 'RT{print length(RT)}' file
5
8
7

$ awk -v RS='N+' 'RT{print RT, length(RT)}' file
NNNNN 5
NNNNNNNN 8
NNNNNNN 7
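
If GNU awk is not available, a similar result can probably be had with only POSIX awk features. The sketch below is not from the answer above: it swaps the RS/RT technique for a `match()` loop, and the wrapper name `n_run_lengths` is hypothetical. It reads the string from standard input.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper (name chosen for illustration only).
# Uses match(), RSTART and RLENGTH, which are standard awk features,
# instead of the GNU-specific regexp RS and RT.
n_run_lengths() {
    awk '{
        # Find the next run of N, print its length,
        # then drop everything up to the end of that run.
        while (match($0, /N+/)) {
            print RLENGTH
            $0 = substr($0, RSTART + RLENGTH)
        }
    }'
}

printf '%s\n' 'AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA' | n_run_lengths
```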
Ed Morton
  • 188,023
  • 17
  • 78
  • 185
  • Thanks for your help, but I don't get results from your codes. How to use file? – benn Aug 31 '17 at 13:46
  • `file` is just a file containing the input string shown in your question. You could use `echo 'AATGATGGAANNN...' | awk -v RS='N+' 'RT{print length(RT)}'` instead. As it says, though, you've got to be using GNU awk. – Ed Morton Aug 31 '17 at 13:59
  • 1
    This could be golfed to `$0=length(RT)` – Thor Aug 31 '17 at 15:07
2

Here's a Perl one-liner:

perl -ne 'while (m/(.)(\1*)/g) { printf "%5i %s\n", length($2)+1, $1 }' <<<AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA
2 A
1 T
1 G
1 A
1 T
2 G
2 A
5 N
1 G
1 A
1 T
1 A
1 G
2 A
1 C
1 G
1 A
1 T
8 N
1 G
1 A
1 T
2 A
1 T
1 G
1 A
7 N
1 T
1 A
1 G
1 A
1 C
1 T
1 G
1 A

The m/(.)(\1*)/ successively matches as many identical characters as possible, with the /g causing the matching to pick up again on the next iteration for as long as the string still contains something which we have not yet matched. So we are looping over the string in chunks of identical characters, and on each iteration, printing the first character as well as the length of the entire matched string.

The first pair of parentheses capture a character at the beginning of the (remaining unmatched) line, and \1 says to repeat this character. The * quantifier matches this as many times as possible.

If you are interested in just the Ns, you could change the first parenthesis to (N), or you could add a conditional like printf("%5i %s\n", length($2)+1, $1) if ($1 eq "N"). Note that the comparison needs the string operator eq; with ==, Perl compares numerically and every letter would match. Similarly, if you want only hits where there are repeats (more than one occurrence), you can say \1+ instead of \1* or add a conditional like ... if length($2) >= 1.
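
For readers who prefer to stay in the shell, the same chunk-by-chunk idea can be sketched in plain Bash. This is an illustration of the looping logic rather than a translation of the Perl regex; the function name `rle` is hypothetical, and the output is not padded the way the printf format above pads it.

```bash
#!/usr/bin/env bash
# Hypothetical helper (name chosen for illustration only).
# Prints "<count> <char>" for every run of identical characters in $1.
rle() {
    local s=$1 i c prev='' count=0
    for (( i = 0; i < ${#s}; i++ )); do
        c=${s:i:1}
        if [[ $c == "$prev" ]]; then
            # Same character as before: the current run grows.
            count=$(( count + 1 ))
        else
            # A new run starts: report the previous one, if any.
            if (( count > 0 )); then
                printf '%d %s\n' "$count" "$prev"
            fi
            prev=$c
            count=1
        fi
    done
    # Report the final run.
    if (( count > 0 )); then
        printf '%d %s\n' "$count" "$prev"
    fi
}

rle 'AANNNGA'
```

Piping the result through something like `awk '$2 == "N"'` would keep only the N runs.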

tripleee
  • 175,061
  • 34
  • 275
  • 318
1

As you asked for a sed solution, you can use this one if your chains of repeated characters are no longer than 9 characters and if your string doesn't contain any semicolons:

sed 's/$/;NNNNNNNNN0123456789/;:a;s/\(N\+\)\([^;]*;\1.\{9\}\)\(.\)\(.*\)/\2\3\4\n\3/;ta;s/[^\n]*\n//'

Johannes Riecken
  • 2,301
  • 16
  • 17
1

Try these two:

First one

sed 's/[^N]/ /g' file | awk '{for(i=1;i<=NF;i++){print $i":"length($i)}}'

Second One

cat file | tr -c 'N' ' ' | awk '{for(i=1;i<=NF;i++){print $i":"length($i)}}'
Abhinandan prasad
  • 1,009
  • 7
  • 13
0

Short GNU awk approach:

str='AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA'

awk -v FPAT='N+' '{for(i=1;i<=NF;i++) print $i,length($i)}' <<< "$str"

The output:

NNNNN 5
NNNNNNNN 8
NNNNNNN 7
RomanPerekhrest
  • 88,541
  • 4
  • 65
  • 105
-1

You could use a regular-expression approach.

This solution comes from the following question:

Count occurrences of a char in a string using Bash

needle=","
var="text,text,text,text"

number_of_occurrences=$(grep -o "$needle" <<< "$var" | wc -l)

As you can see, we get the number of occurrences of "$needle" pretty easily with the help of wc (word count), since grep -o prints each match on its own line.

You can run it in a loop to fit your needs.
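
To make the difference from the question's requirement concrete, here is a minimal sketch that contrasts the total count with the per-run counts on the string from the question. The variable names `total` and `per_run` are illustrative, and the per-run part borrows the `grep -oE 'N+'` idea from the other answers rather than looping.

```bash
#!/usr/bin/env bash
s='AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA'

# Total number of N characters: one match per line, counted by wc -l.
total=$(grep -o 'N' <<< "$s" | wc -l)
total=$(( total ))   # strip any padding some wc implementations add
echo "total: $total"

# Length of each run of N: one run per line, measured by awk.
per_run=$(grep -oE 'N+' <<< "$s" | awk '{ print length($0) }')
echo "$per_run"
```

The first number is the sum of the run lengths, so it cannot distinguish one long run from several short ones.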

HexaCrop
  • 3,863
  • 2
  • 23
  • 50
  • 1
    @b.nota I guarantee if you had included the expected output in your question then Kevin wouldn't have misunderstood your requirements and wasted his time posting a solution to a different problem than the one you have (and got himself downvoted for his troubles - not by me). – Ed Morton Aug 31 '17 at 13:25
  • I didn't downvote either, I appreciate all the help here! – benn Aug 31 '17 at 13:43