
I would like to replace each run of a repeated character with its repeat count followed by the character, using a regex. For example, convert "aaad" to "3xad" and "bCCCCC" to "b5xC". I want to do this in sed or awk.

I know I can match a run with (.)\1+ or capture the whole run with ((.)\2+). But how can I obtain the repeat count and insert that value back into the string in regex, sed, or awk?

Wang
  • Counting in sed is extremely cumbersome; see, for example, https://www.gnu.org/software/sed/manual/sed.html#wc-_002dc – Benjamin W. Sep 19 '18 at 12:22
  • @BenjaminW. Thanks, I never knew counting was that difficult in sed. Can any other program do this easily? – Wang Sep 19 '18 at 12:24
  • Perl comes to mind. – Benjamin W. Sep 19 '18 at 12:25
  • awk's fine too. – revo Sep 19 '18 at 12:25
  • @revo can you provide an answer based on awk? Thanks! – Wang Sep 19 '18 at 12:28
  • Wang - when you post your missing [mcve], make sure to include examples where the same character occurs in mixed case (e.g. foocCbar) and where the pattern whose repetitions you need to count is multi-character (or change the word "pattern" to "character" or "letter" or whatever it is you really mean). Add cases where multiple repeated "patterns" occur in a single string too, and show how non-letters are to be handled. It'd also help if we understood what you're trying to do with this - I mean, if the inputs `bbcc`, `bb2xc`, `2xbcc` and `2xb2xc` all produce the same output of `2xb2xc`, then what's the use? – Ed Morton Sep 19 '18 at 19:20

4 Answers


Perl to the rescue!

perl -pe 's/((.)\2+)/length($1) . "x$2"/ge'
  • -p reads the input line by line and prints each line after processing
  • s/// is a substitution, similar to the one in sed
  • /e evaluates the replacement as Perl code, and /g replaces every match on the line

e.g.

aaadbCCCCCxx -> 3xadb5xC2xx
choroba

In GNU awk:

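$ echo aaadbCCCCCxx | awk -F '' '{
    b=""                                 # reset the output buffer for each line
    for(i=1;i<=NF;i+=RLENGTH) {          # empty FS: each character is a field (GNU awk)
        c=$i                             # first character of the current run
        match(substr($0,i),c"+")         # RLENGTH = length of the run of c
        b=b (RLENGTH>1?RLENGTH "x":"") c # prefix the count only for runs longer than 1
    }
    print b
}'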
3xadb5xC2xx

If regex metacharacters in the input should be treated as literal characters, as noted in the comments, one could try to detect and escape them (the solution below is only a rough sketch):

$ echo \\\\\\..**aaadbCCCCC++xx |
awk -F '' '{
    b=""                               # reset the output buffer for each line
    for(i=1;i<=NF;i+=RLENGTH) {
        c=$i                               
        # print i,c                        # for debugging
        if(c~/[*.\\]/)                     # if c is a regex metachar (not complete)
            c="\\"c                        # escape it
        match(substr($0,i),c"+")           # find all c:s
        b=b (RLENGTH>1?RLENGTH "x":"") $i  # buffer to b
    }
    print b
}'
3x\2x.2x*3xadb5xC2x+2xx
James Brown
  • That would misbehave if `$i` was a regexp metachar such as `.` or an escape char `\\`. It's unclear if the OP can have non-alphabetic chars in their input or not though so idk if it's a real issue or not. – Ed Morton Sep 19 '18 at 19:40
  • ... AND it supports regex... ;D – James Brown Sep 19 '18 at 21:16
  • wrt the 2nd script - escaping characters turns some of them into the chars they represent when escaped rather than literal, e.g. `t` -> `\t` = a tab character. Try `printf 'foo\tbar\n' | awk '{c="t"; c="\\"c; print match($0,c)}'`. Instead, you need to put every char except `^` inside square brackets, and escape only `^`. See the answers at https://stackoverflow.com/q/29613304/1745001 which do this job for sed. – Ed Morton Sep 20 '18 at 15:22
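
A minimal sketch of the bracket-expression idea from the last comment, not the answerer's own code. Assumptions: the function name `rle_literal` is made up for illustration; characters are picked with substr() instead of `-F ''` so it does not depend on GNU awk; and, beyond what the comment says, a backslash is also escaped rather than bracketed, because awk still treats `\` as an escape inside brackets.

```shell
# Hypothetical helper: run-length "NxC" encoding that treats every input
# character literally, even if it is a regex metacharacter.
rle_literal() {
  awk '{
    out = ""
    n = length($0)
    for (pos = 1; pos <= n; pos += len) {
      c = substr($0, pos, 1)
      if (c == "^")       re = "\\^"       # the caret is escaped, not bracketed
      else if (c == "\\") re = "\\\\"      # assumption: the backslash is escaped too
      else                re = "[" c "]"   # any other char is literal inside brackets
      match(substr($0, pos), "^(" re ")+") # anchored: only the run starting at pos
      len = (RLENGTH < 1 ? 1 : RLENGTH)    # defensive: always advance by at least 1
      out = out (len > 1 ? len "x" : "") c
    }
    print out
  }'
}
```

For example, `printf '%s\n' '\\\..**aaadbCCCCC++xx^^t' | rle_literal` should print `3x\2x.2x*3xadb5xC2x+2xx2x^t`. Like the scripts above, it counts single-byte characters only.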

Just for fun.

With sed it is cumbersome but doable. Note that this example relies on GNU sed:

parse.sed

/(.)\1+/ {
  : nextrepetition
  /((.)\2+)/ s//\n\1\n/             # delimit the repetition with new-lines
  h                                 # and store the delimited version
  s/^[^\n]*\n|\n[^\n]*$//g          # now remove prefix and suffix
  b charcount                       # count repetitions
  : aftercharcount                  # return here after counting
  G                                 # append the new-line delimited version

  # Reorganize pattern space to the desired format
  s/^([^\n]+)\n([^\n]*)\n(.)[^\n]+\n/\2\1x\3/

  # Run again if more repetitions exist
  /(.)\1+/b nextrepetition
}

b

# Adapted from the wc -c example in the sed manual
# Ref: https://www.gnu.org/software/sed/manual/sed.html#wc-_002dc
: charcount

s/./a/g

# Do the carry.  The t's and b's are not necessary,
# but they do speed up the thing
t a
: a;  s/aaaaaaaaaa/b/g; t b; b done
: b;  s/bbbbbbbbbb/c/g; t c; b done
: c;  s/cccccccccc/d/g; t d; b done
: d;  s/dddddddddd/e/g; t e; b done
: e;  s/eeeeeeeeee/f/g; t f; b done
: f;  s/ffffffffff/g/g; t g; b done
: g;  s/gggggggggg/h/g; t h; b done
: h;  s/hhhhhhhhhh//g

: done

# Convert the counter back to decimal

: loop
/a/! s/[b-h]*/&0/
s/aaaaaaaaa/9/
s/aaaaaaaa/8/
s/aaaaaaa/7/
s/aaaaaa/6/
s/aaaaa/5/
s/aaaa/4/
s/aaa/3/
s/aa/2/
s/a/1/

y/bcdefgh/abcdefg/
/[a-h]/ b loop

b aftercharcount

Run it like this:

sed -Ef parse.sed infile

With an infile like this:

aaad
daaadaaa
fsdfjs
bCCCCC
aaadaaa

The output is:

3xad
d3xad3xa
fsdfjs
b5xC
3xad3xa
Thor
  • `echo 'xx' | sed -Ef parse.sed` seems to send that into an infinite loop. – Ed Morton Sep 19 '18 at 19:27
  • @EdMorton: this stems from the choice of repetition indicator (x) and that my solution looks at the whole string after each replacement. Either choose a different indicator or modify the solution to only look at the rest of the string – Thor Sep 19 '18 at 23:31
  • That was complete and utter dumb luck, I had no idea there was anything special about an `x`! Are there any other characters or strings that aren't allowed to appear in the input? I couldn't modify that script if I wanted to - way too complicated for my sed abilities! – Ed Morton Sep 19 '18 at 23:35
  • @EdMorton: No. Thinking about the repetition indicator issue, I realized that repeat counts made up of repeated digits, e.g. 11, 22, etc., would also cause erroneous output. The latter solution I suggested above seems to be the correct course of action; it would, however, complicate things further :-). I may take a stab at it when I have more procrastination time – Thor Sep 19 '18 at 23:44
  • You will be well and truly mentally exercised at the end of this endeavour :-). I'm looking forward to the OP telling us that her "patterns" aren't necessarily single characters and can actually be multi-character strings... that will make things a whole lot more interesting. – Ed Morton Sep 20 '18 at 00:20

I was hoping we'd have an MCVE by now but we don't, so what the heck - here is my best guess at what you're trying to do:

$ cat tst.awk
{
    out = ""
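    for (pos=1; pos<=length($0); pos+=reps) {
        char = substr($0,pos,1)                               # first character of the current run
        for (reps=1; char == substr($0,pos+reps,1); reps++);  # count how long the run is
        out = out (reps > 1 ? reps "x" : "") char             # prefix the count only for runs longer than 1
    }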
    }
    print out
}

$ awk -f tst.awk file
3xad
d3xad3xa
fsdfjs
b5xC
3xad3xa

The above was run against the sample input that @Thor kindly provided:

$ cat file
aaad
daaadaaa
fsdfjs
bCCCCC
aaadaaa

The above will work for any input characters using any awk in any shell on any UNIX box. If you need to make it case-insensitive just throw a tolower() around each side of the comparison in the innermost for loop. If you need it to work on multi-character strings then you'll have to tell us how to identify where the substrings you're interested in start/end.
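
A sketch of the case-insensitive variant described above, with tolower() wrapped around each side of the comparison. The function name `rle_ci` is made up for illustration, and printing each run in the case of its first character is an assumption, since the question does not say which case to keep.

```shell
# Hypothetical helper: same loop as tst.awk, but "c" and "C" count as one run.
rle_ci() {
  awk '{
    out = ""
    for (pos = 1; pos <= length($0); pos += reps) {
      char = substr($0, pos, 1)
      # Compare lowercased copies; past the end of the line substr() returns ""
      for (reps = 1; tolower(char) == tolower(substr($0, pos + reps, 1)); reps++)
        ;
      # Assumption: the run is printed using the case of its first character
      out = out (reps > 1 ? reps "x" : "") char
    }
    print out
  }'
}
```

With this, `echo foocCbar | rle_ci` should print `f2xo2xcbar`, whereas the case-sensitive script prints `f2xocCbar`.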

Ed Morton