The following sed snippet removes the repeated occurrences of each character in the string, so every character is printed only once:

> echo "remove duplicate letters from string" | sed ':;s/\(.\)\(.*\)\1/\1\2/;t'
> remov duplicatsfng

What would be the regular expression to print ONLY the duplicated letters? Letters that appear once should be discarded (e.g. v and d), and each letter that appears more than once should be printed a single time.

The result should be:

> remo lits
Adrian S.
  • Possible duplicate of [Regular expression to match any character being repeated more than 10 times](http://stackoverflow.com/questions/1660694/regular-expression-to-match-any-character-being-repeated-more-than-10-times) – Isaac Jan 10 '17 at 21:50
  • Why not just iterate through the string and count the number of times each character appears? –  Jan 10 '17 at 21:51
  • @Isaac: That's no duplicate. – Cyrus Jan 10 '17 at 21:52
  • Note that you used more than a regular expression: a substitution and conditional jump. – choroba Jan 10 '17 at 21:57
  • @Cyrus I beg to differ, that answer slightly tweaked yields (in javascript) `"aabbacdefgghijklmnoopq".match(/(.)\1{1,}/g) == ["aa", "bb", "gg", "oo"]` – Isaac Jan 10 '17 at 21:58
  • @Isaac: But it returns `null` for "abcab", even though `a` and `b` are duplicate. – choroba Jan 10 '17 at 22:02
  • @choroba good point, i stand corrected – Isaac Jan 10 '17 at 22:03
  • if you are interested in only the duplicate letters and not the order: `echo "remove duplicate letters from string" | grep -oi '[a-z]' | awk 'seen[$0]++ == 1'` – Sundeep Jan 11 '17 at 02:43
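
The counting approach suggested in the comments above can be sketched as follows, keeping the order of first appearance. This is only an illustration: `dup_chars` is a hypothetical helper name, and the sketch assumes single-byte characters, since some awk implementations count bytes rather than characters.

```shell
# dup_chars is a hypothetical helper name, not taken from the thread.
# Reads lines on stdin; prints each character that occurs more than once,
# a single time, in order of first appearance.
dup_chars() {
  awk '{
    n = length($0)
    split("", count)                      # reset the counts for each line
    for (i = 1; i <= n; i++)              # pass 1: count every character
      count[substr($0, i, 1)]++
    out = ""
    for (i = 1; i <= n; i++) {            # pass 2: emit repeated characters once
      c = substr($0, i, 1)
      if (count[c] > 1) { out = out c; count[c] = 0 }
    }
    print out
  }'
}

echo "remove duplicate letters from string" | dup_chars
```

For the sample string this prints `remo lits`.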

4 Answers

You can try to do that with GNU sed:

sed -E ':a;s/(.)\1*(.+)\1+/\1\1\2/;ta;s/(((.)\3)*)./\1/g;s/.(.)/\1/g;'

Details, for the string "remove duplicate letters from string":

:a;s/(.)\1*(.+)\1+/\1\1\2/;ta; : this part gathers every letter that occurs more than once into a single pair of adjacent copies, placed where the letter first occurs. Result:

rreemmoov  duplliicattssfng

s/(((.)\3)*)./\1/g; : this one removes the letters that are left unpaired. Result:

rreemmoo  lliittss

s/.(.)/\1/g : this one collapses each pair into a single letter. Result:

remo lits
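
To watch the three stages one at a time, the command can be split into separate sed calls. This is a sketch that assumes GNU sed (back-references with `-E`); the variable names `s`, `stage1`, `stage2` and `stage3` are only illustrative, and only the sample string from the question is exercised here.

```shell
# Run each stage of the GNU sed command separately on the sample string.
s='remove duplicate letters from string'

# Stage 1: gather every repeated character into one adjacent pair.
stage1=$(printf '%s\n' "$s" | sed -E ':a;s/(.)\1*(.+)\1+/\1\1\2/;ta')

# Stage 2: drop the characters that were left unpaired.
stage2=$(printf '%s\n' "$stage1" | sed -E 's/(((.)\3)*)./\1/g')

# Stage 3: collapse each pair into a single character.
stage3=$(printf '%s\n' "$stage2" | sed -E 's/.(.)/\1/g')

# Brackets make the doubled spaces visible.
printf '[%s]\n' "$stage1" "$stage2" "$stage3"
```

The three bracketed lines correspond to the intermediate results listed above.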

With perl:

In a more or less similar way you can write something like this:

perl -pe's/(.)(?!.*\1)//g;while(s/(.)(.*)\1+/\1\2/g){}'

It's shorter, but it's probably more efficient to use this second version, which uses the autosplit switch and a hash to count the number of occurrences of each character:

perl -F -ane'$h{$_}++ for(@F);for(@F){if($h{$_}>1){$h{$_}=1;print}}'
Casimir et Hippolyte
  • Impressive. It only works with GNU Sed, unfortunately; BSD Sed, with _extended_ regexes (`-E`), doesn't support back-references such as `\1` (in the regex itself, as opposed to in the replacement string). – mklement0 Jan 10 '17 at 22:36
  • @mklement0: The OP seems to use GNU sed, but you can do the same with perl: `perl -pe's/(.)(?!.*\1)//g;while(s/(.)(.*)\1+/\1\2/g){}'` – Casimir et Hippolyte Jan 10 '17 at 22:54
  • @mklement0: I will wait a little before adding the perl one liner, because I don't have the confirmation that this is what the OP wants, and I think there's probably smarter or more efficient ways to do it with perl. – Casimir et Hippolyte Jan 10 '17 at 23:07
  • Wow, I am impressed by the above original solution - although it looks over-engineered ! The result is correct ... I will edit the question to clarify further the fact that only letters appearing more than once are needed ! Perl solution is welcomed too ! – Adrian S. Jan 10 '17 at 23:43
  • The last perl snippet using the hash can be rewritten as `perl -F -e '$h{$_}++ for (@F); for(keys %h){print if $h{$_} > 1 }'` to avoid that `$h{$_}=1` – Adrian S. Jan 12 '17 at 21:07
  • @AdrianS.: if order doesn't matter, you can indeed write that, you can even shorten it to `perl -F -ane'$h{$_}++for@F;--$h{$_}&&print for keys%h'` – Casimir et Hippolyte Jan 12 '17 at 22:09
  • @AdrianS.: or `perl -F -anE'$h{$_}++for@F;say grep{$h{$_}>1}keys %h'` or `perl -nE'/(.)(?=.*\1)(??{$h{$1}=1})/;say keys%h'` – Casimir et Hippolyte Jan 13 '17 at 03:08

This will work with any awk on any system:

$ echo "remove duplicate letters from string" |
awk '{ for (i=1;i<=length($0);) { chr=substr($0,i,1); if (gsub(chr,"") > 1) printf "%c", chr } print "" }'
remo lits
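
One caveat: awk's `gsub()` interprets its first argument as a regular expression, so a character such as `.` or `[` in the input would not be matched literally. A possible variant, sketched below under the hypothetical name `dup_chars_literal`, keeps the same idea of consuming one distinct character per iteration but removes occurrences with `index()` and `substr()` instead. It assumes single-byte characters.

```shell
# dup_chars_literal is a hypothetical name, not part of the answer above.
# Same consume-one-character-at-a-time loop, but occurrences are found with
# index() and cut out with substr(), so nothing is treated as a regex.
dup_chars_literal() {
  awk '{
    s = $0; out = ""
    while (length(s) > 0) {
      chr = substr(s, 1, 1)
      n = 0; kept = ""
      while ((p = index(s, chr)) > 0) {   # next literal occurrence of chr
        n++
        kept = kept substr(s, 1, p - 1)   # keep the text before it
        s = substr(s, p + 1)              # continue after it
      }
      s = kept s                          # the string with chr removed
      if (n > 1) out = out chr
    }
    print out
  }'
}

echo "x.y.x" | dup_chars_literal
```

For the input `x.y.x` this prints `x.`, and for the sample string from the question it prints `remo lits`.
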
Ed Morton

With POSIX sed (this also works with GNU sed):

echo "remove duplicate letters from string" | sed -e ':a' -e 's/\(\(.\).*\2.*\)\2/\1/;ta' -e "G;:b" -e '/^\(.\)\(.*\)\1\(.*\n.*\)/s//\1\2\3\1/;tb' -e 's/.//;/^\n/b e' -e 'b b' -e ':e' -e 's/.//' 

Concept:

  • limit each letter to at most two occurrences: ':a' -e 's/\(\(.\).*\2.*\)\2/\1/;ta'
  • add a newline at the end of the pattern space by appending the (empty) hold space: G
  • test whether the first character occurs a second time before the newline; if so, remove that second occurrence and append the character after the newline (on a second line): ':b' -e '/^\(.\)\(.*\)\1\(.*\n.*\)/s//\1\2\3\1/;tb'
  • remove the first character: s/.//
  • if the first character is now the newline, jump to the end of the script and remove the newline (the result is then printed): /^\n/b e ... -e ':e' -e 's/.//'
  • otherwise, loop: -e 'b b'
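
The steps above can also be laid out as a multi-line sed script with one comment per step. This is the same command, only reformatted; the function name `dup_chars_posix_sed` is a hypothetical wrapper added for illustration.

```shell
# dup_chars_posix_sed is a hypothetical wrapper name around the same script.
dup_chars_posix_sed() {
  sed '
# Step 1: delete third and later occurrences, leaving each character at most twice.
:a
s/\(\(.\).*\2.*\)\2/\1/
ta
# Step 2: append a newline (the hold space is empty); the output collects after it.
G
:b
# Step 3: if the first character occurs again before the newline,
# delete that second occurrence and append the character after the newline.
/^\(.\)\(.*\)\1\(.*\n.*\)/s//\1\2\3\1/
tb
# Step 4: drop the first character, which is now fully handled.
s/.//
# Step 5: stop once the newline comes first; otherwise handle the next character.
/^\n/b e
b b
:e
# Step 6: remove the separator newline; sed then prints what is left.
s/.//
'
}

echo "remove duplicate letters from string" | dup_chars_posix_sed
```

For the sample string this prints `remo lits`.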
NeronLeVelu

This might work for you (GNU sed):

sed -r ':a;s/\n*(([^\n]).*)\2/\n\1/;ta;s/\n(.)[^\n]*/\1/g' file

While removing duplicate characters, prefix each affected character with a unique marker, i.e. \n. Then remove all characters not associated with a marker (and the markers too), leaving only those characters that had duplicates.

potong