Regular expression to ONLY print duplicate letters in string

Question

The following sed snippet removes the duplicate letters from the string, so that each letter is printed only once:

> echo "remove duplicate letters from string" | sed ':;s/\(.\)\(.*\)\1/\1\2/;t'
> remov duplicatsfng

What regular expression would print ONLY the duplicated letters? Letters that appear once (e.g. v and d) should be discarded, and letters that appear more than once should not be repeated in the output.

The result should be :

> remo lits

Possible duplicate of [Regular expression to match any character being repeated more than 10 times](http://stackoverflow.com/questions/1660694/regular-expression-to-match-any-character-being-repeated-more-than-10-times) — Isaac, Jan 10 '17 at 21:50
Why not just iterate through the string and count the number of times each character appears? — , Jan 10 '17 at 21:51
Note that you used more than a regular expression: a substitution and conditional jump. — choroba, Jan 10 '17 at 21:57
@Cyrus I beg to differ, that answer slightly tweaked yields (in javascript) `"aabbacdefgghijklmnoopq".match(/(.)\1{1,}/g) == ["aa", "bb", "gg", "oo"]` — Isaac, Jan 10 '17 at 21:58
@Isaac: But it returns `null` for "abcab", even though `a` and `b` are duplicate. — choroba, Jan 10 '17 at 22:02
if you are interested in only the duplicate letters and not the order: `echo "remove duplicate letters from string" | grep -oi '[a-z]' | awk 'seen[$0]++ == 1'` — Sundeep, Jan 11 '17 at 02:43

Casimir et Hippolyte · Accepted Answer · 2017-01-11T12:19:26.557

4

You can try to do that with GNU sed:

sed -E ':a;s/(.)\1*(.+)\1+/\1\1\2/;ta;s/(((.)\3)*)./\1/g;s/.(.)/\1/g;'

Details, for the string "remove duplicate letters from string":

:a;s/(.)\1*(.+)\1+/\1\1\2/;ta; : this part gathers each letter that is repeated later in the string (with at least one character in between) into two consecutive letters and drops the later occurrences. Result:

rreemmoov  duplliicattssfng

s/(((.)\3)*)./\1/g; : this one removes the letters left on their own (those that are not part of a pair). Result:

rreemmoo  lliittss

s/.(.)/\1/g : this one reduces each pair of consecutive letters to a single letter. Result:

remo lits
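To inspect the three stages described above one at a time, the one-liner can be split into separate sed calls. This is only a sketch: it assumes GNU sed (the back-references inside the pattern need it), and the helper names pair_dups, drop_singles and halve_pairs are made up for illustration.

```shell
# Assumes GNU sed: -E with back-references inside the pattern.
input='remove duplicate letters from string'

# Hypothetical helper names; each wraps one stage of the one-liner above.
pair_dups()    { sed -E ':a;s/(.)\1*(.+)\1+/\1\1\2/;ta'; }   # stage 1: gather repeats into pairs
drop_singles() { sed -E 's/(((.)\3)*)./\1/g'; }              # stage 2: drop unpaired letters
halve_pairs()  { sed -E 's/.(.)/\1/g'; }                     # stage 3: keep one letter per pair

# Print the pattern space after each stage.
printf '%s\n' "$input" | pair_dups
printf '%s\n' "$input" | pair_dups | drop_singles
printf '%s\n' "$input" | pair_dups | drop_singles | halve_pairs
```

Piping the stages gives the same final line as the single script, because each stage only rewrites the current line.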

With perl:

In a more or less similar way you can write something like this:

perl -pe's/(.)(?!.*\1)//g;while(s/(.)(.*)\1+/\1\2/g){}'

It's shorter, but it's probably more efficient to use this second version, which uses the autosplit switch and a hash to count the occurrences of each character:

perl -F -ane'$h{$_}++ for(@F);for(@F){if($h{$_}>1){$h{$_}=1;print}}'
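For readers without perl at hand, the same counting idea (count every character first, then print each repeated character once, in order of first appearance) can be sketched in POSIX awk. The function name dup_letters and the array name count are hypothetical, and the sketch assumes single-byte characters, since some awk implementations index strings by byte.

```shell
# Hypothetical helper: print each character that occurs more than once,
# once only, in order of first appearance. Assumes single-byte characters.
dup_letters() {
  printf '%s\n' "$1" | awk '{
    n = length($0)
    # First pass: count every character of the line.
    for (i = 1; i <= n; i++) count[substr($0, i, 1)]++
    # Second pass: emit a repeated character the first time it is met,
    # then zero its count so later occurrences are skipped.
    out = ""
    for (i = 1; i <= n; i++) {
      c = substr($0, i, 1)
      if (count[c] > 1) { out = out c; count[c] = 0 }
    }
    print out
    split("", count)   # clear the counts before the next line
  }'
}

dup_letters "remove duplicate letters from string"
```

Because characters are compared as plain strings rather than used as patterns, regex metacharacters in the input need no special handling.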

edited Jan 11 '17 at 12:19

answered Jan 10 '17 at 22:08

Casimir et Hippolyte


Impressive. It only works with GNU Sed, unfortunately; BSD Sed, with _extended_ regexes (`-E`), doesn't support back-references such as `\1` (in the regex itself, as opposed to in the replacement string). – mklement0 Jan 10 '17 at 22:36
@mklement0: The OP seems to use GNU sed, but you can do the same with perl: `perl -pe's/(.)(?!.*\1)//g;while(s/(.)(.*)\1+/\1\2/g){}'` – Casimir et Hippolyte Jan 10 '17 at 22:54
@mklement0: I will wait a little before adding the perl one liner, because I don't have the confirmation that this is what the OP wants, and I think there's probably smarter or more efficient ways to do it with perl. – Casimir et Hippolyte Jan 10 '17 at 23:07
Wow, I am impressed by the above original solution - although it looks over-engineered ! The result is correct ... I will edit the question to clarify further the fact that only letters appearing more than once are needed ! Perl solution is welcomed too ! – Adrian S. Jan 10 '17 at 23:43
The last perl snippet using the hash can be rewritten as `perl -F -e '$h{$_}++ for (@F); for(keys %h){print if $h{$_} > 1 }'` to avoid the `$h{$_}=1` assignment – Adrian S. Jan 12 '17 at 21:07
@AdrianS.: if order doesn't matter, you can indeed write that, you can even shorten it to `perl -F -ane'$h{$_}++for@F;--$h{$_}&&print for keys%h'` – Casimir et Hippolyte Jan 12 '17 at 22:09
@AdrianS.: or `perl -F -anE'$h{$_}++for@F;say grep{$h{$_}>1}keys %h'` or `perl -nE'/(.)(?=.*\1)(??{$h{$1}=1})/;say keys%h'` – Casimir et Hippolyte Jan 13 '17 at 03:08

Ed Morton · Answer 2 · 2017-01-11T04:38:53.290

1

This will work with any awk on any system:

$ echo "remove duplicate letters from string" |
awk '{ for (i=1;i<=length($0);) { chr=substr($0,i,1); if (gsub(chr,"") > 1) printf "%c", chr } print "" }'
remo lits

edited Jan 11 '17 at 04:38

answered Jan 11 '17 at 04:25

Ed Morton


NeronLeVelu · Answer 3 · 2017-01-11T09:56:51.393

With POSIX sed (and GNU sed):

echo "remove duplicate letters from string" | sed -e ':a' -e 's/\(\(.\).*\2.*\)\2/\1/;ta' -e "G;:b" -e '/^\(.\)\(.*\)\1\(.*\n.*\)/s//\1\2\3\1/;tb' -e 's/.//;/^\n/b e' -e 'b b' -e ':e' -e 's/.//'

Concept:

limit the occurrences of each letter to at most two: ':a' -e 's/\(\(.\).*\2.*\)\2/\1/;ta'
append a newline (at the end) using the hold buffer: G
test whether the first char appears twice (before the newline); if so, copy it after the newline and remove its second occurrence: :b" -e '/^\(.\)\(.*\)\1\(.*\n.*\)/s//\1\2\3\1/;tb
remove the first char: s/.//
if the first char is now the newline, go to the end of the script, remove the newline (and print): /^\n/b e' ... -e ':e'
if not, loop: -e 'b b'

score 0 · Answer 4 · answered Jan 11 '17 at 12:32

This might work for you (GNU sed):

sed -r ':a;s/\n*(([^\n]).*)\2/\n\1/;ta;s/\n(.)[^\n]*/\1/g' file

While removing duplicate characters, prefix each character concerned with a unique marker, i.e. \n. Then remove every character not associated with a marker (and the markers too), leaving only the characters that had duplicates.
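To see where the markers end up, the marking loop can be run on its own with each newline marker turned into a visible character. This is a sketch, not part of the answer: it assumes GNU sed, show_markers is a hypothetical helper name, and '|' is an arbitrary stand-in that must not occur in the input.

```shell
# Assumes GNU sed (-r, and \n usable in the pattern and the replacement).
# show_markers is a hypothetical helper: it runs only the marking loop,
# then replaces each newline marker with '|' so the result fits on one line.
show_markers() {
  sed -r ':a;s/\n*(([^\n]).*)\2/\n\1/;ta;s/\n/|/g'
}

printf '%s\n' 'remove duplicate letters from string' | show_markers
```

Each '|' in the output sits directly in front of a character that had duplicates; the final substitution in the answer keeps exactly those characters.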

answered Jan 11 '17 at 12:32

potong


I really like this one, it seems very logical ! – Adrian S. Jan 11 '17 at 17:38
