Match strings with certain number of unique characters in bash

Question

I need to delete all the strings in a file that have fewer than 4 unique characters in them.

Input:

hello
cabby
pabba
lokka
lappa
coool
apple

Expected Output:

hello
cabby
lokka
apple

I tried to come up with a regular expression for this, but I can't see how it would even be possible. I did find a sed command that seems promising: it deletes all duplicate characters. However, I am not sure how to make sed test whether the result has at least 4 characters and, if it does, output the original string:

sed ':1;s/\(\(.\).*\)\2/\1/g;t'

I doubt you can do that with `sed`. Even with a PCRE regex in `grep`, the pure regex solution looks unwieldy, see `grep -vP '^(?:(.)\1*(?:(?!\1)(.)(?:\1|\2)*(?:(?!\1|\2|3)(.)(?:\1|\2|\3)*)?)?)?$' file` ([demo](https://ideone.com/FJc9MM)). Use `awk`. – Wiktor Stribiżew, Mar 05 '20 at 19:31
See [What should I do when someone answers my question?](https://stackoverflow.com/help/someone-answers). – pjh, Mar 05 '20 at 21:26

anubhava · Accepted Answer · 2020-03-07T03:52:58.757

4

Using GNU awk:

awk 'BEGIN { FS = "" } {
  unq = 0; delete seen
  for (i = 1; i <= NF; i++) if (!seen[$i]++) unq++
} unq > 3' file

hello
cabby
lokka
apple

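Setting FS="" makes GNU awk split each record into one field per character, so NF is the line length and $i is the i-th character. POSIX leaves the behavior of an empty FS unspecified, which is why this answer is labeled as GNU awk.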
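
For awk implementations that do not split on an empty FS, a minimal portable sketch might walk each line with substr() instead. The function name min_unique and the awk variable min are made up for illustration; this is not the answer author's code, and it assumes single-byte characters.

```shell
# Hypothetical helper: keep lines from stdin that contain at least
# `min` distinct characters (default 4).
min_unique() {
  awk -v min="${1:-4}" '{
    unq = 0
    split("", seen)                  # portable way to empty the array
    for (i = 1; i <= length($0); i++)
      if (!seen[substr($0, i, 1)]++) unq++
    if (unq >= min) print
  }'
}

printf '%s\n' hello cabby pabba lokka lappa coool apple | min_unique 4
```

Passing a different number, for example min_unique 8, changes the threshold without touching the awk program.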

edited Mar 07 '20 at 03:52

answered Mar 05 '20 at 19:15

Sorry, I wasn't online for a bit. I like your solution using awk! It is more human-readable than `sed` – Zyansheep Mar 06 '20 at 12:10
Is sed slower? I was looking for a solution with sed because I thought it was faster, but if awk is faster, I will pick that instead. In the end, I used this: `sed 'h;:1;s/\(\(.\).*\)\2/\1/g;t1;/^.\{1,7\}$/d;x'`. It filters out words with 7 or fewer unique characters – Zyansheep Mar 06 '20 at 15:42
Wait no, you are right, in this situation awk is about 4.2 seconds faster on my file. – Zyansheep Mar 06 '20 at 15:49
Yes, I tested both the `sed` and `awk` solutions on a `36988000`-line file. `awk` took about 1 min to finish, but `sed` has been running for the last 8 min and is still consuming a lot of CPU :( – anubhava Mar 06 '20 at 15:57
Maybe mention https://stackoverflow.com/a/31135987/3220113. I do agree that your answer is more readable (and faster). – Walter A Mar 06 '20 at 22:47
Thanks Walter. I made it clear that it requires GNU awk – anubhava Mar 07 '20 at 03:53
If the file has other columns besides this data, how can we apply this to the first column only? – ersan Mar 17 '21 at 11:35
@ersan: You may use: `awk 'BEGIN{FS=""} {s=""; unq=0; delete seen; for (i=1; i<=NF && $i !~ /^[[:blank:]]$/; i++) {s = s $i; if (!seen[$i]++) unq++}} unq > 3 {print s}' file` to work on the first field only. – anubhava Mar 17 '21 at 14:24
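
A rough alternative for the multi-column case, assuming the columns are separated by whitespace: let awk split fields as usual and count distinct characters in one chosen field. The names field_min_unique, col and min are hypothetical, and unlike the command above this sketch skips leading blanks because it relies on default field splitting.

```shell
# Hypothetical helper: print field `col` (default 1) of each line when that
# field has at least `min` distinct characters (default 4).
field_min_unique() {
  awk -v min="${1:-4}" -v col="${2:-1}" '{
    word = $col
    unq = 0
    split("", seen)                  # empty the array for each line
    for (i = 1; i <= length(word); i++)
      if (!seen[substr(word, i, 1)]++) unq++
    if (unq >= min) print word
  }'
}

printf '%s\n' 'hello 1' 'pabba 2' 'apple 3' | field_min_unique 4 1
```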

score 1 · Answer 2 · answered Mar 05 '20 at 22:21

You tried sed ':1;s/\(\(.\).*\)\2/\1/g;t'. Replace t with t1 so that the substitution loops back to label 1 until no duplicates remain.
Before your command, copy the current line into the hold space.
After your command, swap any line that still has at least 4 characters with the original line from the hold space.
Finally, print only lines with at least four characters.

echo 'hello
cabby
pabba
lokka
lappa
coool
apple' | sed -nE 'h;:1;s/(.)(.*)\1/\1\2/g;t1;/.{4}/x;/.{4}/p'
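
The same script can be spread over several lines with a comment per step, which may make the mapping to the four instructions easier to follow. This is only an annotated sketch of the one-liner above; the wrapper name keep_min4_unique is invented for illustration.

```shell
# keep_min4_unique is a hypothetical wrapper name, used only for this sketch.
keep_min4_unique() {
  sed -nE '
    # Save the untouched line in the hold space.
    h
    :1
    # Delete a later repeat of some character, keeping its first occurrence.
    s/(.)(.*)\1/\1\2/g
    # Loop back to label 1 for as long as a substitution happened.
    t1
    # Four or more distinct characters remain: swap the original line back in.
    /.{4}/x
    # Print only when the pattern space holds at least four characters.
    /.{4}/p
  '
}

printf '%s\n' hello cabby pabba lokka lappa coool apple | keep_min4_unique
```

Lines with fewer than four distinct characters never get swapped back, so the pattern space still holds the shortened string and the final test fails for them.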
