3

I need to delete all the strings in a file that have fewer than 4 unique characters in them.

Input:

hello
cabby
pabba
lokka
lappa
coool
apple

Expected Output:

hello
cabby
lokka
apple

I tried to come up with a regular expression for this, but I can't see how it would even be possible. I did find a sed command that seems promising, since it deletes all duplicate characters: sed ':1;s/\(\(.\).*\)\2/\1/g;t' However, I am not sure how to make sed test whether at least 4 characters remain and, if so, output the original string.

anubhava
Zyansheep
  • I doubt you can do that with `sed`. Even with a PCRE regex in `grep`, the pure regex solution looks unwieldy, see `grep -vP '^(?:(.)\1*(?:(?!\1)(.)(?:\1|\2)*(?:(?!\1|\2)(.)(?:\1|\2|\3)*)?)?)?$' file` ([demo](https://ideone.com/FJc9MM)). Use `awk`. – Wiktor Stribiżew Mar 05 '20 at 19:31
  • See [What should I do when someone answers my question?](https://stackoverflow.com/help/someone-answers). – pjh Mar 05 '20 at 21:26

2 Answers

4

Using GNU awk:

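awk 'BEGIN { FS = "" } {
  unq = 0; delete seen                                 # reset the counter and the set for each line
  for (i = 1; i <= NF; i++) if (!seen[$i]++) unq++     # count each character the first time it is seen
} unq > 3' file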

hello
cabby
lokka
apple

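Setting FS="" makes GNU awk treat each character as a separate field, so NF is the line length and $i is the i-th character. POSIX leaves the behavior of an empty FS unspecified, so this relies on GNU awk.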
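For an awk where an empty FS is not guaranteed to split on every character, the same count can be sketched with substr(). This is a variant, not part of the original answer: the function name `filter_unique` and the `min` variable are made up for illustration, and split("", seen) is used as a portable way to empty the array. In an awk that is not locale-aware, substr() may step through bytes rather than characters.

```shell
# Hypothetical helper: keep the lines of stdin that contain at least $1 distinct characters.
filter_unique() {
  awk -v min="$1" '
    {
      split("", seen)                  # empty the set of characters seen on the previous line
      unq = 0
      n = length($0)
      for (i = 1; i <= n; i++)         # walk the line one character at a time
        if (!seen[substr($0, i, 1)]++) unq++
    }
    unq >= min                         # pattern-only rule: print the line when it qualifies
  '
}

printf '%s\n' hello cabby pabba lokka lappa coool apple | filter_unique 4
# prints:
# hello
# cabby
# lokka
# apple
```

Passing the threshold as an argument means the same helper covers other limits without editing the awk program.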

anubhava
  • Sorry, I wasn't online for a bit. I like your solution using awk! It is more human-readable than `sed` – Zyansheep Mar 06 '20 at 12:10
  • Is sed slower? I was looking for a solution with sed because I thought it was faster, but if awk is faster, I will pick that instead. In the end, I used this: `sed 'h;:1;s/\(\(.\).*\)\2/\1/g;t1;/^.\{1,7\}$/d;x'` It filters out words with 7 or fewer unique characters – Zyansheep Mar 06 '20 at 15:42
  • 1
    Wait no, you are right, in this situation awk is about 4.2 seconds faster on my file. – Zyansheep Mar 06 '20 at 15:49
  • 1
    Yes, I tested both the `sed` and `awk` solutions on a `36988000` line file. `awk` took about 1 min to finish, but `sed` has been running for the last 8 min and is still consuming a lot of CPU :( – anubhava Mar 06 '20 at 15:57
  • 1
    Maybe mention https://stackoverflow.com/a/31135987/3220113. I do agree that your answer is more readable (and faster). – Walter A Mar 06 '20 at 22:47
  • 2
    Thanks Walter. I made it clear that it requires GNU awk – anubhava Mar 07 '20 at 03:53
  • If there are other columns besides this data set, what should we do when we select it as the first column? – ersan Mar 17 '21 at 11:35
  • 1
    @ersan: You may use: `awk 'BEGIN{FS=""} {s=""; unq=0; delete seen; for (i=1; i<=NF && $i !~ /^[[:blank:]]$/; i++) {s = s $i; if (!seen[$i]++) unq++}} unq > 3 {print s}' file` to work on the first field only. – anubhava Mar 17 '21 at 14:24
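A first-column variant can also be sketched with awk's default field splitting, so that `$1` is the word to test and the remaining columns are ignored. This is only an illustration under assumptions: `first_field_unique` is a made-up name, columns are assumed to be separated by blanks, and unlike the command in the comment above, leading blanks on a line are skipped rather than treated as an empty first field.

```shell
# Hypothetical helper: print the first blank-separated field of each line
# when that field contains at least $1 distinct characters.
first_field_unique() {
  awk -v min="$1" '
    {
      split("", seen); unq = 0         # reset the set and the counter for each line
      n = length($1)
      for (i = 1; i <= n; i++)         # count distinct characters in the first field only
        if (!seen[substr($1, i, 1)]++) unq++
    }
    unq >= min { print $1 }
  '
}

printf '%s\n' 'hello 1 x' 'pabba 2 y' 'apple 3 z' | first_field_unique 4
# prints:
# hello
# apple
```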
1

Your attempt, sed ':1;s/\(\(.\).*\)\2/\1/g;t', is close. Replace t with t1 so that the branch jumps back to the label :1 and the substitution repeats until no duplicates are left.
Before that loop, copy the current line into the hold space (h).
After the loop, if at least 4 characters are left, swap the original line back in (x).
Finally, print only lines that have at least 4 characters (p).

echo 'hello
cabby
pabba
lokka
lappa
coool
apple' | sed -nE 'h;:1;s/(.)(.*)\1/\1\2/g;t1;/.{4}/x;/.{4}/p'
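
The same command can be laid out with one sed command per line, which makes each step easier to follow. This is a commented sketch of the one-liner above, not a different method; the wrapper name `keep_4_unique` is made up for illustration.

```shell
# Hypothetical wrapper: keep the lines of stdin that contain at least 4 distinct characters.
keep_4_unique() {
  sed -nE '
    # save the original line in the hold space
    h
    :1
    # remove the later copy of any repeated character, then retry until nothing changes
    s/(.)(.*)\1/\1\2/g
    t1
    # the pattern space now holds only distinct characters; with 4 or more, swap the original back
    /.{4}/x
    # print if 4 or more characters are present (a swapped-in original always qualifies)
    /.{4}/p
  '
}

printf '%s\n' hello cabby pabba lokka lappa coool apple | keep_4_unique
# prints:
# hello
# cabby
# lokka
# apple
```

Lines that fail the first /.{4}/ test are never swapped, so the pattern space still holds the short deduplicated string and the final test rejects them.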
Walter A