I'm looking for a regexp to remove one-character words. I don't mind whether the solution uses perl, awk, sed or bash built-ins.

Test case:

$ echo "a b c d e f g h ijkl m n opqrst u v" | $COMMAND

Desired output:

ijkl opqrst

What I've tried so far:

$ echo "a b c d e f g h ijkl m n opqrst u v" | sed 's/ . //g'
acegijkln opqrstv

I'm guessing that:

  • the a isn't removed because there is no white space before it
  • the c remains because once the b has been removed, there is no more whitespace before it
  • and so on...
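
The guess above is right: sed scans left to right and matches never overlap, so the space consumed at the end of one match cannot start the next one. A small sketch that makes this visible by wrapping each match in brackets instead of deleting it (`mark_matches` is a hypothetical helper name, not a standard tool; `&` in the replacement stands for the matched text):

```shell
# Hypothetical helper: wrap every match of the regex in $2 in brackets.
# Assumes the regex contains no '/' characters.
mark_matches() {
  printf '%s\n' "$1" | sed "s/$2/[&]/g"
}

mark_matches "a b c d e f g h ijkl m n opqrst u v" ' . '
# prints: a[ b ]c[ d ]e[ f ]g[ h ]ijkl[ m ]n opqrst[ u ]v
```

Every bracketed match swallows the space on both sides, which is why a, c, e and so on survive.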

Attempt #2:

$ echo "a b c d e f g h ijkl m n opqrst u v" | sed 's/\w.\w//g'
     s v

Here I don't get at all what's happening.
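
For what it's worth, `\w.\w` reads as: a word character, then any character at all (a space included), then another word character, so it eats three-character chunks such as `a b` or `ijk`. A sketch that brackets each match instead of deleting it; `[[:alnum:]_]` is used here as a roughly equivalent, portable spelling of GNU sed's `\w`, and `show_matches` is a hypothetical helper name:

```shell
# Hypothetical helper: bracket every "word char, any char, word char" match.
show_matches() {
  sed 's/[[:alnum:]_].[[:alnum:]_]/[&]/g'
}

echo "a b c d e f g h ijkl m n opqrst u v" | show_matches
# prints: [a b] [c d] [e f] [g h] [ijk][l m] [n o][pqr]s[t u] v
```

Deleting the bracketed parts leaves five spaces, the `s`, and ` v`, which is the output of attempt #2.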

Any help + explanations are welcome, I want to learn.

nicoco
  • Possible duplicate of [Learning Regular Expressions](http://stackoverflow.com/questions/4736/learning-regular-expressions) – Biffen Jan 17 '17 at 09:41
  • Hum I disagree, there is a specific question in my post. – nicoco Jan 17 '17 at 09:42
  • `.` matches *any* character, including space. `\w` matches word characters, so I don't see what you're attempting with `\w.\w`. – Biffen Jan 17 '17 at 09:42
  • ‘*there is a specific question in my post*’ Would you mind pointing it out? I can't find it. – Biffen Jan 17 '17 at 09:43
  • As the post title states, I'm trying to remove one-character words. I thought that `.` didn't match whitespace, that's a good start, thanks. – nicoco Jan 17 '17 at 09:43
  • @nicoco, You can try with word boundary (`\b`). – sat Jan 17 '17 at 09:44
  • @nicoco That's not a *question*, though. IMHO, this looks like a give-me-the-code post. – Biffen Jan 17 '17 at 09:44
  • @sat Thanks, I didn't know this one. – nicoco Jan 17 '17 at 09:49
  • @Biffen: I disagree. The OP has written a solution to their problem and is asking for help to get it working. – Borodin Jan 17 '17 at 10:53

8 Answers

7

You have to use the word boundary \b, or \< and \>, which match the empty string at the beginning and end of a word, respectively.

echo "a b c d e f g h ijkl m n opqrst u v" | sed 's/\b\w\b \?//g'

(OR)

echo "a b c d e f g h ijkl m n opqrst u v" | sed 's/\<.\> \?//g'
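
With the test input, both commands leave a trailing space, because the line ends in one-character words and the space before them is kept. A minimal sketch that trims it with a second expression, assuming GNU sed (`\b`, `\w` and `\?` are GNU extensions); `drop_short_words` is a hypothetical name:

```shell
# Hypothetical helper: remove one-character words, then trim the
# trailing space that the first substitution can leave behind.
# Assumes GNU sed.
drop_short_words() {
  sed -e 's/\b\w\b \?//g' -e 's/ $//'
}

echo "a b c d e f g h ijkl m n opqrst u v" | drop_short_words
# prints: ijkl opqrst
```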
sat
4

You could simply use grep:

echo "a b c d e f g h ijkl m n opqrst u v"  | grep -o '[a-z]\{2,\}'

where the regex matches any run of at least 2 lowercase letters.

The -o option in grep prints only the matched text, not the entire line.
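
Note that `grep -o` prints each match on its own line. If the result should stay on one line, here is a sketch that joins the matches back together with `paste` (the `keep_long_words` name is hypothetical; `-o` is not in POSIX, but both GNU and BSD grep provide it):

```shell
# Hypothetical helper: keep runs of 2+ lowercase letters and
# join the per-line matches with single spaces.
keep_long_words() {
  grep -o '[a-z]\{2,\}' | paste -sd ' ' -
}

echo "a b c d e f g h ijkl m n opqrst u v" | keep_long_words
# prints: ijkl opqrst
```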

oliv
2

Awk may not be the most efficient way to do this; I'm answering only because the question is tagged awk. This uses its length() string function, which is POSIX compliant, so there are no portability issues.

echo "a b c d e f g h ijkl m n opqrst u v" | \
  awk '{for(i=1;i<=NF;i++) {if (length($i)>1) { printf "%s ", $i }} }'
ijkl opqrst
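
The loop above leaves a trailing space and no final newline. A variant sketch that collects the kept fields and prints them once, under the same assumption that words are whitespace-separated fields (`long_fields` is a hypothetical name):

```shell
# Hypothetical helper: print only the fields longer than one character.
long_fields() {
  awk '{
    out = ""
    for (i = 1; i <= NF; i++)
      if (length($i) > 1)
        out = (out == "" ? $i : out " " $i)   # join kept fields with single spaces
    print out
  }'
}

echo "a b c d e f g h ijkl m n opqrst u v" | long_fields
# prints: ijkl opqrst
```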
Inian
  • You shouldn't say `awk is not the most efficient way....`, just that the specific awk code you posted is not the most efficient way. – Ed Morton Jan 17 '17 at 16:43
  • @EdMorton: As you say Ed! May be you can correct my logic or provide a more efficient way for this. – Inian Jan 17 '17 at 16:44
  • I added the awk equivalent of the accepted answer under that answer, see http://stackoverflow.com/a/41693834/1745001 – Ed Morton Jan 17 '17 at 16:45
  • @EdMorton: Well! everybody can't answer in the same _class_ as of Ed Morton in `awk` – Inian Jan 17 '17 at 16:46
1

Perl solution: just filter elements on length

echo "a b c d e f g h ijkl m n opqrst u v" | perl -lanE \
  'say join " ", grep {length($_) > 1} @F'
Arunesh Singh
1

Just for fun, another option: translate spaces to newlines and look for lines with at least 2 characters

$ echo "a b c d e f g h ijkl m n opqrst u v" | tr ' ' '\n' | grep .. | paste -sd " " -
ijkl opqrst
glenn jackman
0

I'm not familiar with the Linux tools, so this is somewhat of a guess, but I think a regex you want is

(?:\s\w\b|\b\w\s)

The (?:...) group is PCRE syntax that sed does not understand, so use it with perl, like

$ echo "a b c d e f g h ijkl m n opqrst u v" | perl -pe 's/(?:\s\w\b|\b\w\s)//g'

This replaces any single character that is either preceded by or followed by a space, together with that space, with nothing.

Check the regex out here at regex101.

SamWhan
0

Another in awk. A non-space ([^ ]) is considered a word. Feel free to replace it with your definition of a word.

$ awk '{while(sub(/^[^ ] | [^ ]$/,"")||sub(/ [^ ] /," "));}1'

Using sub, it replaces each [a space][non-space][a space] triple with a single space, and removes a single character plus its adjacent space from the beginning and end of the record. The while loop repeats until neither substitution matches. To test it:

$ echo "a b c d e f g h ijkl m n opqrst u v"|awk '{while(sub(/^[^ ] | [^ ]$/,"")||sub(/ [^ ] /," "));}1'
ijkl opqrst
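
The same loop-until-nothing-changes idea can be sketched in sed with a label and the `t` command, which branches only if a substitution has succeeded since the last branch was taken. This is an illustration of the technique rather than part of the answer above; `strip_singles` is a hypothetical name:

```shell
# Hypothetical helper: repeat three substitutions until none matches.
# 1) single character + space at the start of the line
# 2) space + single character at the end of the line
# 3) space + single character + space in the middle, replaced by one space
strip_singles() {
  sed -e ':again' \
      -e 's/^[^ ] //' -e 't again' \
      -e 's/ [^ ]$//' -e 't again' \
      -e 's/ [^ ] / /' -e 't again'
}

echo "a b c d e f g h ijkl m n opqrst u v" | strip_singles
# prints: ijkl opqrst
```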
James Brown
-1
echo "a b c d e f g h ijkl m n opqrst u v"  | grep -wo "\b[a-z][a-z]\+\b"
Vicky