Remove one-character words

Question

I'm looking for a regexp to remove one-character words. I don't mind whether the solution uses perl, awk, sed or bash built-ins.

Test case:

$ echo "a b c d e f g h ijkl m n opqrst u v" | $COMMAND

Desired output:

ijkl opqrst

What I've tried so far:

$ echo "a b c d e f g h ijkl m n opqrst u v" | sed 's/ . //g'
acegijkln opqrstv

I'm guessing that:

the a isn't removed because there is no white space before it
the c remains because once the b has been removed, there is no more whitespace before it
and so on...

Attempt #2:

$ echo "a b c d e f g h ijkl m n opqrst u v" | sed 's/\w.\w//g'
     s v

Here I don't understand at all what's happening.

Any help + explanations are welcome, I want to learn.

Possible duplicate of [Learning Regular Expressions](http://stackoverflow.com/questions/4736/learning-regular-expressions) — Biffen, Jan 17 '17 at 09:41
`.` matches *any* character, including space. `\w` matches word characters, so I don't see what you're attempting with `\w.\w`. — Biffen, Jan 17 '17 at 09:42
‘*there is a specific question in my post*’ Would you mind pointing it out? I can't find it. — Biffen, Jan 17 '17 at 09:43
As the post title states, I'm trying to remove one-character words. I thought that `.` didn't match whitespace, that's a good start, thanks. — nicoco, Jan 17 '17 at 09:43
@nicoco That's not a *question*, though. IMHO, this looks like a give-me-the-code post. — Biffen, Jan 17 '17 at 09:44
@Biffen: I disagree. The OP has written a solution to their problem and is asking for help to get it working. — Borodin, Jan 17 '17 at 10:53
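
Both results in the question follow from how `s///g` works: it finds non-overlapping matches from left to right, and every character covered by one match is unavailable to the next. A small sketch that marks the matches instead of deleting them makes this visible. The input is shortened for readability, and `\w` assumes GNU sed.

```shell
# Attempt 1: ' . ' needs a space on both sides, and the space consumed by
# one match can no longer start the next one, so every other word survives.
attempt1=$(echo "a b c d e" | sed 's/ . /_/g')
echo "$attempt1"   # a_c_e

# Attempt 2: '\w.\w' is word-char, any char, word-char, so it swallows
# "a b" and "c d" whole. & in the replacement stands for the matched text.
attempt2=$(echo "a b c d" | sed 's/\w.\w/[&]/g')
echo "$attempt2"   # [a b] [c d]
```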

sat · Accepted Answer · 2017-01-17T09:55:55.380

7

You have to use a word boundary: \b, or \< and \>, which match the empty string at the beginning and end of a word, respectively.

echo "a b c d e f g h ijkl m n opqrst u v" | sed 's/\b\w\b \?//g'

(OR)

echo "a b c d e f g h ijkl m n opqrst u v" | sed 's/\<.\> \?//g'
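
On the test input, this substitution keeps the space after `opqrst`, so the result ends with one trailing space. A minimal sketch, assuming GNU sed (for `\b`, `\w` and `\?`), that strips it with a second expression:

```shell
input="a b c d e f g h ijkl m n opqrst u v"

# First expression: drop each one-character word plus one following space.
# Second expression: strip any spaces left behind at the end of the line.
cleaned=$(echo "$input" | sed 's/\b\w\b \?//g; s/ *$//')
echo "$cleaned"   # ijkl opqrst
```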

edited Jan 17 '17 at 09:55

answered Jan 17 '17 at 09:49


It leaves a lot of white spaces before the "long" words, but I can work with that. Thanks! – nicoco Jan 17 '17 at 09:51
@nicoco You can use `s/\b\w\b ?//g` to remove the whitespace as well. – Dada Jan 17 '17 at 09:52
Be very careful with `\b`: what you have will clobber things like "will-o'-the-wisp" and "Build-A-Bear". – ThisSuitIsBlackNot Jan 17 '17 at 15:39
Or the same solution with GNU awk: `awk '{gsub(/\<.\> ?/,"")}1'`. – Ed Morton Jan 17 '17 at 16:44
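
The `\b` caveat raised in the comments can be reproduced, and splitting on whitespace instead of word boundaries avoids it. A sketch in POSIX shell using only built-ins for the filtering; the function name `remove_short_words` is made up for illustration, and the first command assumes GNU sed.

```shell
# \b treats the hyphens as word boundaries, so the inner "A" is removed.
clobbered=$(echo "Build-A-Bear" | sed 's/\b\w\b \?//g')
echo "$clobbered"   # Build--Bear

# Hypothetical helper: keep only whitespace-delimited words of 2+ characters.
# The body runs in a subshell so that set -f does not leak to the caller.
remove_short_words() (
  set -f                      # no filename expansion while word-splitting
  out=""
  for word in $1; do          # unquoted on purpose: split on whitespace
    case $word in
      ??*) out="${out:+$out }$word" ;;   # pattern matches two or more characters
    esac
  done
  printf '%s\n' "$out"
)

kept=$(remove_short_words "a Build-A-Bear b will-o'-the-wisp c")
echo "$kept"   # Build-A-Bear will-o'-the-wisp
```

Here a "word" is anything between runs of whitespace, which is a different definition from the `\w`-based one used by the sed command.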

oliv · Answer 2 · 2017-01-17T10:35:17.253

4

You could simply use grep:

echo "a b c d e f g h ijkl m n opqrst u v"  | grep -o '[a-z]\{2,\}'

where the regex matches any run of at least 2 lowercase letters.

The -o option makes grep print only the matched text, one match per line, instead of the entire line.

edited Jan 17 '17 at 10:35

answered Jan 17 '17 at 09:53


You could use `grep -E` so you wouldn't need those pesky backslashes. – tripleee Jan 17 '17 at 10:43
It should be noted that this separates all matches with a newline, which is not exactly the same as the desired output as written in the question. This may or may not be a problem, depending on the circumstances. – Toby Speight Jan 17 '17 at 10:46
In that case, pipe into `| paste -sd " "` – glenn jackman Jan 17 '17 at 14:06
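
Putting the three comments together gives a one-liner that also produces the single-line output asked for in the question. A sketch, assuming a grep that supports `-o` (GNU and BSD grep do; the option is not in POSIX):

```shell
# -E: extended regex, so {2,} needs no backslashes
# -o: print each match on its own line
# paste -s joins those lines; -d ' ' uses a space as the separator
joined=$(echo "a b c d e f g h ijkl m n opqrst u v" |
  grep -Eo '[a-z]{2,}' | paste -sd ' ' -)
echo "$joined"   # ijkl opqrst
```

The trailing `-` tells paste to read standard input explicitly, which some non-GNU implementations require.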

Inian · Answer 3 · 2017-01-17T10:43:48.147

2

Awk is not the most efficient way to do this, but since the question is tagged awk, here is a solution using its length() string function. It is POSIX compliant, so portability is not an issue.

echo "a b c d e f g h ijkl m n opqrst u v" | \
  awk '{for(i=1;i<=NF;i++) {if (length($i)>1) { printf "%s ", $i }} }'
ijkl opqrst

edited Jan 17 '17 at 10:43

answered Jan 17 '17 at 10:37


You shouldn't say `awk is not the most efficient way....`, just that the specific awk code you posted is not the most efficient way. – Ed Morton Jan 17 '17 at 16:43
@EdMorton: As you say Ed! May be you can correct my logic or provide a more efficient way for this. – Inian Jan 17 '17 at 16:44
I added the awk equivalent of the accepted answer under that answer, see http://stackoverflow.com/a/41693834/1745001 – Ed Morton Jan 17 '17 at 16:45
@EdMorton: Well! everybody can't answer in the same _class_ as of Ed Morton in `awk` – Inian Jan 17 '17 at 16:46
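
The printf loop in this answer emits a trailing space and no final newline. A variant of the same length() idea, sketched here with standard awk features only, builds the line first and prints it once:

```shell
result=$(echo "a b c d e f g h ijkl m n opqrst u v" |
  awk '{
    out = ""
    for (i = 1; i <= NF; i++)
      if (length($i) > 1)
        out = (out == "" ? "" : out " ") $i   # separator only between words
    print out                                 # print adds the newline
  }')
echo "$result"   # ijkl opqrst
```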

score 1 · Answer 4 · answered Jan 17 '17 at 11:01

1

Perl solution: just filter elements on length

echo "a b c d e f g h ijkl m n opqrst u v" | perl -lanE \
  'say join " ", grep {length($_) > 1} @F'

answered Jan 17 '17 at 11:01

Arunesh Singh


If you want to be more terse, you can omit the default variable: `grep {length > 1} @F` – glenn jackman Jan 17 '17 at 14:07

score 1 · Answer 5 · answered Jan 17 '17 at 14:09

1

Just for fun, another option: translate spaces to newlines and look for lines with at least 2 characters

$ echo "a b c d e f g h ijkl m n opqrst u v" | tr ' ' '\n' | grep .. | paste -sd " "
ijkl opqrst

answered Jan 17 '17 at 14:09

glenn jackman


score 0 · Answer 6 · answered Jan 17 '17 at 11:15

0

I'm not familiar with any Linux-based tools, so this is somewhat of a guess, but I think the (or a) regex you want is

(?:\s\w\b|\b\w\s)

like

$ echo "a b c d e f g h ijkl m n opqrst u v" | sed -E 's/(\s\w\b|\b\w\s)//g'

This replaces any single-character word, together with the space before or after it, with nothing.

Check the regex out here at regex101.

answered Jan 17 '17 at 11:15

SamWhan


sed -r 's/(\s\w\b|\b\w\s)//g' – mug896 Jan 31 '17 at 08:12

James Brown · Answer 7 · 2017-01-17T12:44:33.187

Another in awk. A non-space ([^ ]) is considered a word. Feel free to replace it with your definition of a word.

$ awk '{while(sub(/^[^ ] | [^ ]$/,"")||sub(/ [^ ] /," "));}1'

Using sub, it replaces each [a space][non-space][a space] triple with a single space, and removes a single character plus its adjacent space at the beginning or end of the record. It's in a while loop, so it keeps going until there are no hits left. To test it:

$ echo "a b c d e f g h ijkl m n opqrst u v"|awk '{while(sub(/^[^ ] | [^ ]$/,"")||sub(/ [^ ] /," "));}1'
ijkl opqrst
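
The loop is needed because matches cannot overlap: the trailing space of one match is consumed, so a single global pass skips every other one-character word. A small sketch comparing one gsub pass with the loop from this answer (the input is shortened for readability):

```shell
# One global pass: " a " and " c " match, but the space before "b" was
# already consumed by the " a " match, so "b" survives.
single=$(echo "xx a b c yy" | awk '{ gsub(/ [^ ] /, " ") } 1')
echo "$single"   # xx b yy

# Repeating sub until nothing matches removes the leftovers too.
looped=$(echo "xx a b c yy" |
  awk '{ while (sub(/^[^ ] | [^ ]$/, "") || sub(/ [^ ] /, " ")); } 1')
echo "$looped"   # xx yy
```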

Vicky · Answer 8 · 2017-01-17T10:28:15.653

-1

echo "a b c d e f g h ijkl m n opqrst u v"  | grep -wo "\b[a-z][a-z]\+\b"

edited Jan 17 '17 at 10:28

answered Jan 17 '17 at 10:22


With `-w` you don't need the `\b` anchors. – tripleee Jan 17 '17 at 10:45
