sed: remove whole words containing a character class

Question

I'd like to remove every word that contains a non-alphabetic character from a text file, e.g.

"ok 0bad ba1d bad3 4bad4 5bad5bad5"

should become

"ok"

I've tried using

echo "ok 0bad ba1d bad3 4bad4 5bad5bad5" | sed 's/\b[a-zA-Z]*[^a-zA-Z]\+[a-zA-Z]*\b/ /g'

Is it non-alpha you want removing, or is it numeric? What was wrong with your attempt? — Tom Fenech, Aug 06 '14 at 11:19
All non-alpha, not just numeric. It produced a wrong answer. — dimid, Aug 06 '14 at 11:23
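
One likely reason the attempt misfires (my reading, not something the asker confirmed): a space is itself a non-alphabetic character, so `[^a-zA-Z]\+` can match the gap between two words and drag the good word "ok" into the match. A minimal sketch, assuming GNU sed:

```shell
# The asker's original command (\b and \+ are GNU extensions in sed).
attempt=$(echo "ok 0bad ba1d bad3 4bad4 5bad5bad5" |
  sed 's/\b[a-zA-Z]*[^a-zA-Z]\+[a-zA-Z]*\b/ /g')

# A match can start at "ok", continue through the space via [^a-zA-Z]\+,
# and so the leading "ok" is replaced along with what follows it.
printf '[%s]\n' "$attempt"
```

The brackets in the output only make leading and trailing blanks visible.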

jaybee · Answer 1 · 2014-08-07T06:44:16.027

The following sed command does the job:

sed 's/[[:space:]]*[[:alpha:]]*[^[:space:][:alpha:]][^[:space:]]*//g'

It removes all words containing at least one non-alphabetic character. POSIX character classes such as [:alpha:] are preferable to [a-zA-Z] because, given a suitable locale, they won't treat the French name "François", for instance, as faulty (i.e. as containing a non-alphabetic character).

Explanation

We remove every match that starts with any number of whitespace characters, followed by any number (possibly zero) of alphabetic characters, then at least one character that is neither whitespace nor alphabetic, and finally everything up to the end of the word (i.e. until the next whitespace character). Note that you may want to swap [:space:] for [:blank:]; see this page for a detailed explanation of the difference between these two POSIX classes.

Test

$ echo "ok 0bad ba1d bad3 4bad4 5bad5bad5" | sed 's/[[:space:]]*[[:alpha:]]*[^[:space:][:alpha:]][^[:space:]]*//g'
ok
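
A small sketch of the [:blank:] variant mentioned in the explanation, wrapped in a function whose name (`strip_nonalpha_words`) is made up for illustration. It also shows one edge case worth knowing about: when the rejected word comes first on the line, the blank in front of the next word is left behind.

```shell
# Hypothetical helper; same regex as in the answer with [:blank:]
# swapped in for [:space:].
strip_nonalpha_words() {
  sed 's/[[:blank:]]*[[:alpha:]]*[^[:blank:][:alpha:]][^[:blank:]]*//g'
}

first=$(echo "ok 0bad ba1d bad3 4bad4 5bad5bad5" | strip_nonalpha_words)
# The rejected word is first here, so the blank before "ok" survives.
second=$(echo "0bad ok" | strip_nonalpha_words)

printf '[%s]\n[%s]\n' "$first" "$second"   # prints [ok] then [ ok]
```

If the leading blank matters, a second expression such as `s/^[[:blank:]]*//` could be appended to the same sed call.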

+1, I was trying to work out something along these lines myself. Your command in perl: `perl -pe 's/\s*[[:alpha:]]*[^\s[:alpha:]]\S*//g'` — Tom Fenech, Aug 06 '14 at 12:39

anubhava · Accepted Answer · 2014-08-06T14:45:57.097

3

Using awk:

s="ok 0bad ba1d bad3 4bad4 5bad5bad5"
awk '{ofs=""; for (i=1; i<=NF; i++) if ($i ~ /^[[:alpha:]]+$/)
         {printf "%s%s", ofs, $i; ofs=OFS} print ""}' <<< "$s"
ok

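This awk command loops over every field (word) and prints it only if it matches the regex /^[[:alpha:]]+$/, i.e. if it consists solely of alphabetic characters. The variable ofs is empty before the first printed word and is set to OFS afterwards, so a separator is written before every kept word except the first and no trailing space is left. The final print "" ends each output line with a newline.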
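
A sketch of the same awk program run over several lines, to show what happens when a line has no purely alphabetic word. The sample input is invented, the separator variable is renamed to `sep` only to avoid confusion with the built-in OFS, and an awk that supports POSIX classes such as [[:alpha:]] is assumed.

```shell
# Three made-up lines; the middle one contains no acceptable word.
result=$(printf '%s\n' "ok 0bad" "4bad4" "good w0rds here" |
  awk '{
    sep = ""                          # nothing before the first kept word
    for (i = 1; i <= NF; i++)
      if ($i ~ /^[[:alpha:]]+$/) {
        printf "%s%s", sep, $i
        sep = OFS                     # later words get OFS in front
      }
    print ""                          # always terminate the output line
  }')
printf '%s\n' "$result"
```

The second input line yields an empty output line rather than being dropped, because `print ""` runs for every record.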

Using grep + tr together:

s="ok 0bad ba1d bad3 4bad4 5bad5bad5"
r=$(grep -o '[^ ]\+' <<< "$s"|grep '^[[:alpha:]]\+$'|tr '\n' ' ')
echo "$r"
ok

The first grep -o breaks the string into individual words, one per line. The second grep keeps only the words that consist entirely of alphabetic characters. Finally, tr translates each \n to a space.
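
Because tr also converts the final newline, the result above ends in a space. A possible variation, assuming GNU grep for `-o` and `\+`, swaps tr for `paste -s`, which joins the lines without a trailing separator:

```shell
s="ok 0bad ba1d bad3 4bad4 5bad5bad5"

# Stage 1: one word per line.
# Stage 2: keep lines that are alphabetic from start to end.
# Stage 3: paste -s joins the remaining lines with single spaces and,
#          unlike tr '\n' ' ', leaves no trailing space.
r=$(printf '%s\n' "$s" | grep -o '[^ ]\+' | grep '^[[:alpha:]]\+$' | paste -s -d ' ' -)
printf '[%s]\n' "$r"   # prints [ok]
```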

edited Aug 06 '14 at 14:45

answered Aug 06 '14 at 11:06

anubhava


Thanks. Could you add an explanation? – dimid Aug 06 '14 at 11:21
I was told by someone that I always needed parentheses to be portable when testing in `printf`. I do see you do `(i – Jotne Aug 06 '14 at 11:23
1

@Dimid: I added explanation to my answer. – anubhava Aug 06 '14 at 11:33
2

I repeat my comment written below: please use `[[:alpha:]]` instead of the buggy `[a-zA-Z]`, because if you replace in `s` your "ok" by "Öcalan", for instance, your script outputs an empty string. :( – jaybee Aug 06 '14 at 12:13
2

@Jotne - in some awks (e.g. OSX awk) `print ` will cause a syntax error depending on the condition in the ternary expression while `print ()` will always succeed. I don't know for sure if there are other situations in which an unparenthesized ternary expression will fail, nor do I know if just parenthesizing the condition part of the expression is enough to get around that. I personally find parenthesizing the whole expressions easier to read anyway and I know it works in all situations so I just do that. – Ed Morton Aug 06 '14 at 12:56
1

I actually work on OSX and tested above awk using default OSX version of awk. But yes for better compatibility it seems it will be better to use `(i – anubhava Aug 06 '14 at 12:58
@anubhava +1 for the solution, but you need to change your `printf`: if the last field on the line contains a non-alpha character, you won't get a newline at the end of the line and, depending on the contents of the preceding fields, you may be left with a trailing blank too. – Ed Morton Aug 06 '14 at 12:58
Thanks Ed. But if you notice in the input string last field already has a non-alpha character above. – anubhava Aug 06 '14 at 13:00
2

Yes it does and as I mentioned your solution is adding a trailing blank char and not providing a newline. You need `awk '{ofs=""; for (i=1; i<=NF; i++) if ($i ~ /^[[:alpha:]]+$/) {printf "%s%s", ofs, $i; ofs=OFS} print "" }'` or similar. – Ed Morton Aug 06 '14 at 13:02
1

Yes that's right, I also notice trailing space issue. Let me put up an edited version. – anubhava Aug 06 '14 at 13:04
IMHO adding it and then removing it is a much worse approach than just not adding it in the first place. – Ed Morton Aug 06 '14 at 14:31
Yes, I could have stored those fields in an array and printed them later, but that is definitely more code than this. And come to think of it, it just removes a single space from the end of the string. – anubhava Aug 06 '14 at 14:36
1

Did you see the suggestion I put in my comment? It just prints the space before each field except the first one rather than after each field, and it's briefer and IMHO clearer than what you currently have. – Ed Morton Aug 06 '14 at 14:42
1

Ah I totally missed it. Sorry let me just edit the answer using your nice suggestion. – anubhava Aug 06 '14 at 14:44

score 0 · Answer 3 · answered Aug 06 '14 at 11:48

If you're not concerned about losing different numbers of spaces between each word, you could use something like this in Perl:

perl -ane 'print join(" ", grep { !/[^[:alpha:]]/ } @F), "\n"'

The -a switch enables auto-split mode, which splits each line on runs of whitespace and stores the fields in the array @F. grep keeps only the elements of that array that contain no non-alphabetic characters. The resulting list is joined with a single space.

score 0 · Answer 4 · answered Aug 06 '14 at 16:38


This might work for you (GNU sed):

sed -r 's/\b([[:alpha:]]+\b ?)|\S+\b ?/\1/g;s/ $//' file

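This uses a back reference within alternation to save the required string: the first alternative captures a purely alphabetic word (plus an optional following space) in group 1, while the second matches any other word without capturing it. Replacing each match with \1 therefore keeps the alphabetic words and deletes the rest. The second command, s/ $//, strips a trailing space.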


potong


MONTYHS · Answer 5 · 2014-08-06T14:35:26.217

-1

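st="ok 0bad ba1d bad3 4bad4 5bad5bad5"
# $st is deliberately unquoted so the shell splits it into words
for word in $st; do
    # keep the word only if it is alphabetic from start to end
    if [[ $word =~ ^[a-zA-Z]+$ ]]; then
        echo "$word"
    fi
done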

edited Aug 06 '14 at 14:35

answered Aug 06 '14 at 11:19

MONTYHS


That's not how you assign variables in bash. Check [this answer](http://stackoverflow.com/a/8737671/1072112). – ghoti Aug 06 '14 at 13:42
by mistake i added $ and space – MONTYHS Aug 06 '14 at 14:36
