59

I am trying to validate some inputs by removing a set of characters. Only alphanumeric characters plus period, underscore, and hyphen are allowed. I've tested the regex [^\w.-] here http://gskinner.com/RegExr/ and it matches what I want removed, so I'm not sure why sed is returning the opposite. What am I missing?

My end goal is to input "Â10.41.89.50 " and get "10.41.89.50".

I've tried:

`echo "Â10.41.89.50 " | sed s/[^\w.-]//g` returns `Â...`

`echo "Â10.41.89.50 " | sed s/[\w.-]//g` and `echo "Â10.41.89.50 " | sed s/[\w^.-]//g` both return `Â10418950`

I attempted the answer found in "Skip/remove non-ascii character with sed", but nothing was removed.

anubhava
wanderingandy
  • Try adding the `-r` option to `sed` so it will recognize extended regular expressions. – Barmar Nov 15 '13 at 17:44
  • `sed` doesn't understand the special character classes like `\w`. Just use `[a-zA-Z0-9_-]`. – Mark Reed Nov 15 '13 at 17:50
  • Neither `-r` nor using `[a-zA-Z0-9_-]` works. Well, `echo "Â10.41.89.50 " | sed s/[a-zA-Z0-9.-]//g` returned `Â` but `echo "Â10.41.89.50 " | sed s/[^a-zA-Z0-9.-]//g` still returned `Â10.41.89.50`. – wanderingandy Nov 15 '13 at 18:06

6 Answers

83

tr's -c (complement) flag may be an option:

echo "Â10.41.89.50-._ " | tr -cd '[:alnum:]._-'
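As the comments on this answer point out, `[:alnum:]` follows the current locale, so an accented letter such as "Â" can survive. A minimal sketch, assuming only ASCII letters and digits should be kept, is to force the C locale and spell the ranges out. The function name `sanitize` is made up for illustration.

```shell
# Hypothetical helper: keep only ASCII letters, digits, '.', '_' and '-'.
sanitize() {
  # LC_ALL=C makes tr work on single bytes, so the bytes that encode "Â"
  # are not classified as letters. -c complements the set, -d deletes it.
  # Note that tr takes plain ranges here, without surrounding brackets.
  printf '%s' "$1" | LC_ALL=C tr -cd 'A-Za-z0-9._-'
}

cleaned=$(sanitize "Â10.41.89.50 ")
echo "$cleaned"
```

The trailing `-` in the set is taken literally because it is the last character.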
iruvar
  • @AlexanderMills, @Herlon, While the `tr` incantation above is POSIX-compliant, I do not have MacOS handy to test – iruvar May 21 '18 at 17:14
  • This answer works fine on macOS, it's just that the locale includes non-English letters in the `:alnum:` character class (as it should). If you want to remove non-English characters, try this: `echo "Â10.41.89.50-._ /" | tr -cd '[a-zA-Z0-9]._-'` – tjmcewan Oct 29 '18 at 22:14
  • @iruvar, I think you need to drop those extra brackets (the class is '[:alnum:]' not '[[:alnum:]]' and tr isn't sed/perl etc.) as otherwise your expression will allow those non-alphanumerics ('[', ']') through. I've just hit this – elbeardmorez Mar 04 '19 at 07:48
  • Using `LANG=C tr -cd '...'` might be a good idea – Fravadona Sep 29 '22 at 18:26
25

You might want to use the [:alpha:] class instead:

echo "Â10.41.89.50 " | sed "s/[[:alpha:].-]//g"

should work. If not, you might need to change your locale settings.

On the other hand, if you only want to keep the digits, the hyphens and the periods:

echo "Â10.41.89.50 " | sed "s/[^[:digit:].-]//g"
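The comments under the question report that a negated class still left the "Â" in place. One possible explanation is the locale: depending on it, sed may treat that character as a letter, or as a byte sequence it cannot match at all. A minimal sketch, assuming the goal is to keep only ASCII letters, digits, period, underscore and hyphen, forces the C locale and quotes the expression so the shell passes it to sed unchanged. The function name `strip_with_sed` is made up for illustration.

```shell
# Hypothetical helper: delete every byte that is not an ASCII letter,
# digit, '.', '_' or '-'.
strip_with_sed() {
  # Single quotes stop the shell from touching the expression
  # (an unquoted backslash, for example, would be removed before sed runs).
  # LC_ALL=C makes sed match byte by byte with ASCII-only ranges.
  printf '%s\n' "$1" | LC_ALL=C sed 's/[^A-Za-z0-9._-]//g'
}

result=$(strip_with_sed "Â10.41.89.50 ")
echo "$result"
```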

If your string is in a variable, you can use pure bash and parameter expansions for that:

$ dirty="Â10.41.89.50 "
$ clean=${dirty//[^[:digit:].-]/}
$ echo "$clean"
10.41.89.50

or

$ dirty="Â10.41.89.50 "
$ clean=${dirty//[[:alpha:]]/}
$ echo "$clean"
10.41.89.50

You can also have a look at 1_CR's answer.

gniourf_gniourf
  • @dw1: No, I don't think so. In the first example, we want to *remove all letters, periods and hyphens,* and that's what the command does (`sed` replaces these symbols by nothing). Last example is the same logic, but with Bash's parameter expansion. – gniourf_gniourf Jan 05 '21 at 12:23
13

To remove all characters except alphanumerics and "-", use this code:

echo "a b-1_2" | sed "s/[^[:alnum:]-]//g"
panticz
7

Well, sed won't support Unicode characters. Use perl instead:

> s="Â10.41.89.50 "
> perl -pe 's/[^\w.-]+//g' <<< "$s"
10.41.89.50
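Whether `\w` matches an accented letter in perl depends on how the input is decoded, so a variant that does not rely on the defaults may be safer. A minimal sketch, assuming perl 5.14 or later: the `/a` modifier restricts `\w` to ASCII word characters. The function name `clean_ascii` is made up for illustration.

```shell
# Hypothetical helper: keep only ASCII word characters, '.' and '-'.
clean_ascii() {
  # /a limits \w to [A-Za-z0-9_], so the bytes of "Â" are removed
  # however the input happens to be decoded.
  printf '%s' "$1" | perl -pe 's/[^\w.-]+//ga'
}

if command -v perl >/dev/null 2>&1; then
  cleaned_perl=$(clean_ascii "Â10.41.89.50 ")
  echo "$cleaned_perl"
fi
```

Like the one-liner above, this also strips newlines, since `\n` is not in the allowed set.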
anubhava
2
s/[^[:alnum:]+._-]//g

removes anything other than alphanumeric and ".+_-" characters.

echo "Â10.41.89.50 +-_" | sed s/[^[:alnum:]+._-]//g
Â10.41.89.50+-_
Iwan Plays
0
`[[:alnum:]_.@]`

This worked just fine for me. It preserved all of the characters I specified for my purposes.

spenibus
technerdius