Regex, every non-alphanumeric character except white space or colon

Question

How can I match this anywhere in a string?

Basically, I am trying to match all kinds of miscellaneous characters such as ampersands, semicolons, dollar signs, etc.

/[^a-zA-Z0-9\s\:]+/ - your star matches on 0 – Jason Jun 05 '23 at 19:21

Tudor Constantin · Accepted Answer · 2011-05-19T04:44:02.147

379

[^a-zA-Z\d\s:]

\d - numeric class
\s - whitespace
a-zA-Z - matches all the letters
^ - negates them all, so you get characters that are not letters, not digits, not whitespace and not colons
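
As a quick check of the class described above, here is a minimal JavaScript sketch. The sample string and the variable names are made up for illustration; only the character class comes from the answer.

```javascript
// Matches any single character that is not an ASCII letter, a digit,
// whitespace, or a colon.
const misc = /[^a-zA-Z\d\s:]/g;

// Hypothetical sample input.
const sample = "Total: $5 & tax; 100%";

// Collect every miscellaneous character found in the sample.
const found = sample.match(misc);
console.log(found);

// Remove them, keeping letters, digits, whitespace, and colons.
const cleaned = sample.replace(misc, "");
console.log(cleaned);
```

Whitespace is left untouched, so removing a symbol that sat between two spaces leaves both spaces behind.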

edited May 19 '11 at 04:44

answered May 19 '11 at 04:00

Tudor Constantin

That's what I was looking at also :)) - I have to promote your perfect answer – Tudor Constantin May 19 '11 at 04:55
24

The only thing that I found is that this removes special characters like é or ã. I would prefer [^\w\d\s:]. – Eric Belair Oct 30 '15 at 15:45
12

Downvoted because this will not catch non-Latin characters, nor "special" Latin characters. – damian Jan 18 '16 at 08:18
1

`\d` and `\s` are Perl extensions which are typically not supported by older tools like `grep`, `sed`, `tr`, `lex`, etc. – tripleee Dec 06 '19 at 08:26
1

Another answer that's only useful for English or other Latin-based languages without accents. I think the world is a wee bit larger than that :). Downvoted. – MS Berends Sep 26 '22 at 11:47

score 47 · Answer 2 · edited May 08 '20 at 12:30

47

This should do it:

[^a-zA-Z\d\s:]

edited May 08 '20 at 12:30

Peter Mortensen

answered May 19 '11 at 03:53

Luke Sneeringer

1

The rest either check for space but not whitespace or have the negation in the wrong spot to actually negate. – Zachary Scott May 19 '11 at 04:29
\w catches underscores also - which is a non-alphanumeric character – Tudor Constantin May 19 '11 at 04:50
Aha! I shall modify -- I didn't know that. I expect it works differently for different engines, but might as well give the OP the safe answer. – Luke Sneeringer May 19 '11 at 04:51
5

Downvoted because this will not catch non-Latin characters, nor "special" Latin characters. – damian Jan 18 '16 at 08:19
@damian, see https://stackoverflow.com/a/73853673/4575331 – MS Berends Sep 26 '22 at 11:48

Nick F · Answer 3 · 2016-06-02T11:29:52.700

28

If you want to treat accented Latin characters (e.g. à Ñ) as normal letters (i.e. avoid matching them too), you'll also need to include the appropriate Unicode range (\u00C0-\u00FF) in your regex, so it would look like this:

/[^a-zA-Z\d\s:\u00C0-\u00FF]/g

^ negates what follows
a-zA-Z matches upper and lower case letters
\d matches digits
\s matches white space (if you only want to match spaces, replace this with a space)
: matches a colon
\u00C0-\u00FF matches the Unicode range for accented Latin characters.

nb. Unicode range matching might not work for all regex engines, but the above certainly works in JavaScript (as seen in this pen on Codepen).

nb2. If you're not bothered about matching underscores, you could replace a-zA-Z\d with \w, which matches letters, digits, and underscores.
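
To see what the added range does and does not cover, here is a small JavaScript sketch. The sample text and the `isMisc` helper are illustrative only. Escapes are used for the accented letters so the result does not depend on how the source file is encoded.

```javascript
// Class from the answer: ASCII letters, digits, whitespace, colon,
// and the U+00C0-U+00FF range are all excluded from the match.
const misc = /[^a-zA-Z\d\s:\u00C0-\u00FF]/g;

// "Señor à la carte: 50% & más", written with escapes.
const sample = "Se\u00F1or \u00E0 la carte: 50% & m\u00E1s";
const found = sample.match(misc);
console.log(found);

// Hypothetical helper: true if a single character would be matched.
// It uses a non-global regex so repeated calls do not share lastIndex.
const isMisc = (ch) => /[^a-zA-Z\d\s:\u00C0-\u00FF]/.test(ch);

console.log(isMisc("\u00D7")); // multiplication sign, inside the range
console.log(isMisc("\u0142")); // Polish l with stroke, outside the range
```

The multiplication sign U+00D7 falls inside the range and is therefore treated like a letter, while letters outside Latin-1, such as U+0142, are still matched as "miscellaneous".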

edited Jun 02 '16 at 11:29

answered Jun 02 '16 at 11:23

Nick F

This range contains some characters which are not alphanumeric (U+00D7 and U+00F7), and excludes a lot of valid accented characters from non-Western languages like Polish, Czech, Vietnamese etc. – tripleee Dec 06 '19 at 08:15
1

Upvoted for the description of each part of the RegEx. – morajabi Dec 09 '19 at 12:49

score 16 · Answer 4 · edited May 08 '20 at 12:30

16

Try this:

[^a-zA-Z0-9 :]

JavaScript example:

"!@#$%* ABC def:123".replace(/[^a-zA-Z0-9 :]/g, ".")

See an online example:

http://jsfiddle.net/vhMy8/

edited May 08 '20 at 12:30

Peter Mortensen

answered May 19 '11 at 03:56

Topera

5

Downvoted because this will not catch non-Latin characters, nor "special" Latin characters. – damian Jan 18 '16 at 08:19
22

It is easy to down vote an answer, and yet more difficult to provide constructive information to the board, e.g. how does one then catch non-Latin characters, nor "special" Latin characters? As of my count to here you have down voted 3 answers for the same reason, and in my opinion for a rather minor tweak. For example, I am here to find a regex for exactly what is discussed in these answers. I don't care about character sets that will not be used in my application. Law of diminishing returns. – Jun 15 '16 at 13:16
2

Aaron might be a "minor tweak" to a US citizen, but highly relevant for... the rest of this planet. – Michael K. Borregaard Mar 17 '20 at 10:07
2

`[^a-zA-Z0-9 :]` can be replaced with `[^\w:]` – Moses Schwartz Aug 18 '20 at 17:40
`\w` includes underscores also, so keep an eye on that – Alter Lagos Jul 30 '21 at 00:34
@ user3842449 `I don't care about character sets that will not be used in my application.` Well, that's quite self-centric. Moreover, the question of the OP was to remove whitespaces, so this should answer this question. For pretty much all non-English languages, it doesn't. – MS Berends Sep 26 '22 at 11:34

score 7 · Answer 5 · answered Jun 09 '20 at 02:54

7

In JavaScript:

/[^\w_]/g

^ negation, i.e. select anything not in the following set

\w any word character (i.e. any alphanumeric character, plus underscore)

_ the underscore, listed explicitly although \w already covers it as a 'word' character

Usage example - const nonAlphaNumericChars = /[^\w_]/g;
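
A short JavaScript sketch of how this class behaves around the underscore. The sample string and variable names are made up for illustration. Because `\w` already contains `_`, the class gives the same matches as `\W`. To match the underscore as well, it has to go into a positive class such as `[\W_]`.

```javascript
const sample = "snake_case: 100% done!";

// The class from the answer and the \W shorthand.
const fromAnswer = sample.match(/[^\w_]/g);
const shorthand = sample.match(/\W/g);

// Variant that also matches the underscore.
const withUnderscore = sample.match(/[\W_]/g);

console.log(fromAnswer);     // the underscore is not among the matches
console.log(shorthand);      // same matches as above
console.log(withUnderscore); // the underscore is included
```

Neither pattern excludes whitespace or the colon, so both show up in the matches here.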

answered Jun 09 '20 at 02:54

Chris Halcrow

5

`[^\w_]` is the same as `[^\w]` (as `_` is a word char), and it is equal to `\W`. – Wiktor Stribiżew Aug 06 '21 at 14:28

score 5 · Answer 6 · answered Jul 16 '15 at 11:32

5

Matches every character that is not alphanumeric or white space, and also matches '_'.

var reg = /[^\w\s]|_/g;

answered Jul 16 '15 at 11:32

Vasyl Gutnyk

score 5 · Answer 7 · edited May 08 '20 at 12:31

5

If you mean "non-alphanumeric characters", try to use this:

var reg = /[^a-zA-Z0-9]/g; // negated character class, same form as [^abc]

edited May 08 '20 at 12:31

Peter Mortensen

answered Apr 26 '17 at 02:33

Kim-Trinh

score 1 · Answer 8 · edited May 08 '20 at 12:35

1

This regex works for C#, PCRE and Go to name a few.

It doesn't work for JavaScript on Chrome from what RegexBuddy says. But there's already an example for that here.

The main part of this is:

\p{L}

which represents any kind of letter from any language; it can also be written \p{Letter}.

The full regex itself: [^\w\d\s:\p{L}]

Example: https://regex101.com/r/K59PrA/2
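
For JavaScript engines that implement ES2018 Unicode property escapes (recent Node.js and browsers), a similar class can be written with the `u` flag. This is a minimal sketch under that assumption, not the regex from the answer: `\p{N}` is added so that non-ASCII digits also count as alphanumeric, `\w` and `\d` are dropped because `\p{L}` and `\p{N}` cover letters and digits, and the sample string is made up.

```javascript
// Requires the "u" flag; without it, \p{...} is not a property escape.
const misc = /[^\p{L}\p{N}\s:]/gu;

// Normalize to NFC so accented letters are single precomposed code points.
const sample = "Zażółć: 5€ & 日本語!".normalize("NFC");

const found = sample.match(misc);
console.log(found);
```

If the input may contain decomposed text (a base letter followed by combining marks), adding `\p{M}` to the class keeps those marks from being matched.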

edited May 08 '20 at 12:35

Peter Mortensen

answered Nov 20 '19 at 16:41

Ste

This is the only answer here which deals correctly with Unicode accented alphabetics in a proper way. Sadly, not all regex engines support this facility (even Python lacks it, as of 3.8, even though its regex engine is ostensibly PCRE-based). – tripleee Dec 06 '19 at 08:30
1

I'll remove Python from the answer, I thought I tested that but apparently not. Thanks for pointing that out. – Ste Dec 06 '19 at 12:19

MS Berends · Answer 9 · 2022-09-26T11:52:48.970

Previous solutions only seem reasonable for English or other Latin-based languages without accents, etc. Those answers are for that reason not generalised to answer the question.

According to the Whitespace character article on Wikipedia, these are all the whitespace characters in Unicode:

U+0009, U+000A, U+000B, U+000C, U+000D, U+0020, U+0085, U+00A0, U+1680, U+180E, U+2000, U+2001, U+2002, U+2003, U+2004, U+2005, U+2006, U+2007, U+2008, U+2009, U+200A, U+200B, U+200C, U+200D, U+2028, U+2029, U+202F, U+205F, U+2060, U+3000, U+FEFF

So in my opinion, the most inclusive solution would be (might be slow, but this is about accuracy):

\u0009\u000A\u000B\u000C\u000D\u0020\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u200B\u200C\u200D\u2028\u2029\u202F\u205F\u2060\u3000\uFEFF

Thus, for the "except white space or colon" part of the OP's question, prepend a caret ^ to negate the set, add the colon, and surround everything with [ and ] so that it means 'any character not in this set' (letters and digits still have to be added to the set separately):

"[^:\u0009\u000A\u000B\u000C\u000D\u0020\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u200B\u200C\u200D\u2028\u2029\u202F\u205F\u2060\u3000\uFEFF]"

Debuggex Demo
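
The class above excludes only the colon and the listed whitespace characters, so letters and digits would still match. The JavaScript sketch below builds the same class from the code point list and, as an assumption that is not part of the answer, adds `\p{L}\p{N}` so that alphanumerics are excluded too. It needs an engine with ES2018 Unicode property escapes; the sample string and variable names are made up.

```javascript
// Code points copied from the whitespace list above.
const whitespaceCodePoints = [
  0x0009, 0x000a, 0x000b, 0x000c, 0x000d, 0x0020, 0x0085, 0x00a0,
  0x1680, 0x180e, 0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005,
  0x2006, 0x2007, 0x2008, 0x2009, 0x200a, 0x200b, 0x200c, 0x200d,
  0x2028, 0x2029, 0x202f, 0x205f, 0x2060, 0x3000, 0xfeff,
];

// Turn each code point into a \uXXXX escape for use inside a class.
const whitespaceClass = whitespaceCodePoints
  .map((cp) => "\\u" + cp.toString(16).padStart(4, "0"))
  .join("");

// Assumption: \p{L}\p{N} is added here to exclude letters and digits.
const misc = new RegExp("[^:" + whitespaceClass + "\\p{L}\\p{N}]", "gu");

// Sample containing a zero-width space and an ideographic space.
const sample = "a\u200Bb: c & d\u3000#1";
const found = sample.match(misc);
console.log(found);
```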

Bonus: solution for R

trimws2 <- function(..., whitespace = "[\u0009\u000A\u000B\u000C\u000D\u0020\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u200B\u200C\u200D\u2028\u2029\u202F\u205F\u2060\u3000\uFEFF]") {
  trimws(..., whitespace = whitespace)
}

This is even faster than trimws() itself, whose default whitespace is "[ \t\r\n]".

microbenchmark::microbenchmark(trimws2(" \t\r\n"), trimws(" \t\r\n"))
#> Unit: microseconds
#>                   expr    min     lq     mean  median      uq     max neval cld
#>  trimws2(" \\t\\r\\n") 29.177 29.875 31.94345 30.4990 31.3895 105.642   100  a 
#>   trimws(" \\t\\r\\n") 45.811 46.630 48.25076 47.2545 48.2765 116.571   100   b

score -3 · Answer 10 · edited May 08 '20 at 12:31

-3

Try to add this:

^[^a-zA-Z\d\s:]*$

This has worked for me... :)
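
Note that the anchored form tests a whole string rather than finding individual characters inside it. A minimal JavaScript sketch of the difference follows; the variable names and sample inputs are made up, and the `+` variant is an illustration rather than part of the answer.

```javascript
// Anchored pattern from the answer: the entire string must consist
// of miscellaneous characters, and * also accepts zero of them.
const onlyMisc = /^[^a-zA-Z\d\s:]*$/;

const allSymbols = onlyMisc.test("$&%");   // whole string is symbols
const mixed = onlyMisc.test("a$&%");       // contains a letter
const empty = onlyMisc.test("");           // zero characters

// Using + instead of * requires at least one character.
const atLeastOne = /^[^a-zA-Z\d\s:]+$/;
const emptyWithPlus = atLeastOne.test("");

console.log(allSymbols, mixed, empty, emptyWithPlus);
```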

edited May 08 '20 at 12:31

Peter Mortensen

answered Jun 18 '14 at 06:51

Er Parthu

2

This seems to repeat the accepted answer from 2011. The `^` and `$` anchors confines it to match entire lines and the `*` quantifier means it also matches empty lines. – tripleee Dec 06 '19 at 08:23

score -3 · Answer 11 · answered May 15 '22 at 13:48

-3

[^\w\s:]

A character class matching any character that is not:

Alphanumeric (or an underscore, which \w also covers)
Whitespace
Colon

answered May 15 '22 at 13:48

its_ me
