Replace all non alphanumeric characters, new lines, and multiple white space with one space

Question

I'm looking for a neat regex solution to replace

All non alphanumeric characters
All newlines
All multiple instances of white space

With a single space

For those playing at home (the following does work)

text.replace(/[^a-z0-9]/gmi, " ").replace(/\s+/g, " ");

My thinking is regex is probably powerful enough to achieve this in one statement. The components I think I'd need are

[^a-z0-9] - to remove non alphanumeric characters
\s+ - match any run of whitespace
\r?\n|\r - match all newlines
/gmi - global, multi-line, case insensitive

However, I can't seem to combine them in the right way (the following doesn't work)

text.replace(/[^a-z0-9]|\s+|\r?\n|\r/gmi, " ");

Input

234&^%,Me,2 2013 1080p x264 5 1 BluRay
S01(*&asd 05
S1E5
1x05
1x5

Desired Output

234 Me 2 2013 1080p x264 5 1 BluRay S01 asd 05 S1E5 1x05 1x5

How exactly does your attempt not work? What goes wrong? – Pointy Jan 01 '14 at 01:52

Jonny 5 · Accepted Answer · 2014-09-14T20:37:33.233

336

Be aware that \W leaves the underscore in place. A short equivalent for [^a-zA-Z0-9] would be [\W_]:

text.replace(/[\W_]+/g," ");

\W is the negation of shorthand \w for [A-Za-z0-9_] word characters (including the underscore)

Example at regex101.com
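To make the underscore difference concrete, here is a minimal sketch run against the sample input from the question. The `sample` variable name and the extra `_` after `S01` are assumptions added purely for illustration; they are not part of the original post.

```javascript
// Sample input from the question, with "_" inserted after "S01" for illustration only.
const sample = "234&^%,Me,2 2013 1080p x264 5 1 BluRay\nS01_(*&asd 05\nS1E5\n1x05\n1x5";

// \W alone treats "_" as a word character, so the underscore survives.
const withW = sample.replace(/\W+/g, " ");

// [\W_] adds the underscore to the set, so it is replaced along with everything else.
const withWUnderscore = sample.replace(/[\W_]+/g, " ");

console.log(withW);           // 234 Me 2 2013 1080p x264 5 1 BluRay S01_ asd 05 S1E5 1x05 1x5
console.log(withWUnderscore); // 234 Me 2 2013 1080p x264 5 1 BluRay S01 asd 05 S1E5 1x05 1x5
```

Because the + quantifier consumes each run of unwanted characters as a single match, newlines and repeated spaces collapse to one space in the same pass.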

edited Sep 14 '14 at 20:37

answered Jan 01 '14 at 02:02

Jonny 5


Check it and test it, have not much experience in js-regex yet :p Happy you like it – Jonny 5 Jan 01 '14 at 02:09

Note that `\W` will also recognize non-Latin characters as non-word chars. – webketje May 23 '15 at 10:06

I marked this answer correct after all these years, because i looked back and the accepted didn't exclude underscores – TheGeneral Apr 12 '18 at 11:23

T-CatSan · Answer 2 · 2014-01-01T02:55:12.743

151

Jonny 5 beat me to it. I was going to suggest using the \W+ without the \s as in text.replace(/\W+/g, " "). This covers white space as well.

edited Jan 01 '14 at 02:55

answered Jan 01 '14 at 02:20

T-CatSan


Thanks @T-CatSan for pointing that out! Upped it, and Saruman, you're free to change best answer to whatever :-) But it should be `\W+`, not `[W+]` Well, happy new year all! – Jonny 5 Jan 01 '14 at 02:30
Thanks, @Jonny5! I've made the change you suggested. I had tested with the brackets before and now I see it works without them. Happy New Year to you, too. – T-CatSan Jan 01 '14 at 02:44

hey @T-CatSan is there a way to add exceptions? I want to keep characters `&` and `-`. Any tips? – Renato Gama Nov 30 '15 at 13:02

I made the following change /(\W+)|(_)/g to ignore _ also. But just wondering why it is not ignoring in the first model and is my regex is the efficient one. – Sridhar Gudimela Jan 25 '18 at 19:50
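The two follow-up comments above can be explored with a short sketch. It is only one possible approach: the sample strings and variable names are made up for illustration. To keep certain characters, one option is to spell out the allowed set in a negated class instead of using \W. The alternation /(\W+)|(_)/g does replace underscores, but it matches a run of non-word characters and an underscore as separate matches, so adjacent ones produce more than one space.

```javascript
// Keeping "&" and "-": list them inside the negated class ("-" last, so it is literal).
const keepSome = "AT&T pre-order #1".replace(/[^a-z0-9&-]+/gi, " ");
console.log(keepSome); // AT&T pre-order 1

// Alternation: "_" and the following space are two separate matches, giving two spaces.
const viaAlternation = "snake_ case".replace(/(\W+)|(_)/g, " ");

// Character class with a quantifier: "_ " is one run, giving one space.
const viaClass = "snake_ case".replace(/[\W_]+/g, " ");

console.log(JSON.stringify(viaAlternation)); // "snake  case"
console.log(JSON.stringify(viaClass));       // "snake case"
```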

score 20 · Answer 3 · answered Jan 01 '14 at 02:05


Since the [^a-z0-9] character class matches everything that is not alphanumeric, it matches whitespace characters too!

text.replace(/[^a-z0-9]+/gi, " ");

answered Jan 01 '14 at 02:05

Casimir et Hippolyte


nice and working – yigitt Jan 19 '22 at 19:53

score 9 · Answer 4 · answered Jan 01 '14 at 01:58


Well I think you just need to add a quantifier to each pattern. Also the carriage-return thing is a little funny:

text.replace(/[^a-z0-9]+|\s+/gmi, " ");

Edit: \s matches \r and \n too, so the separate newline alternatives are unnecessary.
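A small sketch of why the quantifier matters, using the first few characters of the question's input (the variable names are illustrative). Without +, the first alternative [^a-z0-9] matches one character at a time, so every unwanted character gets its own space.

```javascript
const line = "234&^%,Me";

// Original attempt: each of the four characters "&^%," is replaced separately.
const perChar = line.replace(/[^a-z0-9]|\s+|\r?\n|\r/gmi, " ");

// With quantifiers: the whole run "&^%," is a single match.
const perRun = line.replace(/[^a-z0-9]+|\s+/gmi, " ");

console.log(JSON.stringify(perChar)); // "234    Me"
console.log(JSON.stringify(perRun));  // "234 Me"
```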

answered Jan 01 '14 at 01:58

Pointy


Yeah there was some tom foolery in there gleaned from other answers on the topic, however that works great thanks! – TheGeneral Jan 01 '14 at 02:02

TheGeneral · Answer 5 · 2021-12-16T20:18:09.850

Update

Please be aware that the browser landscape changes rapidly; these benchmarks are likely to be woefully out of date, and possibly misleading, by the time you are reading this.

This is an old post of mine, and the other answers are good for the most part. However, I decided to benchmark each solution and another obvious one (just for fun). I wondered if there was a difference between the regex patterns on different browsers with different sized strings.

So basically I used jsPerf on:

Testing in Chrome 65.0.3325 / Windows 10 0.0.0
Testing in Edge 16.16299.0 / Windows 10 0.0.0

The regex patterns I tested were

/[\W_]+/g
/[^a-z0-9]+/gi
/[^a-zA-Z0-9]+/g

I ran each pattern against strings of random characters of the following lengths

length 5000
length 1000
length 200

Example JavaScript I used: var newstr = str.replace(/[\W_]+/g," ");

Each run consisted of 50 or more samples on each regex, and I ran them 5 times on each browser.
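For anyone wanting to repeat something like this without jsPerf, here is a rough sketch of a timing loop. It is not the original benchmark: the helpers `randomString` and `opsPerSecond`, the iteration count, and the choice of printable ASCII are all assumptions, and a hand-rolled loop like this is much cruder than a proper benchmarking harness (no warm-up, no statistical sampling).

```javascript
// Hypothetical helper: builds a string of `length` random printable ASCII characters.
function randomString(length) {
  let out = "";
  for (let i = 0; i < length; i++) {
    out += String.fromCharCode(32 + Math.floor(Math.random() * 95)); // codes 32..126
  }
  return out;
}

// Hypothetical helper: rough operations per second for one pattern on one string.
function opsPerSecond(pattern, str, iterations = 2000) {
  const start = performance.now();
  for (let i = 0; i < iterations; i++) {
    str.replace(pattern, " ");
  }
  const elapsedMs = performance.now() - start;
  return iterations / (elapsedMs / 1000);
}

const patterns = [/[\W_]+/g, /[^a-z0-9]+/gi, /[^a-zA-Z0-9]+/g];
const results = [];
for (const length of [5000, 1000, 200]) {
  const str = randomString(length);
  for (const pattern of patterns) {
    results.push({ length, pattern: String(pattern), ops: opsPerSecond(pattern, str) });
  }
}
console.table(results);
```

The absolute numbers will differ from the tables in this answer, since they depend on the engine, the machine, and the loop itself.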

Let's race our horses!

Results

                                Chrome                  Edge
Chars   Pattern                 Ops/Sec     Deviation   Ops/Sec     Deviation
------------------------------------------------------------------------
5,000   /[\W_]+/g                19,977.80  1.09         10,820.40  1.32
5,000   /[^a-z0-9]+/gi           19,901.60  1.49         10,902.00  1.20
5,000   /[^a-zA-Z0-9]+/g         19,559.40  1.96         10,916.80  1.13
------------------------------------------------------------------------
1,000   /[\W_]+/g                96,239.00  1.65         52,358.80  1.41
1,000   /[^a-z0-9]+/gi           97,584.40  1.18         52,105.00  1.60
1,000   /[^a-zA-Z0-9]+/g         96,965.80  1.10         51,864.60  1.76
------------------------------------------------------------------------
  200   /[\W_]+/g               480,318.60  1.70        261,030.40  1.80
  200   /[^a-z0-9]+/gi          476,177.80  2.01        261,751.60  1.96
  200   /[^a-zA-Z0-9]+/g        486,423.00  0.80        258,774.20  2.15

Truth be told, the regexes in both browsers (taking the deviation into consideration) were nearly indistinguishable. However, I think if I ran this even more times the results would become a little clearer (but not by much).

Theoretical scaling for 1 character

                            Chrome                        Edge
Chars   Pattern             Ops/Sec     Scaled            Ops/Sec   Scaled
------------------------------------------------------------------------
5,000   /[\W_]+/g            19,977.80  99,889,000       10,820.40  54,102,000
5,000   /[^a-z0-9]+/gi       19,901.60  99,508,000       10,902.00  54,510,000
5,000   /[^a-zA-Z0-9]+/g     19,559.40  97,797,000       10,916.80  54,584,000
------------------------------------------------------------------------
1,000   /[\W_]+/g            96,239.00  96,239,000       52,358.80  52,358,800
1,000   /[^a-z0-9]+/gi       97,584.40  97,584,400       52,105.00  52,105,000
1,000   /[^a-zA-Z0-9]+/g     96,965.80  96,965,800       51,864.60  51,864,600
------------------------------------------------------------------------
  200   /[\W_]+/g           480,318.60  96,063,720      261,030.40  52,206,080
  200   /[^a-z0-9]+/gi      476,177.80  95,235,560      261,751.60  52,350,320
  200   /[^a-zA-Z0-9]+/g    486,423.00  97,284,600      258,774.20  51,754,840
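The Scaled column appears to be Ops/Sec multiplied by the string length, that is, characters processed per second. A quick check of that reading against the Chrome figures for /[\W_]+/g (the `chromeRuns` name is just for this sketch):

```javascript
// Chrome figures for /[\W_]+/g, copied from the results table.
const chromeRuns = [
  { chars: 5000, opsPerSec: 19977.80 },
  { chars: 1000, opsPerSec: 96239.00 },
  { chars: 200, opsPerSec: 480318.60 },
];

// Characters per second = string length * operations per second.
const scaled = chromeRuns.map(({ chars, opsPerSec }) => Math.round(chars * opsPerSec));

console.log(scaled); // [ 99889000, 96239000, 96063720 ]
```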

I wouldn't read too much into these results, as the differences are not really significant; all we can really tell is that Edge is slower :o . Also, that I was super bored.

Anyway, you can run the benchmark yourself.

Jsperf Benchmark here

This seems completely unrelated to OP's query. – kevinc Apr 06 '23 at 20:16

score 5 · Answer 6 · answered Jun 28 '22 at 21:42

When Unicode comes into play, use

text.replace(/[^\p{L}\p{N}]+/gu," ");

EXPLANATION

NODE                     EXPLANATION
--------------------------------------------------------------------------------
  [^\p{L}\p{N}]+           Any character except Unicode letters and digits
                           (1 or more times (matching the most amount possible))

JavaScript code snippet:

const text = `234&^%,Me,2 2013 1080p x264 5 1 BluRąy
S01(*&aśd 05
S1E5
1x05
1x5`
console.log(text.replace(/[^\p{L}\p{N}]+/gu, ` `))
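As a side-by-side sketch of the difference, the same text can be run through the ASCII-only pattern from the accepted answer and the Unicode-aware one. The shortened string and the variable names are assumptions for illustration; the accented letters are written as escapes (\u0105 is ą, \u015B is ś) so the example does not depend on file encoding.

```javascript
// "BluRąy S01(*&aśd 05" with the accented letters written as escapes.
const text = "BluR\u0105y S01(*&a\u015Bd 05";

// \W is ASCII-based in JavaScript, so ą and ś count as non-word characters here.
const asciiOnly = text.replace(/[\W_]+/g, " ");

// \p{L} and \p{N} need the u flag and match letters and numbers in any script.
const unicodeAware = text.replace(/[^\p{L}\p{N}]+/gu, " ");

console.log(asciiOnly);    // BluR y S01 a d 05
console.log(unicodeAware); // BluRąy S01 aśd 05
```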

score 4 · Answer 7 · answered Feb 21 '18 at 11:43


I saw a different post that also handled diacritical marks, which is great:

s.replace(/[^a-zA-Z0-9À-ž\s]/g, "")
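Two things are worth knowing about this pattern, shown in the sketch below (the sample string and names are made up for illustration). First, it deletes the unwanted characters and leaves whitespace untouched, so a second replace is still needed to collapse newlines and repeated spaces as the question asks. Second, the range À-ž runs from U+00C0 to U+017E, which also contains the non-letter symbols × (U+00D7) and ÷ (U+00F7).

```javascript
// "Crème brûlée,\n  déjà-vu! 3×5" with non-ASCII characters written as escapes.
const s = "Cr\u00E8me br\u00FBl\u00E9e,\n  d\u00E9j\u00E0-vu! 3\u00D75";

const cleaned = s
  .replace(/[^a-zA-Z0-9\u00C0-\u017E\s]/g, "") // same class as above, range written as escapes
  .replace(/\s+/g, " ");                       // collapse newlines and repeated spaces

console.log(cleaned); // Crème brûlée déjàvu 3×5
```

Note that "déjà-vu" becomes "déjàvu" because the hyphen is removed rather than replaced with a space, and the × sign survives because it falls inside the range.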

answered Feb 21 '18 at 11:43

Dmitri R117


score 3 · Answer 8 · answered Apr 19 '20 at 03:01


To replace dashes as well, do the following (\W already matches the dash, so listing it explicitly is redundant but harmless):

text.replace(/[\W_-]/g,' ');

answered Apr 19 '20 at 03:01

Gregory R.


score 1 · Answer 9 · answered Aug 30 '20 at 10:48


For anyone still struggling (like me...) after the more expert replies above, this works in C# in Visual Studio 2019:

outputString = Regex.Replace(inputString, @"\W", "_");

Remember to add

using System.Text.RegularExpressions;

answered Aug 30 '20 at 10:48

egginstone


score 0 · Answer 10 · answered Jul 28 '22 at 07:56


const processString = (str) => (
    str
    .replace(/[^a-z0-9\s]/gi, '') // remove everything except alphanumerics and whitespace
    .replace(/ +/g, ' ') // collapse runs of spaces into a single space
);
processString(' $ your    string    here #');

answered Jul 28 '22 at 07:56

Vaha


While this code may answer the question, providing additional context regarding why and/or how this code answers the question improves its long-term value and helps the OP and others understand the logic behind it. – stelioslogothetis Jul 28 '22 at 11:54
