214

If I have a string with any type of non-alphanumeric character in it:

"This., -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation"

How would I get a no-punctuation version of it in JavaScript:

"This is an example of a string with punctuation"
hichris123
Quentin Fisk

17 Answers

262

If you want to remove specific punctuation from a string, it is probably best to explicitly remove exactly the characters you want, like this:

replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g,"")

Doing the above still doesn't return the string as you specified it, because the spaces that surrounded the removed punctuation are left behind. To collapse those extra spaces, do something like:

replace(/\s{2,}/g," ");

My full example:

var s = "This., -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation";
var punctuationless = s.replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g,"");
var finalString = punctuationless.replace(/\s{2,}/g," ");

Result of running the code in the Firebug console:

"This is an example of a string with punctuation"
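Since several comments suggest extending the list (for example with `?`, `[`, `]` and `@`), here is a minimal sketch of building the character class from a plain string so that escaping and accidental `-` ranges are handled in one place. The helper names `makePunctuationRegex` and `stripListedPunctuation` are made up for this illustration, and the character list is just the answer's set plus the extras mentioned in the comments.

```javascript
// Hypothetical helper: turns a plain list of characters into a global
// character-class regex.
function makePunctuationRegex(chars) {
  // Escape the characters that are special inside [...]: \ ] [ ^ -
  const escaped = chars.replace(/[\\\]\[^-]/g, "\\$&");
  return new RegExp("[" + escaped + "]", "g");
}

// The answer's original set plus extras suggested in the comments: ? [ ] @ + < >
const punctuation = makePunctuationRegex(".,/#!$%^&*;:{}=-_`~()?[]@+<>");

function stripListedPunctuation(str) {
  return str
    .replace(punctuation, "")   // drop only the listed characters
    .replace(/\s{2,}/g, " ");   // collapse the leftover runs of whitespace
}

console.log(stripListedPunctuation("Is it? [yes]")); // "Is it yes"
```

Anything not in the list (apostrophes, for instance) is left alone, which is the point of the explicit approach.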

Mike Grace
  • 5
    Curly braces in regex apply a quantifier to the preceding, so in this case it's replacing between 2 and 100 whitespace characters (`\s`) with a single space. If you want to collapse any number of whitespace characters down to one, you would leave off the upper limit like so: `replace(/\s{2,}/g, ' ')`. – Mike Partridge Sep 27 '11 at 12:33
  • 13
    I've added a few more chars to list of punctuation replaced (`@+?><[]+`): `replace(/[\.,-\/#!$%\^&\*;:{}=\-_\`~()@\+\?><\[\]\+]/g, '')`. If anyone is looking for a yet-slightly-more-complete set. – timmfin Jan 24 '14 at 18:27
  • 9
    Python's string.punctuation defines punctuation as: ```!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~``` Which works better for me, so another alternative would be: ```replace(/['!"#$%&\\'()\*+,\-\.\/:;<=>?@\[\\\]\^_`{|}~']/g,"");``` – 01AutoMonkey Aug 31 '14 at 20:28
  • I think you're missing `[]` from the list of punctuation characters. – Alix Axel Sep 19 '15 at 12:44
  • I think the part with `... ,-\/ ... ` in your regex is misleading. The `-` here will be interpreted like the range operator. You want: `... ,\-\/ ... `. It doesn't make a difference in this case because you're trying to be comprehensive, but if you want to remove the period for instance it wouldn't work (because the period is within the range) – Antoine Lizée Jan 11 '16 at 19:12
  • 1
    @AntoineLizée I agree that it's misleading. Updated the answer. Thanks. – Mike Grace Jan 12 '16 at 00:52
  • @MikeGrace But now it's twice :-) – Antoine Lizée Jan 15 '16 at 00:32
  • @AntoineLizée and that's what I get for not carefully rereading all my code. – Mike Grace Jan 15 '16 at 14:43
  • @MikeGrace but what to do when you have "text?". Your replace will remove the word all together. – trusk Jun 21 '16 at 17:15
  • 2
    I've tried with "it?" - doesn't work for me (https://regex101.com/r/F4j5Qc/1), the right solution is: /[.,\/#!$%\^&\*;:{}=\-_`~()\?]/g – Maxim Firsoff Jan 28 '17 at 08:55
  • I know this is super old, but thank you! Had to find out the hard way why the for loop needs to run backwards :) – Max Pekarsky Oct 18 '17 at 16:51
  • 1
    Just wanting to point out also that if you want to remove `[` and `]` then you'll need to add them into the `[]`, like so: `/[\[\]?.,\/#!$%\^&\*;:{}=\-_`~()]/g` – MalcolmOcean Nov 02 '18 at 01:25
  • 1
    This is a bad answer. The question was "how do I remove **all punctuation** from a string?", not "how do I remove specific characters from a string?". Yes, there is a valid use case for this pattern, but it should not be the top+accepted answer without a better explanation for why. – Tom Lord Jun 19 '20 at 12:16
  • 2
    2020 update: all browsers now support unicode character classes in regexp... `var punctuationless = s.replace(/[^\p{L}\s]/gu,"");` works everywhere today. – Bill Barry Oct 01 '20 at 20:35
188
str = str.replace(/[^\w\s\']|_/g, "")
         .replace(/\s+/g, " ");

Removes everything except alphanumeric characters, whitespace and single quotes (apostrophes), then collapses multiple adjacent whitespace characters into single spaces.

Detailed explanation:

  1. \w is any digit, letter, or underscore.
  2. \s is any whitespace.
  3. [^\w\s\'] is anything that's not a digit, letter, whitespace, underscore or a single quote.
  4. [^\w\s\']|_ is the same as #3 except that underscores are matched too, so they are removed as well.
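A small sketch of how that pattern behaves, using the regex exactly as given above. The function name `stripPunctuation` is only an illustrative wrapper, and the accented letter is written as an escape so the example does not depend on file encoding.

```javascript
// Illustrative wrapper around the two replace calls from the answer.
function stripPunctuation(str) {
  return str
    .replace(/[^\w\s']|_/g, "") // drop everything except [A-Za-z0-9], whitespace and '
    .replace(/\s+/g, " ");      // collapse runs of whitespace
}

console.log(stripPunctuation("don't stop_now!")); // "don't stopnow"

// \w is ASCII-only in JavaScript, so accented letters are stripped too:
console.log(stripPunctuation("San Jos\u00E9"));   // "San Jos"
```

The second call shows the limitation raised in the comments: non-ASCII letters are treated like punctuation.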
André Levy
John Kugelman
  • 103
    This will also strip out non-English but otherwise perfectly alphanumeric characters like à, é, ö, as well as the entire Cyrillic alphabet. – Dan Abramov Mar 01 '12 at 13:40
  • 7
    @quemeful I disagree, the original question does not specify "for english only". SO is quite international, used all over the world. Anyone who speaks English and has internet access can use it. If the language is not specified in the question, then we should not be making any assumptions. We are in 2017, dammit! – Rolf Nov 08 '17 at 21:47
  • 2
    Also, even if you only support English you have loan words like résumé and names of places or people so you wouldn't want to break someone's ability to say they work in San José (the official spelling) in the cubicle between Ramón Chloé. – Chris Adams Dec 14 '17 at 18:38
  • 2
    This will mess with words such as `wouldn't` and `don't` – Bruno Francisco Feb 14 '19 at 13:50
  • what's the second `.replace(/\s+/g, " ");` accomplish here? – njboot Mar 31 '21 at 13:29
  • 2
    @njboot It collapses multiple adjacent whitespace to single spaces. – John Kugelman Mar 31 '21 at 15:23
  • Agreed, @null. Corrected. – André Levy Apr 29 '22 at 06:26
  • This doesn't work with it. :(. text = "The Fox asked the stork, 'How is the soup?'", – Brian Patterson May 01 '22 at 04:36
  • If you want to strip ALL punctuation, replace the regex in the first expression with this .. `/[^\w\s]/g` – Brian Patterson May 01 '22 at 04:45
81

Here are the standard punctuation characters for US-ASCII: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

For Unicode punctuation (such as curly quotes, em-dashes, etc), you can easily match on specific block ranges. The General Punctuation block is \u2000-\u206F, and the Supplemental Punctuation block is \u2E00-\u2E7F.

Put together, and properly escaped, you get the following RegExp:

/[\u2000-\u206F\u2E00-\u2E7F\\'!"#$%&()*+,\-.\/:;<=>?@\[\]^_`{|}~]/

That should match pretty much any punctuation you encounter. So, to answer the original question:

var punctRE = /[\u2000-\u206F\u2E00-\u2E7F\\'!"#$%&()*+,\-.\/:;<=>?@\[\]^_`{|}~]/g;
var spaceRE = /\s+/g;
var str = "This, -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation";
str.replace(punctRE, '').replace(spaceRE, ' ');

>> "This is an example of a string with punctuation"
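As a rough check of what this block-based pattern does and does not cover, the sketch below runs it on curly quotes and dashes (inside the General Punctuation block) and on Spanish inverted marks (which live in the Latin-1 Supplement block). The sample strings are made up for illustration and use escapes to avoid encoding problems.

```javascript
const punctRE = /[\u2000-\u206F\u2E00-\u2E7F\\'!"#$%&()*+,\-.\/:;<=>?@\[\]^_`{|}~]/g;

// Curly quotes (U+201C/U+201D), em dash (U+2014) and ellipsis (U+2026) are in range:
const curly = "\u201CHello\u201D \u2014 world\u2026";
console.log(curly.replace(punctRE, ""));   // "Hello  world" (two spaces remain)

// Inverted marks (U+00A1, U+00BF) are outside both blocks and survive:
const spanish = "\u00A1Hola! \u00BFQu\u00E9 tal?";
console.log(spanish.replace(punctRE, "")); // "¡Hola ¿Qué tal"
```

So the block ranges catch typographic punctuation but miss punctuation that sits in other blocks.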

US-ASCII source: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#posix

Unicode source: http://kourge.net/projects/regexp-unicode-block

nhahtdh
Joseph
  • 3
    For Unicode punctuation, the blocks are not enough. You have to look at the general category Punctuation, and you will see that not all punctuations are nicely located in those blocks. There are many familiar punctuations inside Latin blocks, for example. – nhahtdh Aug 03 '15 at 04:03
  • Very useful answer, may not work for corner cases of some languages but better than earlier ones! Thanks – mayank Jan 14 '22 at 08:26
24

As of 2021, many modern browsers support Unicode property escapes in JavaScript regular expressions (RegExp), so you can now simply use \p{P}:

str.replace(/[\p{P}$+<=>^`|~]/gu, '')

The regex can be simplified further if you want to strip all symbols (\p{S}) as well as punctuation:

str.replace(/[\p{P}\p{S}]/gu, '')

If you want to strip everything except letters (\p{L}), numbers (\p{N}) and separators (\p{Z}), you can use a negated character set like this (it works for non-English alphanumeric characters too):

str.replace(/[^\p{L}\p{N}\p{Z}]/gu, '')

The above regex works, but the more common use case is to use the regex whitespace class (\s) instead of the Unicode separator category, as the latter does not include tabs and line feeds. Try this:

str.replace(/[^\p{L}\p{N}\s]/gu, '')

const str = 'This., -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation';

console.log(str.replace(/[\p{P}$+<=>^`|~]/gu, ''));
console.log(str.replace(/[\p{P}\p{S}]/gu, ''));
console.log(str.replace(/[^\p{L}\p{N}\p{Z}]/gu, ''));
console.log(str.replace(/[^\p{L}\p{N}\s]/gu, ''));

You may also like to chain a .replace(/ +/g, ' ') to remove consecutive spaces.
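Putting those pieces together, a minimal sketch might look like this. The function name `stripPunctuationUnicode` is invented for the example, and using `\s+` plus `trim()` for the cleanup step is my choice rather than something the answer prescribes. It needs an engine with Unicode property escape support (the `u` flag).

```javascript
// Illustrative helper: remove Unicode punctuation and symbols, then tidy spaces.
function stripPunctuationUnicode(str) {
  return str
    .replace(/[\p{P}\p{S}]/gu, "") // punctuation (P) and symbols (S), any script
    .replace(/\s+/g, " ")          // collapse runs of whitespace
    .trim();                       // drop leading/trailing space left behind
}

console.log(stripPunctuationUnicode("\u00A1Hola! \u00BFQu\u00E9 tal?")); // "Hola Qué tal"

// \p{Z} does not cover tabs, so the negated set with \p{Z} removes them:
console.log("a\tb".replace(/[^\p{L}\p{N}\p{Z}]/gu, "")); // "ab"
// With \s instead, the tab is kept:
console.log("a\tb".replace(/[^\p{L}\p{N}\s]/gu, "") === "a\tb"); // true
```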

Feel free to play around with these! Ref:
Unicode Character Properties - Wikipedia
Unicode Property Escapes - MDN

brc-dd
16

/[^A-Za-z0-9\s]/g should match all punctuation but keep the spaces. So you can use .replace(/\s{2,}/g, " ") to replace extra spaces if you need to do so. You can test the regex in http://rubular.com/

.replace(/[^A-Za-z0-9\s]/g,"").replace(/\s{2,}/g, " ")

Update: This only works if the input is plain ASCII (unaccented English) text.

adnan2nd
  • 7
    You are assuming that the string is ANSI English. Not French with accented letters (àéô), nor German, Turkish. Unicode Arabic, Chinese, etc. will also disappear. – Rolf Nov 08 '17 at 21:35
  • 2
    Thanks, did not think about that completely. – adnan2nd Nov 12 '17 at 07:10
15

I ran across the same issue. This solution did the trick and was very readable:

var sentence = "This., -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation";
var newSen = sentence.match(/[^_\W]+/g).join(' ');
console.log(newSen);

Result:

"This is an example of a string with punctuation"

The trick is to use a negated set, which matches anything that is not in the set, e.g. [^abc] matches anything other than a, b or c.

\W is any non-word character, so [^\W]+ is a double negative: it matches runs of word characters (letters, digits and underscore).

Adding _ (underscore) to the negated set excludes underscores from the match as well.

Make it apply globally /g, then you can run any string through it and clear out the punctuation:

/[^_\W]+/g

Nice and clean ;)
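One caveat worth sketching: `match` with the `g` flag returns `null` when nothing matches, so calling `.join` directly throws on a string with no word characters. A guarded version, with the hypothetical name `wordsOnly`, could look like this:

```javascript
// Illustrative wrapper that tolerates strings with no word characters.
function wordsOnly(str) {
  const words = str.match(/[^_\W]+/g) || []; // match() returns null on no match
  return words.join(" ");
}

console.log(wordsOnly("?!..."));               // ""
// Newlines and underscores act as separators and come back as spaces:
console.log(wordsOnly("line one\nline_two"));  // "line one line two"
```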

jacobedawson
  • 1
    You also change all new lines into space with this method. – nhahtdh Aug 03 '15 at 04:55
  • 7
    This method only works in English, all accented characters are removed. – NicolasBernier Jul 10 '17 at 07:50
  • @NicolasBernier yeah that's 100% correct - JavaScript's regex engine is actually pretty lame (see: https://stackoverflow.com/questions/4043307/why-this-regex-is-not-working-for-german-words) - unfortunately for more complex tasks (and to create patterns for non-English words) it takes a fair bit more code. Still, for a quick & concise regex to strip punctuation it works :) – jacobedawson Jul 10 '17 at 11:27
  • This was the simplest and served my purpose well. – James Shrum Jan 21 '19 at 14:09
12

In a Unicode-aware language, the Unicode Punctuation character property is \p{P} — which you can usually abbreviate \pP and sometimes expand to \p{Punctuation} for readability.

Are you using a Perl Compatible Regular Expression library?

tchrist
  • 9
    Unfortunately JS isn't Perl compatible. The other problem is when I tested this it didn't capture all of the punctuation in @Quentin's test string => http://mikegrace.s3.amazonaws.com/forums/stack-overflow/regex-punctuation-capture-not-capture-all.png – Mike Grace Dec 01 '10 at 20:45
  • 4
    You can use the XRegExp library to get this extended syntax. – Eirik Birkeland Sep 11 '16 at 12:50
  • 2
    As of 2020, this should be the answer as modern browsers support the unicode character classes – Jarede Oct 27 '20 at 17:37
11

If you want to remove punctuation from any string you should use the P Unicode class.

However, Unicode property classes were not supported in JavaScript regular expressions before ES2018 (they now require the u flag), so for older engines you could try this regex, which should match all punctuation. It matches the following categories: Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So GeneralPunctuation SupplementalPunctuation CJKSymbolsAndPunctuation CuneiformNumbersAndPunctuation.

I created it using this online tool that generates regular expressions specifically for JavaScript. Here is the code:

var punctuationRegEx = /[!-/:-@[-`{-~¡-©«-¬®-±´¶-¸»¿×÷˂-˅˒-˟˥-˫˭˯-˿͵;΄-΅·϶҂՚-՟։-֊־׀׃׆׳-״؆-؏؛؞-؟٪-٭۔۩۽-۾܀-܍߶-߹।-॥॰৲-৳৺૱୰௳-௺౿ೱ-ೲ൹෴฿๏๚-๛༁-༗༚-༟༴༶༸༺-༽྅྾-࿅࿇-࿌࿎-࿔၊-၏႞-႟჻፠-፨᎐-᎙᙭-᙮᚛-᚜᛫-᛭᜵-᜶។-៖៘-៛᠀-᠊᥀᥄-᥅᧞-᧿᨞-᨟᭚-᭪᭴-᭼᰻-᰿᱾-᱿᾽᾿-῁῍-῏῝-῟῭-`´-῾\u2000-\u206e⁺-⁾₊-₎₠-₵℀-℁℃-℆℈-℉℔№-℘℞-℣℥℧℩℮℺-℻⅀-⅄⅊-⅍⅏←-⏧␀-␦⑀-⑊⒜-ⓩ─-⚝⚠-⚼⛀-⛃✁-✄✆-✉✌-✧✩-❋❍❏-❒❖❘-❞❡-❵➔➘-➯➱-➾⟀-⟊⟌⟐-⭌⭐-⭔⳥-⳪⳹-⳼⳾-⳿⸀-\u2e7e⺀-⺙⺛-⻳⼀-⿕⿰-⿻\u3000-〿゛-゜゠・㆐-㆑㆖-㆟㇀-㇣㈀-㈞㈪-㉃㉐㉠-㉿㊊-㊰㋀-㋾㌀-㏿䷀-䷿꒐-꓆꘍-꘏꙳꙾꜀-꜖꜠-꜡꞉-꞊꠨-꠫꡴-꡷꣎-꣏꤮-꤯꥟꩜-꩟﬩﴾-﴿﷼-﷽︐-︙︰-﹒﹔-﹦﹨-﹫!-/:-@[-`{-・¢-₩│-○-�]|\ud800[\udd00-\udd02\udd37-\udd3f\udd79-\udd89\udd90-\udd9b\uddd0-\uddfc\udf9f\udfd0]|\ud802[\udd1f\udd3f\ude50-\ude58]|\ud809[\udc00-\udc7e]|\ud834[\udc00-\udcf5\udd00-\udd26\udd29-\udd64\udd6a-\udd6c\udd83-\udd84\udd8c-\udda9\uddae-\udddd\ude00-\ude41\ude45\udf00-\udf56]|\ud835[\udec1\udedb\udefb\udf15\udf35\udf4f\udf6f\udf89\udfa9\udfc3]|\ud83c[\udc00-\udc2b\udc30-\udc93]/g;
var string = "This., -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation";
var newString = string.replace(punctuationRegEx, '').replace(/(\s){2,}/g, '$1');
console.log(newString)
Salvatore
9

I'll just put it here for others.

Match all punctuation characters for all languages:

The pattern is constructed from the Unicode punctuation category, with some common keyboard symbols added, such as $, brackets and \-=_

http://www.fileformat.info/info/unicode/category/Po/list.htm

basic replace:

".test'da, te\"xt".replace(/[\-=_!"#%&'*{},.\/:;?\(\)\[\]@\\$\^*+<>~`\u00a1\u00a7\u00b6\u00b7\u00bf\u037e\u0387\u055a-\u055f\u0589\u05c0\u05c3\u05c6\u05f3\u05f4\u0609\u060a\u060c\u060d\u061b\u061e\u061f\u066a-\u066d\u06d4\u0700-\u070d\u07f7-\u07f9\u0830-\u083e\u085e\u0964\u0965\u0970\u0af0\u0df4\u0e4f\u0e5a\u0e5b\u0f04-\u0f12\u0f14\u0f85\u0fd0-\u0fd4\u0fd9\u0fda\u104a-\u104f\u10fb\u1360-\u1368\u166d\u166e\u16eb-\u16ed\u1735\u1736\u17d4-\u17d6\u17d8-\u17da\u1800-\u1805\u1807-\u180a\u1944\u1945\u1a1e\u1a1f\u1aa0-\u1aa6\u1aa8-\u1aad\u1b5a-\u1b60\u1bfc-\u1bff\u1c3b-\u1c3f\u1c7e\u1c7f\u1cc0-\u1cc7\u1cd3\u2016\u2017\u2020-\u2027\u2030-\u2038\u203b-\u203e\u2041-\u2043\u2047-\u2051\u2053\u2055-\u205e\u2cf9-\u2cfc\u2cfe\u2cff\u2d70\u2e00\u2e01\u2e06-\u2e08\u2e0b\u2e0e-\u2e16\u2e18\u2e19\u2e1b\u2e1e\u2e1f\u2e2a-\u2e2e\u2e30-\u2e39\u3001-\u3003\u303d\u30fb\ua4fe\ua4ff\ua60d-\ua60f\ua673\ua67e\ua6f2-\ua6f7\ua874-\ua877\ua8ce\ua8cf\ua8f8-\ua8fa\ua92e\ua92f\ua95f\ua9c1-\ua9cd\ua9de\ua9df\uaa5c-\uaa5f\uaade\uaadf\uaaf0\uaaf1\uabeb\ufe10-\ufe16\ufe19\ufe30\ufe45\ufe46\ufe49-\ufe4c\ufe50-\ufe52\ufe54-\ufe57\ufe5f-\ufe61\ufe68\ufe6a\ufe6b\uff01-\uff03\uff05-\uff07\uff0a\uff0c\uff0e\uff0f\uff1a\uff1b\uff1f\uff20\uff3c\uff61\uff64\uff65]+/g,"")
"testda text"

With \s added, so that whitespace also acts as a separator:

".da'fla, te\"te".split(/[\s\-=_!"#%&'*{},.\/:;?\(\)\[\]@\\$\^*+<>~`\u00a1\u00a7\u00b6\u00b7\u00bf\u037e\u0387\u055a-\u055f\u0589\u05c0\u05c3\u05c6\u05f3\u05f4\u0609\u060a\u060c\u060d\u061b\u061e\u061f\u066a-\u066d\u06d4\u0700-\u070d\u07f7-\u07f9\u0830-\u083e\u085e\u0964\u0965\u0970\u0af0\u0df4\u0e4f\u0e5a\u0e5b\u0f04-\u0f12\u0f14\u0f85\u0fd0-\u0fd4\u0fd9\u0fda\u104a-\u104f\u10fb\u1360-\u1368\u166d\u166e\u16eb-\u16ed\u1735\u1736\u17d4-\u17d6\u17d8-\u17da\u1800-\u1805\u1807-\u180a\u1944\u1945\u1a1e\u1a1f\u1aa0-\u1aa6\u1aa8-\u1aad\u1b5a-\u1b60\u1bfc-\u1bff\u1c3b-\u1c3f\u1c7e\u1c7f\u1cc0-\u1cc7\u1cd3\u2016\u2017\u2020-\u2027\u2030-\u2038\u203b-\u203e\u2041-\u2043\u2047-\u2051\u2053\u2055-\u205e\u2cf9-\u2cfc\u2cfe\u2cff\u2d70\u2e00\u2e01\u2e06-\u2e08\u2e0b\u2e0e-\u2e16\u2e18\u2e19\u2e1b\u2e1e\u2e1f\u2e2a-\u2e2e\u2e30-\u2e39\u3001-\u3003\u303d\u30fb\ua4fe\ua4ff\ua60d-\ua60f\ua673\ua67e\ua6f2-\ua6f7\ua874-\ua877\ua8ce\ua8cf\ua8f8-\ua8fa\ua92e\ua92f\ua95f\ua9c1-\ua9cd\ua9de\ua9df\uaa5c-\uaa5f\uaade\uaadf\uaaf0\uaaf1\uabeb\ufe10-\ufe16\ufe19\ufe30\ufe45\ufe46\ufe49-\ufe4c\ufe50-\ufe52\ufe54-\ufe57\ufe5f-\ufe61\ufe68\ufe6a\ufe6b\uff01-\uff03\uff05-\uff07\uff0a\uff0c\uff0e\uff0f\uff1a\uff1b\uff1f\uff20\uff3c\uff61\uff64\uff65]+/g)

With ^ added to invert the pattern, so that it matches the words themselves rather than the punctuation:

".test';the, te\"xt".match(/[^\s\-=_!"#%&'*{},.\/:;?\(\)\[\]@\\$\^*+<>~`\u00a1\u00a7\u00b6\u00b7\u00bf\u037e\u0387\u055a-\u055f\u0589\u05c0\u05c3\u05c6\u05f3\u05f4\u0609\u060a\u060c\u060d\u061b\u061e\u061f\u066a-\u066d\u06d4\u0700-\u070d\u07f7-\u07f9\u0830-\u083e\u085e\u0964\u0965\u0970\u0af0\u0df4\u0e4f\u0e5a\u0e5b\u0f04-\u0f12\u0f14\u0f85\u0fd0-\u0fd4\u0fd9\u0fda\u104a-\u104f\u10fb\u1360-\u1368\u166d\u166e\u16eb-\u16ed\u1735\u1736\u17d4-\u17d6\u17d8-\u17da\u1800-\u1805\u1807-\u180a\u1944\u1945\u1a1e\u1a1f\u1aa0-\u1aa6\u1aa8-\u1aad\u1b5a-\u1b60\u1bfc-\u1bff\u1c3b-\u1c3f\u1c7e\u1c7f\u1cc0-\u1cc7\u1cd3\u2016\u2017\u2020-\u2027\u2030-\u2038\u203b-\u203e\u2041-\u2043\u2047-\u2051\u2053\u2055-\u205e\u2cf9-\u2cfc\u2cfe\u2cff\u2d70\u2e00\u2e01\u2e06-\u2e08\u2e0b\u2e0e-\u2e16\u2e18\u2e19\u2e1b\u2e1e\u2e1f\u2e2a-\u2e2e\u2e30-\u2e39\u3001-\u3003\u303d\u30fb\ua4fe\ua4ff\ua60d-\ua60f\ua673\ua67e\ua6f2-\ua6f7\ua874-\ua877\ua8ce\ua8cf\ua8f8-\ua8fa\ua92e\ua92f\ua95f\ua9c1-\ua9cd\ua9de\ua9df\uaa5c-\uaa5f\uaade\uaadf\uaaf0\uaaf1\uabeb\ufe10-\ufe16\ufe19\ufe30\ufe45\ufe46\ufe49-\ufe4c\ufe50-\ufe52\ufe54-\ufe57\ufe5f-\ufe61\ufe68\ufe6a\ufe6b\uff01-\uff03\uff05-\uff07\uff0a\uff0c\uff0e\uff0f\uff1a\uff1b\uff1f\uff20\uff3c\uff61\uff64\uff65]+/g)

For a language like Hebrew, consider removing the single and double quotes (' and ") from the pattern; this needs more thought.

I built the pattern using this script:

Step 1: In Firefox, hold Ctrl and select a column of U+1234 code points, then copy it. Do not copy five-digit values such as U+12456; they would end up replacing English characters.

Step 2 (I did this in Chrome): Find any textarea and paste the list into it, then right-click it and choose Inspect. You can then access the selected element as $0.

var x=$0.value
var z=x.replace(/U\+/g,"").split(/[\r\n]+/).map(function(a){return parseInt(a,16)})
var ret=[];z.forEach(function(a,k){if(z[k-1]===a-1 && z[k+1]===a+1) { if(ret[ret.length-1]!="-")ret.push("-");} else {  var c=a.toString(16); var prefix=c.length<3?"\\u0000":c.length<5?"\\u0000":"\\u000000"; var uu=prefix.substring(0,prefix.length-c.length)+c; ret.push(c.length<3?String.fromCharCode(a):uu)}});ret.join("")

Step 3: I copied the ASCII characters to the front as separate characters rather than ranges, because someone might want to add or remove individual characters.
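For readers who want to reproduce the range-building step without DevTools, here is a rough, more readable sketch of the same idea. It is not the author's script: `codePointsToCharClass` is a hypothetical name, it takes the copied `U+XXXX` list as a plain string instead of reading `$0.value`, and it assumes four-digit (BMP) code points only.

```javascript
// Hypothetical helper: turn a list like "U+055A\nU+055B\n..." into the body
// of a regex character class, merging runs of 3+ consecutive values into ranges.
function codePointsToCharClass(text) {
  const points = text
    .split(/\s+/)
    .filter(Boolean)
    .map((token) => parseInt(token.replace(/^U\+/i, ""), 16))
    .sort((a, b) => a - b);

  // Assumes BMP code points, so four hex digits are always enough.
  const esc = (cp) => "\\u" + cp.toString(16).padStart(4, "0");

  let out = "";
  for (let i = 0; i < points.length; ) {
    let j = i;
    while (j + 1 < points.length && points[j + 1] === points[j] + 1) j++;
    if (j - i >= 2) {
      out += esc(points[i]) + "-" + esc(points[j]); // run of 3 or more: a range
    } else {
      for (let k = i; k <= j; k++) out += esc(points[k]); // 1 or 2: list them
    }
    i = j + 1;
  }
  return out;
}

const body = codePointsToCharClass("U+055A\nU+055B\nU+055C\nU+0589\nU+0609\nU+060A");
console.log(body); // \u055a-\u055c\u0589\u0609\u060a
const generated = new RegExp("[" + body + "]", "g");
```

The resulting `generated` regex can then be extended by hand with the ASCII characters, as described in step 3.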

nhahtdh
Shimon Doodkin
6

For en-US (American English) strings this should suffice:

"This., -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation".replace( /[^a-zA-Z ]/g, '').replace( /\s\s+/g, ' ' )

Be aware that if you support UTF-8 text with non-ASCII characters such as Chinese or Russian, this will strip those characters as well, so you really have to specify what you want to keep.

meder omuraliev
3

If you want to retain only letters and spaces, you can do:

str.replace(/[^a-zA-Z ]+/g, '').replace(/ {2,}/g, ' ')
codaddict
3

If you are using lodash:

_.words('This, is : my - test,line:').join(' ')

With the example from the question:

_.words('"This., -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation"').join(' ')
Pankaj Avhad
2

As per Wikipedia's list of punctuation marks, I built the following regex, which detects punctuation:

[\.’'\[\](){}⟨⟩:,،、‒–—―…!.‹›«»‐\-?‘’“”'";/⁄·\&*@\•^†‡°”¡¿※#№÷׺ª%‰+−=‱¶′″‴§~_|‖¦©℗®℠™¤₳฿₵¢₡₢$₫₯֏₠€ƒ₣₲₴₭₺₾ℳ₥₦₧₱₰£៛₽₹₨₪৳₸₮₩¥]

Tushar Goswami
  • 2
    If using this regex, you should also escape your regex delimiter. For example, if you use `/` (most common) then it should be escaped inside the character class above by adding a back-slash before, like this: `\/`. This is how you would use it: `"String!! With, Punctuation.".replace(/[\.’'\[\](){}⟨⟩:,،、‒–—―…!.‹›«»‐\-?‘’“”'";\/⁄·\&*@\•^†‡°”¡¿※#№÷׺ª%‰+−=‱¶′″‴§~_|‖¦©℗®℠™¤₳฿₵¢₡₢$₫₯֏₠€ƒ₣₲₴₭₺₾ℳ₥₦₧₱₰£៛₽₹₨₪৳₸₮₩¥]+/g,"")`. By the way, I don't see the backtick (`) anywhere in there, how come? – Rolf Nov 08 '17 at 22:05
  • ´ is missing. Seems to be hard to find a list of all punctuations. – Alex Dec 08 '17 at 13:08
2

I think the simplest solution is:

.replaceAll(/[^a-zA-Z0-9]/g,"");

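Instead of listing every single punctuation character to filter out, you just check whether each character is outside the set you want to keep. Note that this set does not include whitespace, so spaces are removed too; add \s inside the brackets to keep them.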
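A short sketch of how this behaves in practice. The sample string is made up; `replaceAll` itself requires a reasonably recent engine, and with a regex argument it insists on the `g` flag.

```javascript
const sample = "Hello, world! 42";

// As written in the answer: spaces are stripped along with the punctuation.
console.log(sample.replaceAll(/[^a-zA-Z0-9]/g, ""));   // "Helloworld42"

// Keeping whitespace in the allowed set preserves word boundaries.
console.log(sample.replaceAll(/[^a-zA-Z0-9\s]/g, "")); // "Hello world 42"

// replaceAll throws a TypeError if the regex is not global.
let threwTypeError = false;
try {
  sample.replaceAll(/[^a-zA-Z0-9]/, "");
} catch (err) {
  threwTypeError = err instanceof TypeError;
}
console.log(threwTypeError); // true
```

With a global regex, plain `replace` gives the same result, so `replaceAll` is mostly a readability choice here.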

0

It depends on what you are trying to return. I used this recently:

return text.match(/[a-z]/i);
0

If you are targeting modern browsers (not IE), you can use Unicode character classes. This is especially helpful when you also need to support characters like German umlauts (äöü) and the like.

Here is what I ended up with. It removes everything that is not a letter, apostrophe or whitespace, and collapses runs of whitespace into a single space.

const textStripped = text
  .replace(/[’]/g, "'") // replace ’ with '
  .replace(/[^\p{Letter}\p{Mark}\s']/gu, "") // remove everything that is not a letter, mark, space or '
  .replace(/\s+/g, " ") // remove multiple spaces
.replace(/[’]/g, "'")

This first replaces ’ (typographic apostrophe) with ' (typewriter apostrophe), as both may be used in words like "don’t".

.replace(/[^\p{Letter}\p{Mark}\s']/gu, "")

\p{Letter} stands for any character that is categorized as a letter in Unicode.

The \p{Mark} category needs to be included to cover letter and mark combinations as well. For example, a German ä can be encoded as a single character or as a combination of "a" and a mark. This happens quite regularly when copying German text from PDFs.
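To illustrate the point about marks, the sketch below uses a decomposed "ä" (the letter a followed by U+0308, combining diaeresis). The sample word is made up, and the normalization line at the end is an additional option I am assuming may be useful, not part of the answer.

```javascript
const decomposed = "Ba\u0308r"; // "Bär" written as a + combining diaeresis

// Without \p{Mark}, the combining mark is stripped and the word changes:
console.log(decomposed.replace(/[^\p{Letter}\s']/gu, ""));                       // "Bar"

// With \p{Mark}, the decomposed form survives intact:
console.log(decomposed.replace(/[^\p{Letter}\p{Mark}\s']/gu, "") === decomposed); // true

// Optionally, normalizing to NFC first yields the single-character form:
console.log(decomposed.normalize("NFC") === "B\u00E4r");                          // true
```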

Source: https://dev.to/tillsanders/let-s-stop-using-a-za-z-4a0m

Andreas Riedmüller
-1

It's simple: just replace every non-word character with a space:

.replace(/[^\w]/g, ' ')
James Risner