419

I need to remove every character from a string that is not a letter (a-z, A-Z), a digit (0-9), or a space.

Does anyone have a function to do this?

kenorb
zuk1

7 Answers

815

It sounds like you almost knew what you wanted to do already; you basically defined it as a regex.

preg_replace("/[^A-Za-z0-9 ]/", '', $string);
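
As a minimal sketch of how that call behaves (the function name `keepAlnumAndSpaces` and the sample input are illustrative assumptions, not part of the original answer):

```php
<?php
// Hypothetical helper: strips everything except ASCII letters, digits, and spaces.
function keepAlnumAndSpaces(string $input): string
{
    // [^...] is a negated character class: it matches any single character
    // that is NOT listed, and each match is replaced with an empty string.
    return preg_replace('/[^A-Za-z0-9 ]/', '', $input);
}

echo keepAlnumAndSpaces('Hello, World! #42'), "\n"; // Hello World 42
```

Because the class lists only ASCII ranges, accented letters such as é are removed as well.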
lsl
Chad Birch
  • 11
    zuk1: regexbuddy is a great help with that – relipse May 12 '14 at 17:13
  • 3
    Here's an example if you want to include the hyphen as an allowed character. I needed this because I needed to strip out disallowed characters from a Moodle username, based on email addresses: preg_replace("/[^a-z0-9_.@\-]/", '', $string); – Evan Donovan May 22 '14 at 15:17
  • 2
    Would this work exactly the same with apostrophes (single-quotes) around the regular expression, instead of quotation marks (double-quotes)? E.g: `preg_replace('/[^A-Za-z0-9 ]/', '', $string);` – 2540625 Mar 20 '15 at 17:46
  • 4
    We want explanation about this :) . People come here to see Why it is the way it is. Please consider Regex explanation too! Thanks – Pratik Joshi Dec 06 '15 at 10:44
  • 4
    What if we want to keep accented characters? – wonzbak Jun 23 '16 at 09:00
  • Does it matter single or double quote? – Ömer An May 15 '20 at 19:39
  • as noted by @wonzbak this does not keep accent chars – albanx Apr 13 '22 at 14:48
190

For Unicode characters, use:

preg_replace("/[^[:alnum:][:space:]]/u", '', $string);
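
A small sketch of this pattern on accented text, assuming the source file and the input are valid UTF-8 (the sample string is made up for illustration):

```php
<?php
// Illustrative input; any valid UTF-8 string should work.
$input = 'Crème brûlée, 2 portions!';

// The u flag makes PCRE treat the pattern and the subject as UTF-8, so each
// accented letter is handled as one character rather than as separate bytes.
$clean = preg_replace('/[^[:alnum:][:space:]]/u', '', $input);

echo $clean, "\n"; // Crème brûlée 2 portions
```

Note that `[:space:]` keeps tabs and line breaks too, not only the space character, and that `preg_replace` returns `null` when the `u` flag is used on a subject that is not valid UTF-8.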
voondo
  • hi voondo , what's with the /ui thing.. what do you call it ? can anyone please shed me some light. Thank you. – Kevin Florenz Daus Feb 28 '14 at 07:39
  • 6
    For clarification, they're called flags. They're put after the closing delimiter (in this case it's "/", but it could be "~" or "@" or whatever character you want to use as long as the opening and closing delimiters are the same) and change the behavior of the expression. – Doktor J Apr 13 '14 at 22:04
  • 1
    Btw, `\w` includes `\d` and so the `\d` is unnecessary. Also, this is wrong because it will also leave underscores in the resulting string (which is also included in `\w`). – smathy Aug 16 '14 at 20:42
  • 3
    There's still an error in this, the character classes need to be terminated with ':]' so the correct line would be: preg_replace("/[^[:alnum:][:space:]]/ui", '', $string); – h00ligan Nov 17 '14 at 14:03
  • 5
    Is the `i` flag really necessary here since `[:alnum:]` already covers both cases? – But those new buttons though.. Sep 25 '15 at 12:28
  • This solution worked until i migrated to php 7.3, replaced with ```preg_replace("/[^a-z\d\s]/iu", '', $str);``` – pgee70 Nov 17 '19 at 23:31
  • this solution works fine with php 8 as well. I think is the best – albanx Apr 13 '22 at 14:52
59

A regular expression is your answer.

$str = preg_replace('/[^a-z\d ]/i', '', $str);
  • The i flag makes the match case-insensitive.
  • ^ at the start of a character class negates it: the class matches any character that is not listed.
  • \d matches any digit.
  • a-z matches every character from a to z. Because of the i flag, you don't have to specify both a-z and A-Z.
  • After \d there is a space, so spaces are allowed by this regex.
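
To see the effect of the `i` flag, here is a hedged side-by-side sketch (the input string and variable names are illustrative assumptions):

```php
<?php
$input = 'Ab1 Cd2!';

// With the i flag, a-z also covers A-Z.
$withFlag = preg_replace('/[^a-z\d ]/i', '', $input);

// Without the flag, uppercase letters fall outside a-z and are removed too.
$withoutFlag = preg_replace('/[^a-z\d ]/', '', $input);

echo $withFlag, "\n";    // Ab1 Cd2
echo $withoutFlag, "\n"; // b1 d2
```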
topher
raspi
  • 4
    We want explanation about this :) . People come here to see Why it is the way it is. Please consider Regex explanation too! Not everyone is advanced enough to know what you wrote there without explanation. Thanks – Pratik Joshi Dec 06 '15 at 10:48
  • @PratikCJoshi The i stands for case insensitive. ^ means, does not start with. \d matches any digit. a-z matches all characters between a and z. Because of the i parameter you don't have to specify a-z and A-Z. After \d there is a space, so spaces are allows in this regex. – bart Feb 10 '16 at 04:21
  • 1
    People **don't** read comments as answer. Please update answer! – Pratik Joshi Feb 10 '16 at 08:54
38

If you need to support other languages, instead of the typical A-Z, you can use the following:

preg_replace('/[^\p{L}\p{N} ]+/u', '', $string);
  • [^\p{L}\p{N} ] defines a negated character class (it matches any character that is not listed) consisting of:
    • \p{L}: a letter from any language.
    • \p{N}: a numeric character in any script.
    • ' ' (a single space): a space character.
  • + greedily matches the character class one or more times.
  • The u flag makes PCRE treat the pattern and the subject as UTF-8; without it, multi-byte characters are processed byte by byte and can be corrupted.

This will preserve letters and numbers from other languages and scripts as well as A-Z:

preg_replace('/[^\p{L}\p{N} ]+/u', '', 'hello-world'); // helloworld
preg_replace('/[^\p{L}\p{N} ]+/u', '', 'abc@~#123-+=öäå'); // abc123öäå
preg_replace('/[^\p{L}\p{N} ]+/u', '', '你好世界!@£$%^&*()'); // 你好世界

Note: This is a very old, but still relevant question. I am answering purely to provide supplementary information that may be useful to future visitors.

Jonathon
17

Here's a really simple regex for that:

\W|_

Use it as needed, with a forward slash (/) as the delimiter:

preg_replace("/\W|_/", '', $string);
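
One thing to be aware of: `\W` also matches the space, so this pattern removes spaces along with punctuation. The sketch below shows that, plus one possible variant that keeps spaces (the variant, the input, and the variable names are assumptions for illustration, not part of the original answer):

```php
<?php
$input = 'foo_bar baz-42';

// \W matches any non-word character (including the space); |_ adds the underscore.
$stripped = preg_replace('/\W|_/', '', $input);

// Hypothetical variant that keeps spaces: negate "word character or space",
// then handle the underscore separately.
$keptSpaces = preg_replace('/[^\w ]|_/', '', $input);

echo $stripped, "\n";   // foobarbaz42
echo $keptSpaces, "\n"; // foobar baz42
```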

Test it here with this great tool that explains what the regex is doing:

http://www.regexr.com/

scrollup
Alex Stephens
  • 1
    You still need the `/u` flag otherwise non-ascii letters are also removed. – Xeoncross Dec 30 '14 at 19:52
  • Neat [but would also match spaces](https://www.regex101.com/r/afwxAB/1) and if this is wanted, probably could double the performance by use of a *character class* and additional *quantifier* for *one or more* [`[\W_]+`](https://www.regex101.com/r/afwxAB/2) – bobble bubble Dec 31 '16 at 02:00
16
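[\W_]+

$string = preg_replace("/[\W_]+/u", '', $string);

This matches every character that is not a letter or a digit, including spaces and underscores, and deletes it. With the u flag, letters and digits outside A-Z, a-z, and 0-9 (such as é) are kept as well.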

See example here: https://regexr.com/3h1rj

<?php

$strings="

_____________________
--> Welcome to RegExr v2.1 by gskinner.com, proudly hosted by Media Temple!

Edit the Expression & Text to see matches. Roll over matches or the expression for details. Undo mistakes with ctrl-z. Save Favorites & Share expressions with friends or the Community. Explore your results with Tools. A full Reference & Help is available in the Library, or watch the video Tutorial.

Sample text for testing: ª²³µ - Académie Française ______________---__
abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ
0123456789 _+-.,!@#$%^&*();\\/|<>\"\'
12345 -98.7 3.141 .6180 9,000 +42
555.123.4567    +1-(800)-555-2468
foo@demo.net    bar.ba@test.co.uk
www.demo.com    http://foo.co.uk/
http://regexr.com/foo.html?q=bar
https://mediatemple.net
";

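/* Option 1: remove every non-word character, including line breaks.
   Note: [\W] alone keeps underscores, unlike [\W_]. */
$string = preg_replace("/[\W]+/u", '', $strings);
echo "Option 1:".$string;
/* Option 2: same as Option 1, but keep line breaks */
$string = preg_replace("/[^\n\w]+/u", '', $strings);
echo "\n\nOption 2:". $string;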
?>

Output for PHP 8.1.12

Option 1: _____________________WelcometoRegExrv21bygskinnercomproudlyhostedbyMediaTempleEdittheExpressionTexttoseematchesRollovermatchesortheexpressionfordetailsUndomistakeswithctrlzSaveFavoritesShareexpressionswithfriendsortheCommunityExploreyourresultswithToolsAfullReferenceHelpisavailableintheLibraryorwatchthevideoTutorialSampletextfortestingª²³µAcadémieFrançaise________________abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_1234598731416180900042555123456718005552468foodemonetbarbatestcoukwwwdemocomhttpfoocoukhttpregexrcomfoohtmlqbarhttpsmediatemplenet

Option 2: 

_____________________
WelcometoRegExrv21bygskinnercomproudlyhostedbyMediaTemple

EdittheExpressionTexttoseematchesRollovermatchesortheexpressionfordetailsUndomistakeswithctrlzSaveFavoritesShareexpressionswithfriendsortheCommunityExploreyourresultswithToolsAfullReferenceHelpisavailableintheLibraryorwatchthevideoTutorial

Sampletextfortestingª²³µAcadémieFrançaise________________
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
0123456789_
1234598731416180900042
555123456718005552468
foodemonetbarbatestcouk
wwwdemocomhttpfoocouk
httpregexrcomfoohtmlqbar
httpsmediatemplenet
Intacto
  • 1
    what does this regex /[\W_]+/u means ? – Ângelo Rigo Dec 04 '17 at 17:38
  • 2
    `\W` is the inverse of `\w` which are characters `A-Za-z0-9_`. So `\W` will match any character that is not `A-Za-z0-9_` and remove them. The `[]` is a [character set boundary](https://www.regular-expressions.info/charclass.html). The`+` is redundant on a character set boundary but normally means 1 or more character. The `u` flag expands the expression to include unicode character support, meaning it will not remove characters beyond character code 255 such as `ª²³µ` . Example of various usages https://3v4l.org/hSVV5 with unicode and ascii characters. – Will B. Apr 25 '19 at 14:33
3
preg_replace("/\W+/", '', $string)

You can test it here: http://regexr.com/
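
A quick sketch of what this pattern does and does not remove (the input and variable names are illustrative assumptions). Since `\w` includes the underscore, `\W+` leaves underscores in place while stripping spaces and punctuation:

```php
<?php
$input = 'snake_case name (v2)';

// \W+ matches runs of non-word characters; the underscore counts as a word character.
$result = preg_replace('/\W+/', '', $input);

// Adding _ to the class removes underscores as well.
$noUnderscore = preg_replace('/[\W_]+/', '', $input);

echo $result, "\n";       // snake_casenamev2
echo $noUnderscore, "\n"; // snakecasenamev2
```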

Goku