120

I'm looking for a PHP function that takes an input string and returns a sanitized version of it, stripping away all special characters and leaving only alphanumeric characters.

I need a second function that does the same but only returns alphabetic characters A-Z.

Any help much appreciated.

Scott B
  • Which Unicode Normalization Form are these in, and whyever would you want to do this? – tchrist Mar 04 '11 at 21:03
  • 1
    When you say A-Z and 'alphanumeric', do you really mean only A-Z or do you want to match all letters from all languages, including foreign languages and obsolete scripts? – Mark Byers Mar 04 '11 at 21:04
  • If you’re doing this so you can do an accent-insensitive string comparison, you’re doing the wrong thing. – tchrist Mar 04 '11 at 21:08
  • 3
    It’s **not** just “from all languages”. It’s English. English uses the Latin script. There are `unichars '\p{Latin}' '\p{Alphabetic}' '[^A-Za-z]' | wc -l` == 1192 code points that are Latin alphabetics but which are not A-Z. It is a commonly held myth that ASCII is enough for English. It’s not, and that’s why writing A-Z has a **code smell** to it. – tchrist Mar 04 '11 at 21:10
  • @Mark: At present I'm only interested in English. – Scott B Mar 04 '11 at 21:15
  • 1
    @Scott B: English doesn't just use the 26 letters from A-Z. For example the word résumé includes é. Perhaps you could explain what you are trying to do as this might help get you better answers. – Mark Byers Mar 04 '11 at 21:17
  • @Mark, point taken. The function is used in a routine which takes a "primary keyword phrase" and evaluates a given block of html for appearances of the keyword phrase. The app is currently in US English, but it would be great to extend the reach. – Scott B Mar 04 '11 at 21:44
  • @Scott: If you turn `résumé` into `rsum`, you’ll lose, and if you turn it into `resume`, you will retrieve too many false positives. – tchrist Mar 04 '11 at 21:58
  • @tchrist: good point. What do you suggest to account for these special characters? – Scott B Mar 04 '11 at 22:20
  • @Scott: If what you’re searching is in Unicode, then you should not limit people to ASCII queries. If you are using Unicode, then you should look into several things: decomposition forms in both canonical and especially compatibility modes; string comparison via the Unicode Collation Algorithm, which allows for not just case-insensitivity but also accent- and/or punctuation-insensitivity; string sanitizing via RFC 3454’s “Preparation of Internationalized Strings (stringprep)”; default ignorable code points; and so on. – tchrist Mar 05 '11 at 00:03
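
As a rough illustration of two of the ideas raised in these comments (canonical decomposition and collation-based comparison), here is a minimal sketch. It assumes PHP 7 or later with the intl extension loaded, which provides the `Normalizer` and `Collator` classes; the variable names are made up for the example.

```php
<?php
// Assumption: the intl extension is available (Normalizer, Collator).
$word = "r\u{E9}sum\u{E9}"; // "résumé" written with precomposed é (U+00E9)

// Accent- and case-insensitive comparison: PRIMARY strength compares
// base letters only, ignoring accents and case.
$collator = new Collator('en_US');
$collator->setStrength(Collator::PRIMARY);
$sameWord = $collator->compare($word, 'RESUME') === 0;

// Canonical decomposition (NFD) turns each é into "e" followed by
// U+0301 COMBINING ACUTE ACCENT, so \p{M} can see the marks.
$decomposed = Normalizer::normalize($word, Normalizer::FORM_D);
$markCount = preg_match_all('/\p{M}/u', $decomposed);

var_dump($sameWord);  // bool(true)
var_dump($markCount); // int(2)
```

Comparing through a collator leaves the stored text untouched, which avoids the `rsum` versus `resume` trade-off described above.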

3 Answers

257

Warning: Note that English is not restricted to just A-Z.

Try this to remove everything except a-z, A-Z and 0-9:

$result = preg_replace("/[^a-zA-Z0-9]+/", "", $s);

If your definition of alphanumeric includes letters in foreign languages and obsolete scripts then you will need to use the Unicode character classes.
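
A minimal sketch of the Unicode variant, assuming the input is valid UTF-8 and PCRE was built with Unicode property support (the default in current PHP builds). The function name `keepUnicodeAlnum` is hypothetical, and the class follows the `[\pL\pM\pN]` approximation suggested in the comments.

```php
<?php
// Hypothetical helper: keep letters, combining marks and numbers
// from any script, and strip everything else.
function keepUnicodeAlnum(string $s): string
{
    // The u modifier makes PCRE treat pattern and subject as UTF-8.
    // preg_replace() returns null on failure (e.g. invalid UTF-8),
    // so fall back to an empty string in that case.
    return preg_replace('/[^\p{L}\p{M}\p{N}]+/u', '', $s) ?? '';
}

echo keepUnicodeAlnum("r\u{E9}sum\u{E9} #42!"), "\n"; // résumé42
```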

Try this to leave only A-Z:

$result = preg_replace("/[^A-Z]+/", "", $s);

The reason for the warning is that words like résumé contain the letter é, which these patterns will not match. If you want to match a specific list of letters, adjust the regular expression to include those letters. If you want to match all letters, use the appropriate character classes as mentioned in the comments.
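
Wrapped up as the two functions the question asks for, the ASCII-only patterns might look like the sketch below. The names `keepAlnum` and `keepAlpha` are made up for illustration, and `keepAlpha` assumes that "A-Z" in the question means letters of either case.

```php
<?php
// Hypothetical helper: keep only ASCII letters and digits.
function keepAlnum(string $s): string
{
    return preg_replace('/[^a-zA-Z0-9]+/', '', $s) ?? '';
}

// Hypothetical helper: keep only ASCII letters of either case.
// Use '/[^A-Z]+/' instead to keep uppercase letters only.
function keepAlpha(string $s): string
{
    return preg_replace('/[^a-zA-Z]+/', '', $s) ?? '';
}

echo keepAlnum('Hello, World! 123'), "\n"; // HelloWorld123
echo keepAlpha('Hello, World! 123'), "\n"; // HelloWorld
```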

Mark Byers
  • 2
    No, an alphanumeric is `[\p{Alphabetic}\p{Numeric}]`. I forget the PCRE alphabetic property, but you can approximate it with `[\pL\pM\pN]`. – tchrist Mar 04 '11 at 21:02
  • 1
    @tchrist: I assume that because he specifically mentioned A-Z that he only wants to match that, though I admit that the question could be a lot more clear on this point. I'll ask for a clarification. – Mark Byers Mar 04 '11 at 21:03
  • 1
    @Mark, I wasn’t arguing with the second part of your answer, although if he hasn’t canonically decomposed the string first, it won’t work right. I was arguing with the first part. Also, I try to always write regexes that work on **any** data, not just on moldy old ASCII. :) Hence the mantra that **this side of the Millennium, `[A-Z]` is always wrong, *sometimes* .** – tchrist Mar 04 '11 at 21:05
  • isn't `/[^a-zA-Z0-9]+/` the same as `/[^a-z0-9]+/i` ? – JD Isaacks Mar 04 '11 at 21:26
  • @John Isaacks: Yes, I think that's true almost always. Not sure if it's true in Turkey though. I don't have a Turkish computer on which I can test, but it depends on how PHP handles the Turkish I. Do you prefer `/.../i`? – Mark Byers Mar 04 '11 at 21:29
  • 1
    @Mark Byers, I see.. and Yes I prefer the `i` but I have only ever had to worry about an English demographic .. I forget many people have to think about other languages. BTW I just noticed you are the highest rep'd user who has never asked 1 question. Even Jon Skeet has asked questions before! – JD Isaacks Mar 04 '11 at 21:38
  • @Mark: You’re exactly right about Turkish I’s (`İ` and `ı`). Ranges are a problem because they often miss things. Unicode case insensitive matches are not the same as spelled out literals, because people forget that `/i/i` matches U+0130 İ, that `/k/i` matches K U+212A KELVIN SIGN, that `/f/i` matches ligatures like `ﬃ` and `ﬁ`, that `/s/i` matches `ſ` and `ß` and `ﬅ` etc. However, `/a/i` still doesn’t match `æ` nor does `/d/i` match `ǳ` (etc), so you may need to think about using the Unicode Collation Algorithm’s matchers, which are by default accent-insensitive. – tchrist Mar 04 '11 at 21:48
  • I would like to keep spaces, how would I do that? – James Wilson Sep 06 '13 at 09:58
  • 1
    why is there a + at the end of the regexp? Wouldn't it be ... same if you remove it? – Dennis Apr 11 '14 at 15:27
  • Both links go to a blank page. Could you fix that? – Michel Ayres May 13 '14 at 12:17
  • It's funny how such a simple and straight-forward question received so many completely irrelevant answers and comments (just another day on SO it seems...), Mark's answer is a wonderfully simple way to do exactly what OP asked for, nothing more, nothing less, +1 – KittenCodings Dec 28 '15 at 15:51
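
Two of the questions in these comments (keeping spaces, and whether the trailing `+` matters) can be checked with a short sketch. The sample string is made up for the example.

```php
<?php
$s = 'Hello,   World! 123';

// Adding a space inside the negated class keeps spaces in the output.
$withSpaces = preg_replace('/[^a-zA-Z0-9 ]+/', '', $s);
echo $withSpaces, "\n"; // Hello   World 123

// With or without +, the result is the same. The + only lets each run
// of unwanted characters be removed as one match instead of many.
$withPlus = preg_replace('/[^a-zA-Z0-9]+/', '', $s);
$withoutPlus = preg_replace('/[^a-zA-Z0-9]/', '', $s);
var_dump($withPlus === $withoutPlus); // bool(true)
```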
1

Try this to keep accented characters:

$result = preg_replace("/[^A-Za-zÀ-ÖØ-öø-ÿ0-9]+/u", "", $s);
Oli
-1

Rather than preg_replace, you could use PHP's filter functions: call filter_var() with FILTER_SANITIZE_STRING (deprecated since PHP 8.1).

samayo
Mark Baker
  • Does PHP have access to the ISO Stringprep algorithm? I know Perl and Java do. – tchrist Mar 04 '11 at 21:20
  • I believe the string filter function works predominantly with 7-bit ASCII, but don't quote me on that. – Mark Baker Mar 04 '11 at 21:26
  • 36
    Please, can you tell us an explicit way of doing what the user is asking for using `FILTER_SANITIZE_STRING`? To my knowledge, the closest that can be achieved this way is with `FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_LOW | FILTER_FLAG_STRIP_HIGH`, but that won't leave just letters and numbers; it also leaves dots, slashes, percent signs and the like. – Pere Apr 10 '14 at 10:20
  • $iMycleanVar= filter_var($sStringWithNumbers, FILTER_SANITIZE_NUMBER_INT); – Sultanos Dec 04 '17 at 15:29
  • 5
    It looks more like a comment rather than an answer. Give a proper explanation while writing an answer. – Siraj Alam Jun 18 '18 at 17:36
  • 3
    I don't believe there is an actual FILTER_SANITIZE to alphanumeric on there, unfortunately. Pretty major omission. – Kzqai Jan 14 '20 at 21:01
  • Answers require the EXACT code necessary to solve the OP's question, which is a step further than hinting at functions that could be used to arrive at an answer. Please see SO [help pages](StackOverflow.com/help) for more info. – SherylHohman Oct 30 '20 at 17:14
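
As the comments note, there does not appear to be a built-in alphanumeric sanitize filter. One possible way to stay within `filter_var()` is `FILTER_CALLBACK`, which passes the value through any callable. This is a sketch under that assumption, not the method this answer had in mind; the sample input is made up.

```php
<?php
// FILTER_CALLBACK hands the value to the callable given in 'options'
// and returns whatever the callable returns.
$clean = filter_var('a.b/c%d 42', FILTER_CALLBACK, [
    'options' => function (string $value): string {
        return preg_replace('/[^a-zA-Z0-9]+/', '', $value) ?? '';
    },
]);
echo $clean, "\n"; // abcd42

// ctype_alnum() (ctype extension) validates rather than sanitizes:
// it reports whether every character is an ASCII letter or digit.
var_dump(ctype_alnum($clean)); // bool(true)
```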