Remove all special characters, punctuation and spaces from string

Question

I need to remove all special characters, punctuation and spaces from a string so that I only have letters and numbers.

score 544 · Accepted Answer · edited Sep 26 '19 at 22:48

This can be done without regex:

>>> string = "Special $#! characters   spaces 888323"
>>> ''.join(e for e in string if e.isalnum())
'Specialcharactersspaces888323'

You can use str.isalnum:

S.isalnum() -> bool

Return True if all characters in S are alphanumeric
and there is at least one character in S, False otherwise.

If you insist on using regex, the other solutions will do fine. However, note that if it can be done without a regular expression, that is the best way to go about it.
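A minimal sketch of the isalnum() approach wrapped in a reusable function. The name keep_alnum and the ascii_only flag are hypothetical, not part of the answer; the flag relies on str.isascii(), which assumes Python 3.7 or newer, and is only there because isalnum() also accepts non-ASCII letters and digits.

```python
def keep_alnum(text: str, ascii_only: bool = False) -> str:
    """Keep only alphanumeric characters (hypothetical helper).

    With ascii_only=True, non-ASCII letters and digits are dropped as well.
    """
    if ascii_only:
        # str.isascii() is assumed to be available (Python 3.7+).
        return "".join(ch for ch in text if ch.isascii() and ch.isalnum())
    return "".join(ch for ch in text if ch.isalnum())


print(keep_alnum("Special $#! characters   spaces 888323"))
print(keep_alnum("Ærøskøbing 42!"))
print(keep_alnum("Ærøskøbing 42!", ascii_only=True))
```

Whether non-ASCII letters should survive depends on the use case, so the flag defaults to the behaviour of the one-liner above.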

edited Sep 26 '19 at 22:48 by wjandrea
answered Apr 30 '11 at 17:47 by user225312

13

What is the reason not using regex as a rule of thumb? – Chris Dutrow Mar 18 '12 at 15:23
@ChrisDutrow regex are slower than python string built-in functions – Diego Navarro Sep 13 '12 at 14:15
This only works when the string is in *unicode*. Otherwise it complains like 'str' object has no attribute 'isalnum' 'isnumeric' and so on. – NeoJi Feb 16 '16 at 01:47
18

@DiegoNavarro except that's not true, I benchmarked both the `isalnum()` and regex versions, and the regex one is 50-75% faster – Francisco Jul 02 '16 at 20:51
1

Tried this in Python3 - it accepts unicode chars so it's useless to me. Try string = "B223323\§§§$3\u445454" as an example. The result? 'B2233233䑔54' – 576i Jan 17 '17 at 15:04
4

Additionally: "For 8-bit strings, this method is locale-dependent."! Thus the regex alternative is strictly better! – Antti Haapala -- Слава Україні Apr 18 '18 at 07:42
This does not preserve the white spaces between the words so not suitable for NLP – Simone Jan 12 '23 at 08:27

score 390 · Answer 2 · edited Sep 26 '19 at 22:51

Here is a regex to match a string of characters that are not letters or numbers:

[^A-Za-z0-9]+

Here is the Python command to do a regex substitution:

re.sub('[^A-Za-z0-9]+', '', mystring)
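A short sketch of the substitution as a function, with the pattern compiled once. The function name strip_non_alnum and the keep_spaces parameter are hypothetical; the second pattern simply adds a space to the character class so that words stay separated.

```python
import re

# Compiled once so repeated calls do not re-parse the pattern.
_NON_ALNUM = re.compile(r"[^A-Za-z0-9]+")
_NON_ALNUM_OR_SPACE = re.compile(r"[^A-Za-z0-9 ]+")


def strip_non_alnum(text: str, keep_spaces: bool = False) -> str:
    """Remove everything except ASCII letters and digits (hypothetical helper)."""
    pattern = _NON_ALNUM_OR_SPACE if keep_spaces else _NON_ALNUM
    return pattern.sub("", text)


print(strip_non_alnum("great place!"))
print(strip_non_alnum("great place!", keep_spaces=True))
```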

edited Sep 26 '19 at 22:51 by wjandrea
answered Apr 30 '11 at 17:46 by Andy White

7

this also removes the spaces between words, "great place" -> "greatplace". How to avoid it? – Reihan_amn Jul 19 '17 at 19:19
27

@Reihan_amn Simply add a space to the regex, so it becomes: `[^A-Za-z0-9 ]+` – sougonde Aug 24 '17 at 09:53
1

@andy-white can you please add the space to the regex in the answer? Space is not a special character... – Ufos Aug 06 '18 at 10:21
7

I guess this doesn't work with modified character in other languages, like **á**, **ö**, **ñ**, etc. Am I right? If so, how would it be the regex for it? – HuLu ViCa Aug 08 '18 at 14:52
10

This doesn't work for Spanish, German, Danish and other languages. – tommy.carstensen Oct 07 '19 at 01:55
For those worried about accented characters, there is a solution that will work in many cases at https://stackoverflow.com/a/18663728/210945 – naught101 Nov 09 '20 at 04:11
3

Just add the special characters of that particular language. For example, for German text the expression re.sub('[^A-Za-z0-9 ,._\'äöüÄÖÜß-]+', '', sample_text) can be used (the hyphen goes last so that it is not read as a range). – EMT Aug 02 '21 at 08:56

score 78 · Answer 3 · answered Aug 07 '14 at 13:26

A shorter way:

import re
cleanString = re.sub(r'\W+', '', string)

If you want to keep a space between words and numbers, pass ' ' instead of '' as the replacement. Note that \W does not match the underscore, so _ is kept.
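A small sketch of how \W behaves on underscores and non-ASCII letters, and one way to tighten it. The two function names are hypothetical. In Python 3, \W on a str pattern is Unicode-aware unless re.ASCII is passed.

```python
import re


def strip_non_word(text: str) -> str:
    # \W skips word characters, which include "_" and Unicode letters such as "ø".
    return re.sub(r"\W+", "", text)


def strip_to_ascii_alnum(text: str) -> str:
    # Adding "_" to the class and passing re.ASCII leaves only [A-Za-z0-9].
    return re.sub(r"[\W_]+", "", text, flags=re.ASCII)


sample = "snake_case ø!"
print(strip_non_word(sample))
print(strip_to_ascii_alnum(sample))
```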

answered Aug 07 '14 at 13:26 by tuxErrante

6

Except that _ is in \w and is a special character in the context of this question. – kkurian Nov 03 '16 at 00:11
Depends on the context - underscore is very useful for filenames and other identifiers, to the point that I don't treat it as a special character but rather a sanitised space.I generally use this method myself. – Echelon Jun 02 '18 at 08:48
6

`r'\W+'` - slightly off topic (and very pedantic) but I suggest a habit that all regex patterns be [raw strings](https://stackoverflow.com/a/13836171/673991) – Bob Stein Jun 07 '18 at 13:41
5

This procedure does not treat underscore(_) as a special character. – Md. Sabbir Ahmed Apr 05 '19 at 18:39
2

A simple change to remove `_` as well: `r"[^A-Za-z0-9]+"` instead of `r"\W+"` – Tomerikoo Dec 02 '20 at 10:41
You're welcome! Most of the points I have come from this answer :) – tuxErrante Apr 14 '21 at 17:50

mbeacom · Answer 4 · 2021-09-18T20:39:16.263

72

TLDR

I timed the provided answers.

import re
re.sub(r'\W+', '', string)

is roughly 3x faster than the accepted isalnum() answer in the timings below, and also faster than the explicit character-class regex.

Caution should be taken when using this option. Some special characters (e.g. ø) may not be stripped using this method.

After seeing this, I was interested in expanding on the provided answers by finding out which executes in the least amount of time, so I went through and checked some of the proposed answers with timeit against two of the example strings:

string1 = 'Special $#! characters spaces 888323'
string2 = 'how much for the maple syrup? $20.99? That s ridiculous!!!'

Example 1

''.join(e for e in string if e.isalnum())

string1 - Result: 10.7061979771
string2 - Result: 7.78372597694

Example 2

import re
re.sub('[^A-Za-z0-9]+', '', string)

string1 - Result: 7.10785102844
string2 - Result: 4.12814903259

Example 3

import re
re.sub(r'\W+', '', string)

string1 - Result: 3.11899876595
string2 - Result: 2.78014397621

Each result above is the lowest value returned by repeat(3, 2000000).

Example 3 can be 3x faster than Example 1.
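The answer does not show its timing harness, so the following is only a sketch of how such a comparison could be set up with timeit. The names CANDIDATES and time_candidates are hypothetical, and the default number of runs is far lower than the 2000000 used above so that it finishes quickly; absolute numbers will differ by machine and Python version.

```python
import re
import timeit

SAMPLE = "Special $#! characters spaces 888323"

# The three approaches compared in this answer.
CANDIDATES = {
    "isalnum": lambda s: "".join(ch for ch in s if ch.isalnum()),
    "char_class": lambda s: re.sub(r"[^A-Za-z0-9]+", "", s),
    "non_word": lambda s: re.sub(r"\W+", "", s),
}


def time_candidates(sample: str = SAMPLE, repeat: int = 3, number: int = 10_000) -> dict:
    """Return the best (lowest) total time per candidate over `repeat` rounds."""
    return {
        name: min(timeit.repeat(lambda: func(sample), repeat=repeat, number=number))
        for name, func in CANDIDATES.items()
    }


for name, seconds in time_candidates().items():
    print(f"{name}: {seconds:.4f}s")
```

Taking the minimum of the repeats is a common way to reduce noise from other processes; it is an assumption here, not necessarily what the original measurements did.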

edited Sep 18 '21 at 20:39
answered Aug 06 '16 at 01:04 by mbeacom

@kkurian If you read the beginning of my answer, this is merely a comparison of the previously proposed solutions above. You might want to comment on the originating answer... http://stackoverflow.com/a/25183802/2560922 – mbeacom Nov 02 '16 at 16:03
Oh, I see where you're going with this. Done! – kkurian Nov 03 '16 at 00:10
1

Must consider Example 3, when dealing with large corpus. – HARSH NILESH PATHAK Apr 12 '19 at 16:44
Valid! Thanks for noting. – mbeacom Apr 12 '19 at 19:08
can you compare my answer `''.join([*filter(str.isalnum, string)])` – Grijesh Chauhan Apr 27 '19 at 11:52
1

Third solution didn't remove the "ø" but the second one did remove it. – MehmedB Sep 25 '19 at 10:30
@TanmeyRawal Have you tried: re.sub('\W+','', string) ? It does remove space. – mbeacom Dec 29 '21 at 22:44
```re.sub('[^A-Za-z0-9]+', '', string)``` is the most efficient and faster than the other solutions. – Kjobber Apr 05 '23 at 13:41
Solution 3 is still faster based on timeit. – mbeacom Apr 06 '23 at 14:08

Grijesh Chauhan · Answer 5 · 2019-04-26T13:18:31.520

32

Python 2.*

I think just filter(str.isalnum, string) works

In [20]: filter(str.isalnum, 'string with special chars like !,#$% etcs.')
Out[20]: 'stringwithspecialcharslikeetcs'

Python 3.*

In Python 3, filter() returns an iterator (instead of a string, unlike above). You have to join it back to get a string:

''.join(filter(str.isalnum, string))

or, to pass a list to join (possibly slightly faster, though I have not measured it):

''.join([*filter(str.isalnum, string)])

Note: unpacking with [*args] is valid from Python 3.5 onward.
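A brief Python 3 sketch of why the join is needed: the object returned by filter() is a one-shot iterator, not a string. The variable names are illustrative only.

```python
text = "string with special chars like !,#$% etcs."

lazy = filter(str.isalnum, text)   # an iterator, not a str
cleaned = "".join(lazy)            # consumes the iterator and builds the string
leftover = "".join(lazy)           # the iterator is now exhausted

print(cleaned)
print(repr(leftover))
```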

edited Apr 26 '19 at 13:18
answered Apr 14 '16 at 09:32 by Grijesh Chauhan

4

@Alexey correct, in Python 3 `map` and `filter` return an iterator instead. Still, in Python 3+ I prefer `''.join(filter(str.isalnum, string))` (or, to pass a list to join, `''.join([*filter(str.isalnum, string)])`) over the accepted answer. – Grijesh Chauhan Mar 15 '18 at 07:26
I'm not certain `''.join(filter(str.isalnum, string))` is an improvement on `filter(str.isalnum, string)`, at least to read. Is this really the Pythreenic (yeah, you can use that) way to do this? – TheProletariat Aug 30 '18 at 04:42
1

@TheProletariat The point is *just `filter(str.isalnum, string)`* do not return string in Python3 as `filter( )` in Python-3 returns iterator rather than argument type unlike Python-2.+ – Grijesh Chauhan Aug 30 '18 at 05:31
1

@GrijeshChauhan, I think you should update your answer to include both your Python2 and Python3 recommendations. – mwfearnley Jan 02 '19 at 11:59

score 27 · Answer 6 · answered May 25 '14 at 09:28

#!/usr/bin/env python3
import re

strs = "how much for the maple syrup? $20.99? That's ridiculous!!!"
print(strs)
# Remove a few specific punctuation characters
nstr = re.sub(r'[?$.!]', '', strs)
print(nstr)
# Remove everything that is not a letter, digit, or space
nestr = re.sub(r'[^a-zA-Z0-9 ]', '', nstr)
print(nestr)

You can add more special characters to the first character class; each match is replaced by '' (nothing), i.e. it is removed.

score 19 · Answer 7 · answered Sep 05 '18 at 10:02

Rather than enumerating explicitly what I don't want, I would use a regex that excludes every character that is not what I want.

For example, if I want only characters from 'a to z' (upper and lower case) and numbers, I would exclude everything else:

import re
s = re.sub(r"[^a-zA-Z0-9]","",s)

This means "substitute every character that is not a digit, or a letter in the range 'a to z' or 'A to Z', with an empty string".

In fact, if you put the special character ^ first inside a character class ([...]), you get the negation of that class.

Extra tip: if you also need to lowercase the result, you can lowercase the input first and make the regex simpler (and possibly faster), since there will no longer be any uppercase characters to match.

import re
s = re.sub(r"[^a-z0-9]","",s.lower())

Vlad Bezden · Answer 8 · 2020-03-17T15:21:30.217

19

string.punctuation contains the following characters:

'!"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~'

You can use the translate and maketrans functions to map punctuation to nothing (i.e. delete it):

import string

'This, is. A test!'.translate(str.maketrans('', '', string.punctuation))

Output:

'This is A test'
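Since the question also asks for spaces to be removed, here is a hedged sketch that extends the same idea by adding string.whitespace to the characters being deleted. The names _DELETE_TABLE and strip_punct_and_space are hypothetical; note that string.punctuation only covers ASCII punctuation.

```python
import string

# Translation table that deletes ASCII punctuation and whitespace.
# Built once, then reused for every call.
_DELETE_TABLE = str.maketrans("", "", string.punctuation + string.whitespace)


def strip_punct_and_space(text: str) -> str:
    """Delete ASCII punctuation and whitespace (hypothetical helper)."""
    return text.translate(_DELETE_TABLE)


print(strip_punct_and_space("This, is. A test!"))
```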

edited Mar 17 '20 at 15:21
answered Mar 17 '20 at 15:14 by Vlad Bezden

score 14 · Answer 9 · edited Jun 15 '18 at 19:00

s = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", s)

edited Jun 15 '18 at 19:00 by Zoe
answered Jun 15 '18 at 12:09 by sneha

score 10 · Answer 10 · answered Apr 30 '11 at 21:07

Assuming you want to use a regex and you want/need Unicode-cognisant 2.x code that is 2to3-ready:

>>> import re
>>> rx = re.compile(u'[\W_]+', re.UNICODE)
>>> data = u''.join(unichr(i) for i in range(256))
>>> rx.sub(u'', data)
u'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb2 [snip] \xfe\xff'
>>>

score 6 · Answer 11 · edited May 18 '12 at 13:28

The most generic approach is to use the 'categories' of the unicodedata table, which classifies every single character. For example, the following code keeps only letters, decimal digits, and space separators based on their category, and replaces everything else with a space:

import unicodedata

# Keep only selected characters, based on the Unicode database
# categorization:
# http://www.sql-und-xml.de/unicode-database/#kategorien

PRINTABLE = {'Lu', 'Ll', 'Nd', 'Zs'}

def filter_non_printable(s):
    # Replace every character outside the allowed categories with a space
    result = []
    for c in s:
        result.append(c if unicodedata.category(c) in PRINTABLE else ' ')
    return ''.join(result)

Look at the URL given above for all related categories. You can of course also filter by the punctuation categories.
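A compact variant of the category idea, for the case in the question where everything except letters and digits should be dropped rather than replaced. The function name keep_letters_and_digits is hypothetical; it keeps any category starting with "L" (all letter categories) plus "Nd" (decimal digits).

```python
import unicodedata


def keep_letters_and_digits(text: str) -> str:
    """Drop every character that is not a Unicode letter or decimal digit."""
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        # "L*" covers Lu, Ll, Lt, Lm and Lo; "Nd" is a decimal digit.
        if category.startswith("L") or category == "Nd":
            kept.append(ch)
    return "".join(kept)


print(keep_letters_and_digits("Ærø: 42 €!"))
```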

What's with the `$` at the end of each line? – John Machin Apr 30 '11 at 20:13
If it's a copy & paste issue, should you fix it then? – Olli Mar 28 '12 at 19:49

score 5 · Answer 12 · answered Jun 27 '20 at 10:00

For other languages like German, Spanish, Danish, French etc. that contain special characters (like the German umlauts ü, ä, ö), simply add these to the regex search string:

Example for German:

re.sub('[^A-Za-zÄÖÜäöüß0-9]+', '', mystring)
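A different technique from the one in this answer, sketched here as an alternative when listing every accented letter is impractical: decompose the text with Unicode NFKD normalization, drop the combining marks by encoding to ASCII, then apply the plain ASCII regex. The function name fold_and_strip is hypothetical. This is lossy: letters without a decomposition, such as ß or ø, are dropped entirely.

```python
import re
import unicodedata


def fold_and_strip(text: str) -> str:
    """Fold accented letters to ASCII, then keep only letters and digits."""
    # NFKD splits "é" into "e" plus a combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" discards everything that is not ASCII,
    # including the combining accents produced above.
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^A-Za-z0-9]+", "", ascii_only)


print(fold_and_strip("Crème brûlée #1"))
print(fold_and_strip("Straße"))
```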

answered Jun 27 '20 at 10:00 by petezurich

score 5 · Answer 13 · answered May 11 '21 at 08:29

The approaches below remove special characters from a string. The first two keep only letters and numbers; the last two remove only the characters listed in special_char_list, so spaces are kept.

import re

sample_str = "Hel&&lo %% Wo$#rl@d"

# using isalnum()
print("".join(k for k in sample_str if k.isalnum()))


# using regex
op2 = re.sub("[^A-Za-z0-9]", "", sample_str)
print(f"op2 = {op2}")


special_char_list = ["$", "@", "#", "&", "%"]

# using list comprehension
op1 = "".join([k for k in sample_str if k not in special_char_list])
print(f"op1 = {op1}")


# using filter with a lambda function
op3 = "".join(filter(lambda x: x not in special_char_list, sample_str))
print(f"op3 = {op3}")

score 4 · Answer 14 · answered Mar 23 '16 at 19:37

Use translate (Python 2 only):

import string

def clean(instr):
    return instr.translate(None, string.punctuation + ' ')

Caveat: This only works on Python 2 byte strings (ASCII); in Python 3, str.translate takes a single table argument.

answered Mar 23 '16 at 19:37 by jjmurre

Version difference? I get `TypeError: translate() takes exactly one argument (2 given)` with py3.4 – matt wilkie Jun 20 '16 at 17:42
It is only working with Python2.7. See [below](https://stackoverflow.com/a/60725180/4820605) answer for using `translate` with Python3. – milembar Feb 04 '21 at 12:57

score 3 · Answer 15 · edited Feb 01 '21 at 21:07

This will remove all non-alphanumeric characters except spaces.

string = "Special $#! characters   spaces 888323"
''.join(e for e in string if (e.isalnum() or e.isspace()))

Output (runs of spaces are kept as they are):

'Special  characters   spaces 888323'
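If the leftover runs of spaces are unwanted, one possible follow-up step is to split on whitespace and re-join with single spaces. The function name below is hypothetical; this is a sketch, not part of the original answer.

```python
def keep_alnum_and_single_spaces(text: str) -> str:
    """Keep letters, digits and whitespace, then collapse whitespace runs."""
    kept = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
    # str.split() with no argument splits on any run of whitespace
    # and drops leading/trailing whitespace.
    return " ".join(kept.split())


print(keep_alnum_and_single_spaces("Special $#! characters   spaces 888323"))
```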

edited Feb 01 '21 at 21:07 by Dharman
answered Feb 01 '21 at 16:57 by Gobinath Viswanathan

It will not remove double spacings – Ezequiel Adrian Sep 02 '23 at 01:45

score 1 · Answer 16 · answered Jul 16 '18 at 11:52

import re

my_string = """Strings are amongst the most popular data types in Python. We can create the strings by enclosing characters in quotes. Python treats single quotes the same as double quotes."""

# Count the word "python" whether or not it is followed by ',' or '.'
text = my_string.lower().split()
for i, word in enumerate(text):
    if word.endswith((".", ",")):
        # Strip the trailing punctuation mark from the word
        text[i] = re.sub(r"^([a-z]+)[.,]$", r"\1", word)
print("The count of Python : ", text.count("python"))

score 0 · Answer 17 · answered Oct 27 '21 at 13:21

Ten years on, I believe this is the best solution. It can remove all special characters, punctuation, and spaces from the string.

from clean_text import clean

string = 'Special $#! characters   spaces 888323'
new = clean(string, lower=False, no_currency_symbols=True, no_punct=True, replace_with_currency_symbol='')
print(new)
# Output ==> 'Special characters spaces 888323'

# You can also remove the spaces if you want:
update = new.replace(' ', '')
print(update)
# Output ==> 'Specialcharactersspaces888323'

score 0 · Answer 18 · answered Apr 06 '22 at 15:02

// JavaScript
function regexFunction(st) {
  const regx = /[^\w\s]/gi; // matches anything that is not a word character [a-zA-Z0-9_] or whitespace
  st = st.replace(regx, ''); // remove every matched character
  st = st.replace(/\s\s+/g, ' '); // collapse runs of whitespace into one space

  return st;
}

console.log(regexFunction('$Hello; # -world--78asdf+-===asdflkj******lkjasdfj67;'));
// Output: Hello world78asdfasdflkjlkjasdfj67

Dsw Wds · Answer 19 · 2016-02-25T08:20:57.247

-4

import re
abc = "askhnl#$%askdjalsdk"
ddd = abc.replace("#$%","")
print (ddd)

and you will see the result:

askhnlaskdjalsdk

edited Feb 25 '16 at 08:20
answered Feb 25 '16 at 08:00 by Dsw Wds

7

wait.... you imported `re` but never used it. Your `replace` criteria only works for this specific string. What if your string is `abc = "askhnl#$%!askdjalsdk"`? I don't think will work on anything other than the `#$%` pattern. Might wanna tweak it – JChao Oct 15 '17 at 05:23