Stripping everything but alphanumeric chars from a string in Python

Question

What is the best way to strip all non-alphanumeric characters from a string in Python?

The solutions presented in the PHP variant of this question will probably work with some minor adjustments, but don't seem very 'pythonic' to me.

For the record, I don't just want to strip periods and commas (and other punctuation), but also quotes, brackets, etc.

Do you care about international alphanumeric chars, like 'æøå', 'مرحبا', 'สวัสดี', 'こんにちは' ? — Pimin Konstantin Kefaloukos, Nov 01 '14 at 08:32
@PiminKonstantinKefaloukos Yes I do care about the international chars, hence my comment on the accepted answer to use re.UNICODE. — Mark van Lent, Nov 05 '14 at 14:03

score 444 · Accepted Answer · edited Jul 04 '19 at 14:26

I just timed some functions out of curiosity. In these tests I'm removing non-alphanumeric characters from the string string.printable (part of the built-in string module). The use of compiled '[\W_]+' and pattern.sub('', str) was found to be fastest.

$ python -m timeit -s \
     "import string" \
     "''.join(ch for ch in string.printable if ch.isalnum())" 
10000 loops, best of 3: 57.6 usec per loop

$ python -m timeit -s \
    "import string" \
    "filter(str.isalnum, string.printable)"                 
10000 loops, best of 3: 37.9 usec per loop

$ python -m timeit -s \
    "import re, string" \
    "re.sub('[\W_]', '', string.printable)"
10000 loops, best of 3: 27.5 usec per loop

$ python -m timeit -s \
    "import re, string" \
    "re.sub('[\W_]+', '', string.printable)"                
100000 loops, best of 3: 15 usec per loop

$ python -m timeit -s \
    "import re, string; pattern = re.compile('[\W_]+')" \
    "pattern.sub('', string.printable)" 
100000 loops, best of 3: 11.2 usec per loop
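
For Python 3, the fastest variant measured above might be packaged roughly as follows. This is only a sketch: the function name `strip_non_alnum` and the `ascii_only` flag are illustrative and not part of the original answer. The raw-string prefix avoids the invalid-escape warning raised in the comments.

```python
import re

# Compile once and reuse. The raw string keeps \W from being read as a string escape.
_NON_ALNUM = re.compile(r'[\W_]+')                   # Unicode-aware (Python 3 default for str)
_NON_ALNUM_ASCII = re.compile(r'[\W_]+', re.ASCII)   # only a-z, A-Z, 0-9 count as alphanumeric


def strip_non_alnum(text, ascii_only=False):
    """Return text with every run of non-alphanumeric characters removed."""
    pattern = _NON_ALNUM_ASCII if ascii_only else _NON_ALNUM
    return pattern.sub('', text)


print(strip_non_alnum("Hello, World! (x_1)"))       # HelloWorldx1
print(strip_non_alnum("æø-123", ascii_only=True))   # 123
```

Whether the Unicode-aware or the ASCII-only pattern is appropriate depends on whether international letters should survive, which the question's author says they should.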

answered Aug 14 '09 at 10:03 by Otto Allmendinger · edited Jul 04 '19 at 14:26 by borgr

Very interesting results: I would have expected the regular expressions to be slower. Interestingly, I tried this with one other option (`valid_characters = string.ascii_letters + string.digits` followed by `join(ch for ch in string.printable if ch in valid_characters)`) and it was 6 microseconds quicker than the `isalnum()` option. Still much slower than the regexp though. – DrAl Aug 14 '09 at 10:19
+1, measuring time is good! (but in the penultimate, do `pattern.sub('', string.printable)` instead -- silly to call re.sub when you have a RE object!-). – Alex Martelli Aug 14 '09 at 15:05
For the record: use `re.compile('[\W_]+', re.UNICODE)` to make it unicode safe. – Mark van Lent Aug 24 '09 at 14:01
how do you do it without removing the white space? – maudulus Jul 30 '14 at 19:47
do it without removing the white space: re.sub('[\W_]+', ' ', sentence, flags=re.UNICODE) – PALEN Apr 26 '17 at 00:55
The second solution with `filter` returns a filter object, not a string. One also has to join it. – physicalattraction Dec 20 '18 at 14:34
Shouldn't you use raw string literals? Or escape the `\`. `\W` is an erroneous escape character. – Dan M. May 06 '19 at 00:28
It may just be me, but I think the argument would have been stronger and more obvious if you'd run 100,000 loops for all options. Readers without a keen eye, or like me viewing this on a weekend and not paying full attention, can easily miss the massive difference between the options. – AppHandwerker Sep 04 '21 at 14:39
A bit of a flawed test, since `string.printable` groups all the punctuation together giving the regular expression an unfair advantage. – Mark Ransom Oct 13 '22 at 16:57
When I used a randomly shuffled version of `string.printable` the first result with `join` remained almost unchanged from 13.0 to 13.5, while the last one using `re` went from 3.7 to 10.3. – Mark Ransom Oct 13 '22 at 17:11
@MarkvanLent: Note that this advice is not needed in Python 3 (where passing `re.UNICODE` is a no-op, and you pass `re.ASCII` to explicitly restrict the classes to match ASCII versions only). – ShadowRanger Dec 07 '22 at 16:49
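
A small sketch of the whitespace question raised in the comments above. The sample sentence and the second pattern are assumptions added for illustration. The first call is the variant from the comment, which collapses every run of stripped characters (including existing whitespace) into a single space. The second deletes only characters that are neither word characters nor whitespace, so the original spacing is left untouched.

```python
import re

sentence = "Hello,   world! It's_2024."

# Variant from the comment: each run of non-alphanumerics becomes one space.
collapsed = re.sub(r'[\W_]+', ' ', sentence)

# Alternative (assumed reading of "keeping whitespace"): remove anything that is
# neither a word character nor whitespace, plus the underscore.
preserved = re.sub(r'[^\w\s]|_', '', sentence)

print(repr(collapsed))   # 'Hello world It s 2024 '
print(repr(preserved))   # 'Hello   world Its2024'
```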

score 372 · Answer 2 · edited Jan 01 '18 at 23:44

Regular expressions to the rescue:

import re
re.sub(r'\W+', '', your_string)

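In Python, \W is equivalent to [^a-zA-Z0-9_] for ASCII-only matching (Python 2 byte strings, or Python 3 with the re.ASCII flag): it matches everything except letters, digits, and the underscore, so underscores are kept. For Python 3 str patterns, \W is Unicode-aware by default, so non-ASCII letters and digits are kept as well.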
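
To make the difference between these character classes concrete, here is a short Python 3 comparison. The sample text is an assumption chosen to contain an underscore plus non-ASCII letters.

```python
import re

text = "Straße, пыньк_42!"

# \W is Unicode-aware for str patterns in Python 3 and never matches the underscore.
print(re.sub(r'\W+', '', text))                   # Straßeпыньк_42

# Adding _ to the class strips the underscore as well.
print(re.sub(r'[\W_]+', '', text))                # Straßeпыньк42

# An explicit ASCII class throws away every non-ASCII letter.
print(re.sub(r'[^a-zA-Z0-9]+', '', text))         # Strae42

# re.ASCII makes \W behave like [^a-zA-Z0-9_].
print(re.sub(r'\W+', '', text, flags=re.ASCII))   # Strae_42
```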

answered Aug 14 '09 at 08:57 by Ants Aasma · edited Jan 01 '18 at 23:44 by user1767754

What does the plus sign do in the regexp? (I know what it means, just curious as to why it's needed for the re.sub.) – Mark van Lent Aug 14 '09 at 09:03
@Mark: I imagine it would speed up the substitution as the replace will get rid of all non-word characters in a block in one go, rather than removing them one-by-one. – DrAl Aug 14 '09 at 09:07
Yeah, I benched that while tuning some performance critical code a while ago. If there are significant spans of characters to replace the speedup is huge. – Ants Aasma Aug 14 '09 at 09:25
It might not be relevant in this case, but `\W` will keep underscores as well. – Blixt Aug 14 '09 at 16:20
Following @Blixt tip, if you only want letters and numbers you can do re.sub(r'[^a-zA-Z0-9]','', your_string) – Nigini Oct 24 '12 at 22:02
@Nigini Doing that you will throw out a whole host of valid letters. – André C. Andersen Jun 26 '17 at 09:17
@AndréChristofferAndersen which valid letters specifically? – 3pitt Dec 14 '17 at 21:48
@AndréC.Andersen - Which letters would [^a-zA-Z0-9] throw out in this question ? – MasterJoe Apr 27 '20 at 23:00
Non-English letters. Or non-ASCII characters, if you prefer. – Jose Francisco Lopez Pimentel Jun 29 '21 at 16:21
Actually Jose is right. \W != [^a-zA-Z0-9_]: re.compile(r'\W').sub('', 'пыньк!') will give 'пыньк' but re.compile(r'[^a-zA-Z0-9_]').sub('', 'пыньк!') will give '' – Jay Random Jul 12 '23 at 08:40

score 81 · Answer 3 · edited Dec 29 '21 at 14:42

Use the str.translate() method.

Presuming you will be doing this often:

Once, create a string containing all the characters you wish to delete:

delchars = ''.join(c for c in map(chr, range(256)) if not c.isalnum())

Whenever you want to scrunch a string:

scrunched = s.translate(None, delchars)

The setup cost probably compares favourably with re.compile; the marginal cost is way lower:

C:\junk>\python26\python -mtimeit -s"import string;d=''.join(c for c in map(chr,range(256)) if not c.isalnum());s=string.printable" "s.translate(None,d)"
100000 loops, best of 3: 2.04 usec per loop

C:\junk>\python26\python -mtimeit -s"import re,string;s=string.printable;r=re.compile(r'[\W_]+')" "r.sub('',s)"
100000 loops, best of 3: 7.34 usec per loop

Note: Using string.printable as benchmark data gives the pattern '[\W_]+' an unfair advantage; all the non-alphanumeric characters are in one bunch ... in typical data there would be more than one substitution to do:

C:\junk>\python26\python -c "import string; s = string.printable; print len(s),repr(s)"
100 '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

Here's what happens if you give re.sub a bit more work to do:

C:\junk>\python26\python -mtimeit -s"d=''.join(c for c in map(chr,range(256)) if not c.isalnum());s='foo-'*25" "s.translate(None,d)"
1000000 loops, best of 3: 1.97 usec per loop

C:\junk>\python26\python -mtimeit -s"import re;s='foo-'*25;r=re.compile(r'[\W_]+')" "r.sub('',s)"
10000 loops, best of 3: 26.4 usec per loop

answered Aug 15 '09 at 00:33 by John Machin · edited Dec 29 '21 at 14:42 by Neuron

Using translate is indeed quite a bit faster. Even after adding a for loop right before the substitution/translation (to make the setup costs weigh in less), the translation is still roughly 17 times faster than the regexp on my machine. Good to know. – Mark van Lent Aug 18 '09 at 13:58
This is definitely the most pythonic solution. – codygman Sep 07 '12 at 02:43
This almost convinces me, but I would suggest using `string.punctuation` instead of `''.join(c for c in map(chr, range(256)) if not c.isalnum())` – ArnauOrriols Mar 14 '15 at 15:49
Note that this works for `str` objects but not `unicode` objects. – Yavar Oct 24 '15 at 05:07
@John Machin Is that essentially a list comprehension that's being passed as an argument to `.join()` ? – AdjunctProfessorFalcon Jul 28 '16 at 03:36
@Malvin9000 more or less – John Machin Feb 08 '17 at 10:36
Thanks John. Maybe you could add it in the answer? The original question is tagged `python`. – IanS Feb 08 '17 at 10:51
I have words like صبح بخ (Urdu) in a sentence; how can I strip those? – Sunil Garg Feb 26 '21 at 08:04
Needs to be updated for python3! – jtlz2 Oct 27 '21 at 17:41
@jtlz2 I adapted this answer for Python 3 and made it a new answer here: https://stackoverflow.com/a/70310018/5906389 – jslatane Dec 10 '21 at 20:25

score 62 · Answer 4 · answered Aug 14 '09 at 09:02

You could try:

print(''.join(ch for ch in some_string if ch.isalnum()))
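
Note that `str.isalnum()` is Unicode-aware in Python 3, so international letters survive. A brief sketch of both behaviours follows. The function names and the sample string are illustrative assumptions, and `str.isascii()` requires Python 3.7 or later.

```python
def keep_alnum(text):
    """Keep every character that Unicode classifies as a letter or digit."""
    return ''.join(ch for ch in text if ch.isalnum())


def keep_ascii_alnum(text):
    """Keep only ASCII letters and digits."""
    return ''.join(ch for ch in text if ch.isascii() and ch.isalnum())


sample = "smørrebrød & æbler: 42"
print(keep_alnum(sample))         # smørrebrødæbler42
print(keep_ascii_alnum(sample))   # smrrebrdbler42
```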

answered Aug 14 '09 at 09:02 by ars

lovely, beauty of python's simplicity! – sandeepsign Aug 27 '22 at 00:52

score 17 · Answer 5 · answered Aug 14 '09 at 09:01

>>> import re
>>> string = "Kl13@£$%[};'\""
>>> pattern = re.compile(r'\W')
>>> string = pattern.sub('', string)
>>> print(string)
Kl13

answered Aug 14 '09 at 09:01 by DisplacedAussie

I loved your answer, but it removes the Arabic chars too. Can you tell me how to keep them? – Charif DZ Jan 06 '17 at 19:25

score 16 · Answer 6 · edited Aug 14 '09 at 09:05

How about:

def ExtractAlphanumeric(InputString):
    from string import ascii_letters, digits
    return "".join([ch for ch in InputString if ch in (ascii_letters + digits)])

This works by using list comprehension to produce a list of the characters in InputString if they are present in the combined ascii_letters and digits strings. It then joins the list together into a string.

answered Aug 14 '09 at 08:58 by DrAl · edited Aug 14 '09 at 09:05

It seems that string.ascii_letters only contains letters (duh) and not numbers. I also need the numbers... – Mark van Lent Aug 14 '09 at 09:06
Adding string.digits would indeed solve the problem I just mentioned. :) – Mark van Lent Aug 14 '09 at 09:08
Yes, I realised that when I went back to read your question. Note to self: learn to read! – DrAl Aug 14 '09 at 09:21

score 9 · Answer 7 · answered Mar 03 '22 at 16:07

I checked the results with perfplot (a project of mine) and found that for short strings,

"".join(filter(str.isalnum, s))

is fastest. For long strings (200+ chars)

re.sub(r"[\W_]", "", s)

is fastest.

Code to reproduce the plot:

import perfplot
import random
import re
import string

pattern = re.compile(r"[\W_]+")


def setup(n):
    return "".join(random.choices(string.ascii_letters + string.digits, k=n))


def string_alphanum(s):
    return "".join(ch for ch in s if ch.isalnum())


def filter_str(s):
    return "".join(filter(str.isalnum, s))


def re_sub1(s):
    return re.sub(r"[\W_]", "", s)


def re_sub2(s):
    return re.sub(r"[\W_]+", "", s)


def re_sub3(s):
    return pattern.sub("", s)


b = perfplot.bench(
    setup=setup,
    kernels=[string_alphanum, filter_str, re_sub1, re_sub2, re_sub3],
    n_range=[2**k for k in range(10)],
)
b.save("out.png")
b.show()

I used this code, and it lets non-ASCII alphanumeric characters, such as Russian text, pass through. – Ezequiel Adrian Sep 02 '23 at 03:22

score 8 · Answer 8 · answered Dec 04 '18 at 15:27

sent = "".join(e for e in sent if e.isalpha())

answered Dec 04 '18 at 15:27 by Tom Kalvijn

I'll try to explain: `e for e in sent` iterates over every character of the string, and `if e.isalpha()` keeps only the alphabetic ones. `"".join()` then concatenates the kept characters with an empty separator and the result is assigned back to `sent`, so all non-alphabetic characters (digits included) are dropped. – Sysanin Sep 14 '19 at 11:53
since this is doing a loop per character rather than relying on C regex, isn't this extremely slow? – dcsan Dec 31 '19 at 10:00
I would prefer `e.isalnum()` – Vishal Kumar Sahu Feb 11 '21 at 15:17

score 7 · Answer 9 · answered Mar 13 '17 at 12:54

As a spin off from some other answers here, I offer a really simple and flexible way to define a set of characters that you want to limit a string's content to. In this case, I'm allowing alphanumerics PLUS dash and underscore. Just add or remove characters from my PERMITTED_CHARS as suits your use case.

PERMITTED_CHARS = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_-" 
someString = "".join(c for c in someString if c in PERMITTED_CHARS)
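
The whitelist idea can also be built from the `string` module constants, as suggested in the comments on this answer. The sketch below is one possible arrangement: the helper name `keep_permitted` and the use of a `frozenset` are assumptions, not part of the original answer.

```python
import string

SPECIAL_CHARS = '_-'  # extra characters to allow; adjust to suit your use case
# A frozenset gives constant-time membership tests; a plain string also works.
PERMITTED_CHARS = frozenset(string.digits + string.ascii_letters + SPECIAL_CHARS)


def keep_permitted(text, permitted=PERMITTED_CHARS):
    """Return text with every character outside `permitted` removed."""
    return ''.join(c for c in text if c in permitted)


print(keep_permitted("my file-name_v2 (final).txt"))  # myfile-name_v2finaltxt
```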

answered Mar 13 '17 at 12:54 by BuvinJ

Instead of hardcoding the permitted characters, which is prone to subtle errors, use `string.digits + string.ascii_letters + '_-'`. – Reti43 Oct 25 '17 at 22:43
Your suggestion is not wrong, but it also doesn't save many characters of "typing" if that's your goal. If you copy my post, you also won't have a typo! The real point of my answer, however, is to allow an explicit, open-ended and simple means to define exactly which characters you want to allow. – BuvinJ Oct 25 '17 at 22:54
As a middle ground, you might combine these suggestions into `SPECIAL_CHARS = '_-'` and then use `string.digits + string.ascii_letters + SPECIAL_CHARS` – BuvinJ Oct 25 '17 at 22:56
It was a suggestion in terms of what is reasonable, unless we're doing code golf. "Walking" around the keyboard to type up 52 alphabet letters in order takes considerably longer than importing a package to use an object or two. And that doesn't include the time to double check you've typed it all up correctly. It's about good practices, that's all. – Reti43 Oct 25 '17 at 23:10
I hear you! My real point here is extreme flexibility, in case you want to get more specific with your character set. – BuvinJ Oct 25 '17 at 23:19

score 7 · Answer 10 · answered Jan 02 '19 at 00:12

Timing with random strings of ASCII printables:

from inspect import getsource
from random import sample
import re
from string import printable
from timeit import timeit

pattern_single = re.compile(r'[\W]')
pattern_repeat = re.compile(r'[\W]+')
translation_tb = str.maketrans('', '', ''.join(c for c in map(chr, range(256)) if not c.isalnum()))


def generate_test_string(length):
    return ''.join(sample(printable, length))


def main():
    for i in range(0, 60, 10):
        for test in [
            lambda: ''.join(c for c in generate_test_string(i) if c.isalnum()),
            lambda: ''.join(filter(str.isalnum, generate_test_string(i))),
            lambda: re.sub(r'[\W]', '', generate_test_string(i)),
            lambda: re.sub(r'[\W]+', '', generate_test_string(i)),
            lambda: pattern_single.sub('', generate_test_string(i)),
            lambda: pattern_repeat.sub('', generate_test_string(i)),
            lambda: generate_test_string(i).translate(translation_tb),

        ]:
            print(timeit(test), i, getsource(test).lstrip('            lambda: ').rstrip(',\n'), sep='\t')


if __name__ == '__main__':
    main()

Result (Python 3.7):

       Time       Length                           Code                           
6.3716264850008880  00  ''.join(c for c in generate_test_string(i) if c.isalnum())
5.7285426190064750  00  ''.join(filter(str.isalnum, generate_test_string(i)))
8.1875841680011940  00  re.sub(r'[\W]', '', generate_test_string(i))
8.0002205439959650  00  re.sub(r'[\W]+', '', generate_test_string(i))
5.5290945199958510  00  pattern_single.sub('', generate_test_string(i))
5.4417179649972240  00  pattern_repeat.sub('', generate_test_string(i))
4.6772285089973590  00  generate_test_string(i).translate(translation_tb)
23.574712151996210  10  ''.join(c for c in generate_test_string(i) if c.isalnum())
22.829975890002970  10  ''.join(filter(str.isalnum, generate_test_string(i)))
27.210196289997840  10  re.sub(r'[\W]', '', generate_test_string(i))
27.203713296003116  10  re.sub(r'[\W]+', '', generate_test_string(i))
24.008979928999906  10  pattern_single.sub('', generate_test_string(i))
23.945240008994006  10  pattern_repeat.sub('', generate_test_string(i))
21.830899796994345  10  generate_test_string(i).translate(translation_tb)
38.731336012999236  20  ''.join(c for c in generate_test_string(i) if c.isalnum())
37.942474347000825  20  ''.join(filter(str.isalnum, generate_test_string(i)))
42.169366310001350  20  re.sub(r'[\W]', '', generate_test_string(i))
41.933375883003464  20  re.sub(r'[\W]+', '', generate_test_string(i))
38.899814646996674  20  pattern_single.sub('', generate_test_string(i))
38.636144253003295  20  pattern_repeat.sub('', generate_test_string(i))
36.201238164998360  20  generate_test_string(i).translate(translation_tb)
49.377356811004574  30  ''.join(c for c in generate_test_string(i) if c.isalnum())
48.408927293996385  30  ''.join(filter(str.isalnum, generate_test_string(i)))
53.901889764994850  30  re.sub(r'[\W]', '', generate_test_string(i))
52.130339455994545  30  re.sub(r'[\W]+', '', generate_test_string(i))
50.061149017004940  30  pattern_single.sub('', generate_test_string(i))
49.366573111998150  30  pattern_repeat.sub('', generate_test_string(i))
46.649754120997386  30  generate_test_string(i).translate(translation_tb)
63.107938601999194  40  ''.join(c for c in generate_test_string(i) if c.isalnum())
65.116287978999030  40  ''.join(filter(str.isalnum, generate_test_string(i)))
71.477421126997800  40  re.sub(r'[\W]', '', generate_test_string(i))
66.027950693998720  40  re.sub(r'[\W]+', '', generate_test_string(i))
63.315361931003280  40  pattern_single.sub('', generate_test_string(i))
62.342320287003530  40  pattern_repeat.sub('', generate_test_string(i))
58.249303059004890  40  generate_test_string(i).translate(translation_tb)
73.810345625002810  50  ''.join(c for c in generate_test_string(i) if c.isalnum())
72.593953348005020  50  ''.join(filter(str.isalnum, generate_test_string(i)))
76.048324580995540  50  re.sub(r'[\W]', '', generate_test_string(i))
75.106637657001560  50  re.sub(r'[\W]+', '', generate_test_string(i))
74.681338128997600  50  pattern_single.sub('', generate_test_string(i))
72.430461594005460  50  pattern_repeat.sub('', generate_test_string(i))
69.394243567003290  50  generate_test_string(i).translate(translation_tb)

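str.maketrans & str.translate is fastest, but the table built here only covers code points below 256, so every character above that range passes through unfiltered. re.compile & pattern.sub is slower, but is somehow faster than ''.join & filter. Note that each timed call also generates a fresh random test string, so the figures include that overhead.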
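
For comparison, here is a sketch of a similar benchmark that generates the input once, outside the timed calls, and checks that the candidates agree before timing them. The seed, the 1,000-character sample size, and the number of repetitions are arbitrary assumptions. No timings are quoted because they vary by machine.

```python
import random
import re
import string
from timeit import timeit

# Assumption: fixed seed and sample size, chosen only for repeatability.
random.seed(0)
test_string = ''.join(random.choices(string.printable, k=1000))

pattern = re.compile(r'[\W_]+')
# Delete every non-alphanumeric character below code point 256.
table = str.maketrans('', '', ''.join(c for c in map(chr, range(256)) if not c.isalnum()))

candidates = {
    'join + isalnum': lambda: ''.join(c for c in test_string if c.isalnum()),
    'join + filter': lambda: ''.join(filter(str.isalnum, test_string)),
    'compiled regex': lambda: pattern.sub('', test_string),
    'translate': lambda: test_string.translate(table),
}

# On ASCII-only input all candidates should return the same string.
results = {name: func() for name, func in candidates.items()}
assert len(set(results.values())) == 1

for name, func in candidates.items():
    print(f'{name:15s} {timeit(func, number=1000):.4f} s per 1000 calls')
```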

score 5 · Answer 11 · answered Jul 08 '21 at 23:01

For a simple one-liner (Python 3.0):

''.join(filter( lambda x: x in '0123456789abcdefghijklmnopqrstuvwxyz', the_string_you_want_stripped ))

For Python < 3.0:

filter( lambda x: x in '0123456789abcdefghijklmnopqrstuvwxyz', the_string_you_want_stripped )

Note: the allowed list above contains only digits and lowercase letters, so uppercase letters are stripped too. You can add other characters to the allowed list if desired (e.g. '0123456789abcdefghijklmnopqrstuvwxyz.,_').

jslatane · Answer 12 · 2022-01-24T16:40:35.857

Python 3

Uses the same method as @John Machin's answer but updated for Python 3:

larger character set
slight changes to how translate works.

Python code is now assumed to be encoded in UTF-8
(source: PEP 3120)

This means the string containing all the characters you wish to delete gets much larger:

    
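import sys

# sys.maxunicode is 1114111 (0x10FFFF); the + 1 makes the range include that last code point.
del_chars = ''.join(c for c in map(chr, range(sys.maxunicode + 1)) if not c.isalnum())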

And the translate method now needs to consume a translation table which we can create with maketrans():

    
del_map = str.maketrans('', '', del_chars)

Now, as before, any string s you want to "scrunch":

    
scrunched = s.translate(del_map)

Using the last timing example from @John Machin, we can see it still beats re by an order of magnitude:

    
> python -mtimeit -s"d=''.join(c for c in map(chr,range(1114111)) if not c.isalnum());m=str.maketrans('','',d);s='foo-'*25" "s.translate(m)"
1000000 loops, best of 5: 255 nsec per loop

> python -mtimeit -s"import re;s='foo-'*25;r=re.compile(r'[\W_]+')" "r.sub('',s)"
50000 loops, best of 5: 4.8 usec per loop
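
Building the full deletion string means classifying over a million code points up front. One possible alternative, sketched here as an assumption rather than as part of this answer, is a table that classifies code points lazily. `str.translate` accepts any object that supports indexing by code point, and a `dict` subclass with `__missing__` satisfies that. The class name `LazyAlnumTable` is hypothetical.

```python
class LazyAlnumTable(dict):
    """Hypothetical helper: a translation table that fills itself in on demand."""

    def __missing__(self, codepoint):
        # str.translate looks characters up by code point. Returning None deletes
        # the character; returning the code point keeps it unchanged.
        result = codepoint if chr(codepoint).isalnum() else None
        self[codepoint] = result  # cache so each code point is classified once
        return result


del_map = LazyAlnumTable()
print('foo-bar_baz 42!'.translate(del_map))  # foobarbaz42
```

The trade-off is a Python-level call the first time each distinct character is seen, in exchange for negligible setup cost. Whether it is faster overall depends on the data and has not been measured here.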

score 3 · Answer 13 · edited Nov 08 '18 at 07:08

for char in my_string:
    if not char.isalnum():
        my_string = my_string.replace(char,"")

answered Oct 27 '18 at 06:36 by Junior Ogun · edited Nov 08 '18 at 07:08 by Mark van Lent

score 2 · Answer 14 · answered Jan 07 '22 at 11:44

A simple solution because all answers here are complicated

filtered = ''
for c in unfiltered:
    if str.isalnum(c):
        filtered += c
    
print(filtered)

answered Jan 07 '22 at 11:44 by Ahmed Tremo

score 0 · Answer 15 · answered Jun 02 '22 at 14:32

If you'd like to preserve characters like áéíóúãẽĩõũ, for example, use this (note that the \d in the character class also removes digits, so only letters are kept):

import re
re.sub(r'[\W\d_]+', '', your_string)
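
A quick Python 3 check of what the extra `\d` changes. The sample text is an assumption for illustration.

```python
import re

text = "São Paulo, 2022!"

letters_and_digits = re.sub(r'[\W_]+', '', text)
letters_only = re.sub(r'[\W\d_]+', '', text)

print(letters_and_digits)  # SãoPaulo2022
print(letters_only)        # SãoPaulo
```

In Python 3 both patterns keep accented letters. The `\d` only decides whether digits are removed as well.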

answered Jun 02 '22 at 14:32 by Diogo de Toledo

score -3 · Answer 16 · answered Apr 11 '20 at 16:36

If I understood correctly, the easiest way is to use a regular expression, as it gives you a lot of flexibility. Another simple method is a for loop; the code below shows an example. I also started counting the occurrences of each word in a dictionary (that part is commented out below).

s = """An... essay is, generally, a piece of writing that gives the author's own 
argument — but the definition is vague, 
overlapping with those of a paper, an article, a pamphlet, and a short story. Essays 
have traditionally been 
sub-classified as formal and informal. Formal essays are characterized by "serious 
purpose, dignity, logical 
organization, length," whereas the informal essay is characterized by "the personal 
element (self-revelation, 
individual tastes and experiences, confidential manner), humor, graceful style, 
rambling structure, unconventionality 
or novelty of theme," etc.[1]"""

d = {}  # create an empty dict
words = s.split()  # split the string and store the words in a list
for word in words:
    new_word = ''
    for c in word:
        if c.isalnum():  # check whether the individual character is alphanumeric
            new_word = new_word + c
    print(new_word, end=' ')
    # if new_word not in d:
    #     d[new_word] = 1
    # else:
    #     d[new_word] = d[new_word] +1
print(d)
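
The commented-out counting step could be done with `collections.Counter`. This is a sketch with a shortened sample text, not the author's code, and the variable names are illustrative.

```python
from collections import Counter

text = 'An... essay is, generally, a piece of writing. An essay is "vague".'

# Strip non-alphanumeric characters from each whitespace-separated word,
# then skip any word that ends up empty.
cleaned = (''.join(c for c in word if c.isalnum()) for word in text.split())
counts = Counter(word for word in cleaned if word)

print(counts.most_common(3))
```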

please rate this if this answer is useful!

I think those who voted this down were being a little harsh. It would have been better for them to point out how slow this would be, and that it isn't the most performant option. @otto-allmendinger's answer gives more insight into that. – AppHandwerker Sep 04 '21 at 14:41
