Strip all non-numeric characters (except for ".") from a string in Python

Question

I've got a pretty good working snippet of code, but I was wondering if anyone has better suggestions on how to do this:

val = ''.join([c for c in val if c in '1234567890.'])

What would you do?

Because I was swept here by a web search I just wanted to add that people must not forget to add `-` for their own code if negative numbers can occur. — Christian, Feb 11 '13 at 20:17
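A minimal sketch of the point in the comment above, assuming the minus sign should simply be added to the set of kept characters. The function name `strip_non_numeric` and the `allow_minus` parameter are made up for illustration and are not from the question.

```python
import re

def strip_non_numeric(text, allow_minus=True):
    """Remove everything except ASCII digits, '.', and optionally '-'."""
    # Hypothetical helper: the hyphen is escaped so it is read as a literal
    # character rather than as a range inside the character class.
    pattern = r'[^0-9.\-]+' if allow_minus else r'[^0-9.]+'
    return re.sub(pattern, '', text)

print(strip_non_numeric('temp: -12.5C'))                     # -12.5
print(strip_non_numeric('temp: -12.5C', allow_minus=False))  # 12.5
```

Note that this keeps every hyphen, wherever it appears, so an input such as 'a-b-3' would come out as '--3'. Whether that matters depends on your data.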

score 128 · Accepted Answer · Miles · edited Mar 18 '17 at 01:38

You can use a regular expression (via the re module) to accomplish the same thing. The example below matches runs of [^\d.] (any character that is not a decimal digit or a period) and replaces them with the empty string. Note that if the pattern is compiled with the UNICODE flag (the default for str patterns in Python 3), the resulting string could still include non-ASCII digits. Also, the result after removing "non-numeric" characters is not necessarily a valid number.

>>> import re
>>> non_decimal = re.compile(r'[^\d.]+')
>>> non_decimal.sub('', '12.34fe4e')
'12.344'
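To illustrate the note about non-ASCII digits, here is a small sketch for Python 3, where `\d` in a str pattern matches any Unicode decimal digit unless `re.ASCII` is passed. The sample string and variable names are invented for the example.

```python
import re

# Default str pattern in Python 3: \d matches any Unicode decimal digit.
unicode_pattern = re.compile(r'[^\d.]+')
# With re.ASCII, \d is restricted to the characters 0-9.
ascii_pattern = re.compile(r'[^\d.]+', re.ASCII)

sample = 'abc12.5\u0663xyz'  # \u0663 is ARABIC-INDIC DIGIT THREE

kept_unicode = unicode_pattern.sub('', sample)  # the Arabic-Indic digit survives
kept_ascii = ascii_pattern.sub('', sample)      # only ASCII digits and '.' survive

print(ascii(kept_unicode))  # '12.5\u0663'
print(ascii(kept_ascii))    # '12.5'
```

If you only ever want 0-9, writing the class as `[^0-9.]` should have the same effect without needing the flag.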

edited Mar 18 '17 at 01:38

answered Jun 03 '09 at 23:14

the reg-ex would I guess be faster! – g06lin Jun 03 '09 at 23:27
+1 for including the quantifier. Note that you don't need to compile the pattern in this case; Python caches recently-used patterns. Instead, just use `re.sub(r'[^\d.]+', '', '12.34fe4e')` – Ben Blank Jun 03 '09 at 23:39
Python does cache recently used patterns (the last 100, if memory serves), but I like the compile here, simply because you can refer to the pattern by a reasonable name instead of mentally decoding the regex every time you read the code. – Kenan Banks Jun 04 '09 at 16:52
Code breaks for 355.fhfg55.ty55g – Pranav Waila Mar 17 '17 at 10:22
@PranavWaila What do you expect the result to be for that? – Miles Mar 18 '17 at 03:11
If it's a decimal number, only one decimal point is possible. – Pranav Waila Mar 19 '17 at 19:55
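One possible reading of the requirement in the comments above is to keep the first decimal point and drop any later ones. The thread does not settle what the expected output is, so treat this as a sketch under that assumption; `to_decimal_string` is a hypothetical name.

```python
import re

def to_decimal_string(text):
    """Strip non-numeric characters, then keep only the first '.'."""
    cleaned = re.sub(r'[^0-9.]+', '', text)
    # partition() splits at the first '.', so any remaining dots are in `tail`.
    head, dot, tail = cleaned.partition('.')
    return head + dot + tail.replace('.', '')

print(to_decimal_string('355.fhfg55.ty55g'))  # 355.5555
print(to_decimal_string('12.34fe4e'))         # 12.344
```

The result is still only a string; an input with no digits at all would give '' or '.', so validate before passing it to float().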

score 20 · Answer 2 · answered Jun 04 '09 at 06:24

Another 'pythonic' approach

filter( lambda x: x in '0123456789.', s )

but regex is faster.
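The one-liner above returns a string on Python 2. On Python 3, filter() returns an iterator instead, so a rough equivalent needs an explicit join. The name `keep_numeric` is only for illustration.

```python
def keep_numeric(s):
    """Python 3 variant of the filter() approach: join the filtered characters."""
    return ''.join(filter(lambda ch: ch in '0123456789.', s))

print(keep_numeric('12.34fe4e'))  # 12.344
```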

maxp

score 16 · Answer 3 · edited Jul 20 '18 at 13:01

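A simple solution is to use regular expressions. (Inside a character class, only a leading ^ negates; a second ^ would be treated as a literal caret and kept in the output.)

import re
re.sub("[^0-9.]", "", data)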

edited Jul 20 '18 at 13:01 by Steven C. Howell

answered Feb 22 '16 at 11:34 by Midhun Mohan

why isnt this the answer? – eljusticiero67 Feb 15 '23 at 15:52

score 15 · Answer 4 · edited Mar 22 '13 at 11:04

Here's some sample code:

$ cat a.py
a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
for i in xrange(1000000):
    ''.join([c for c in a if c in '1234567890.'])

$ cat b.py
import re

non_decimal = re.compile(r'[^\d.]+')

a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
for i in xrange(1000000):
    non_decimal.sub('', a)

$ cat c.py
a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
for i in xrange(1000000):
    ''.join([c for c in a if c.isdigit() or c == '.'])

$ cat d.py
a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
for i in xrange(1000000):
    b = []
    for c in a:
        if not (c.isdigit() or c == '.'): continue  # skip non-numeric characters
        b.append(c)

    ''.join(b)

And the timing results:

$ time python a.py
real    0m24.735s
user    0m21.049s
sys     0m0.456s

$ time python b.py
real    0m10.775s
user    0m9.817s
sys     0m0.236s

$ time python c.py
real    0m38.255s
user    0m32.718s
sys     0m0.724s

$ time python d.py
real    0m46.040s
user    0m41.515s
sys     0m0.832s

Looks like the regex is the winner so far.

Personally, I find the regex just as readable as the list comprehension. If you're doing it just a few times then you'll probably take a bigger hit on compiling the regex. Do what jives with your code and coding style.

You can do these microbenchmarks a little more easily (and accurately) using the timeit module. For example: $ python -m timeit -s "import re; non_decimal = re.compile(r'[^\d.]+'); a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'" "non_decimal.sub('', a)" — Miles, Jun 03 '09 at 23:52
That gets me 10.7 us. 10.775s for 1e6 loops is close enough to 10.7 us. :) — Colin Burnett, Jun 04 '09 at 00:41
+1 for taking context into consideration "Do what jives with your code and coding style" — adam, Jun 04 '09 at 13:48
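The timeit suggestion in the comments can also be done from inside a script. This is a sketch for Python 3; the function names and the iteration count are arbitrary choices, and the numbers it prints will vary by machine.

```python
import re
import timeit

NON_DECIMAL = re.compile(r'[^\d.]+')
SAMPLE = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'

def with_regex(s=SAMPLE):
    # Same approach as b.py
    return NON_DECIMAL.sub('', s)

def with_comprehension(s=SAMPLE):
    # Same approach as a.py
    return ''.join([c for c in s if c in '1234567890.'])

if __name__ == '__main__':
    for func in (with_regex, with_comprehension):
        # timeit.timeit calls func() `number` times and returns total seconds.
        seconds = timeit.timeit(func, number=10_000)
        print(f'{func.__name__}: {seconds:.4f} s for 10,000 calls')
```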

score 3 · Answer 5 · answered Jan 03 '12 at 21:15

import string
filter(lambda c: c in string.digits + '.', s)

Josh Bothun

score 2 · Answer 6 · answered Jun 04 '09 at 16:49

If the set of characters were larger, using sets as below might be faster. As it is, this is a bit slower than a.py.

dec = set('1234567890.')

a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
for i in xrange(1000000):
    ''.join(ch for ch in a if ch in dec)

At least on my system, you can save a tiny bit of time (and memory if your string were long enough to matter) by using a generator expression instead of a list comprehension in a.py:

a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
for i in xrange(1000000):
    ''.join(c for c in a if c in '1234567890.')

Oh, and here's the fastest way I've found by far on this test string (much faster than regex) if you are doing this many, many times and are willing to put up with the overhead of building a couple of character tables.

chrs = ''.join(chr(i) for i in xrange(256))
deletable = ''.join(ch for ch in chrs if ch not in '1234567890.')

a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
for i in xrange(1000000):
    a.translate(chrs, deletable)

On my system, that runs in ~1.0 seconds where the regex b.py runs in ~4.3 seconds.
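The two-argument translate(table, deletechars) call above is the Python 2 str API. On Python 3, str.translate takes a single table, but bytes.translate still accepts a set of bytes to delete. A possible port, assuming the input is ASCII (or that non-ASCII characters may simply be dropped), could look like this. The helper names are invented, and I have not benchmarked it against the regex.

```python
KEEP = b'0123456789.'
# Every byte value that is not a digit or '.', built once up front.
DELETE = bytes(b for b in range(256) if b not in KEEP)

def strip_non_numeric_bytes(data):
    """Delete every byte listed in DELETE; None means 'no remapping table'."""
    return data.translate(None, DELETE)

def strip_non_numeric(text):
    """str wrapper: non-ASCII characters are discarded during encoding."""
    encoded = text.encode('ascii', 'ignore')
    return strip_non_numeric_bytes(encoded).decode('ascii')

print(strip_non_numeric('12.34fe4e'))  # 12.344
```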
