Python, remove all non-alphabet chars from string

Question

I am writing a Python MapReduce word count program. The problem is that there are many non-alphabet chars strewn about in the data. I found the post "Stripping everything but alphanumeric chars from a string in Python", which shows a nice solution using regex, but I am not sure how to implement it.

def mapfn(k, v):
    print v
    import re, string 
    pattern = re.compile('[\W_]+')
    v = pattern.match(v)
    print v
    for w in v.split():
        yield w, 1

I'm afraid I am not sure how to use the library re or even regex for that matter. I am not sure how to apply the regex pattern to the incoming string (line of a book) v properly to retrieve the new line without any non-alphanumeric chars.

Suggestions?

`v` is an entire line of a book (specifically Moby Dick). I am going word by word, not char by char, so some words might have a "," at the end and "indignity," does not match "indignity". — KDecker, Mar 20 '14 at 00:34
Possible duplicate of [Stripping everything but alphanumeric chars from a string in Python](http://stackoverflow.com/questions/1276764/stripping-everything-but-alphanumeric-chars-from-a-string-in-python) — sds, Nov 03 '16 at 20:23
Lolx - did you get the same pre-interview home exercise as me? Find the 50 most used words in Moby Dick and report their frequency. I did it in C++, IIRC — Mawg says reinstate Monica, Feb 13 '17 at 15:48
@Mawg It was an exercise in my undergrad "Cloud Computing" class. — KDecker, Feb 13 '17 at 17:46

score 174 · Accepted Answer · edited Dec 18 '19 at 02:42

Use re.sub

import re

regex = re.compile('[^a-zA-Z]')
#First parameter is the replacement, second parameter is your input string
regex.sub('', 'ab3d*E')
#Out: 'abdE'

Alternatively, if you only want to remove a certain set of characters (as an apostrophe might be okay in your input...)

regex = re.compile(r'[,.!?]') #etc.
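
Applied to the mapper from the question, this could look roughly like the sketch below. It assumes that whitespace should be kept so that split() still finds word boundaries, and that words should be lower-cased before counting; the name NON_LETTERS is made up for illustration.

```python
import re

# Assumption: keep ASCII letters and whitespace, drop everything else.
NON_LETTERS = re.compile(r'[^a-zA-Z\s]')

def mapfn(k, v):
    # v is one line of text; strip punctuation and digits, then normalise case.
    cleaned = NON_LETTERS.sub('', v).lower()
    # split() with no arguments also discards the trailing newline.
    for w in cleaned.split():
        yield w, 1
```

With this, "indignity," and "Indignity" are both emitted as the key "indignity".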

answered Mar 20 '14 at 00:36 by limasxgoesto0 · edited Dec 18 '19 at 02:42 by Jon-Eric
Hmm, I can't quite track it down, but what about the pattern to remove all non-alphanumerics excluding spaces? – KDecker Mar 20 '14 at 00:45
Just add a space to your character class, i.e. ```^a-zA-Z ``` instead of just ```^a-zA-Z``` – limasxgoesto0 Mar 20 '14 at 00:46
Unless you're also worried about newlines, in which case ```^a-zA-Z \n```. I'm trying to find a regex that would lump both into one, but using ```\w``` or ```\W``` isn't giving me the desired behavior. You might just need to add ```\n``` if that's the case. – limasxgoesto0 Mar 20 '14 at 00:51
Ahh, the newline char. That's where my issue lies; I was comparing my results to the given results and I was still off. I think that's my issue! Thanks // Hmm, I tried it with the newline char, same results; I think there is another one I am missing.. // Duhhh... Upper and lower case... // Thanks for all the help, works perfectly now! – KDecker Mar 20 '14 at 00:54

score 72 · Answer 2 · answered Mar 30 '15 at 15:54

If you prefer not to use regex, you might try

''.join([i for i in s if i.isalpha()])
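
As a small illustration of how this behaves (the sample string is made up): str.isalpha() is Unicode-aware, so accented letters and CJK characters are kept, and in Python 3 a bare filter() call gives a lazy iterator that still has to be joined.

```python
s = "naïve café, 日本語 123!"

# In Python 3, filter() returns an iterator, not a string,
# so the kept characters are joined back together explicitly.
letters_only = ''.join(filter(str.isalpha, s))

# Equivalent generator-expression form of the one-liner above.
same_result = ''.join(ch for ch in s if ch.isalpha())

print(letters_only)
```

Spaces, punctuation and digits are dropped; the accented and Japanese characters survive.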

answered Mar 30 '15 at 15:54 by Tad

How do I join this? With ''.join? Printing s gets only a filter object – PirateApp Apr 22 '18 at 11:41
Wow, this is what I was looking for. This takes into account kanji, hiragana, katakana, etc. Kudos – eroot163pi Apr 15 '20 at 08:04

score 43 · Answer 3 · answered Jan 05 '15 at 05:16

Try:

s = ''.join(filter(str.isalnum, s))

This will take every char from the string, keep only alphanumeric ones and build a string back from them.

answered Jan 05 '15 at 05:16 by Don · edited Feb 28 '20 at 08:01

This is good, because it can handle strange characters like Å Å Ö – Markus Kaukonen Sep 24 '21 at 12:01
If someone does not want to keep numbers, use `isalpha` instead of `isalnum`. And if you want to keep spaces you can do `''.join(filter(lambda x: x.isalpha() or x.isspace(), s))` – raquelhortab Jul 15 '22 at 14:44
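
For a word count, the letters-plus-whitespace variant from the last comment could be wrapped up as in this sketch; the function name and the sample line are hypothetical.

```python
def keep_letters_and_spaces(s):
    """Drop every character that is neither a letter nor whitespace."""
    return ''.join(ch for ch in s if ch.isalpha() or ch.isspace())

line = "some indignity, Å Ö 42 times!"
# Whitespace is preserved, so split() still separates the words.
words = keep_letters_and_spaces(line).split()
print(words)
```

Because the digits are removed but the surrounding spaces are kept, "42" simply disappears from the word list.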

score 40 · Answer 4 · answered Mar 20 '14 at 00:43

You can use the re.sub() function to remove these characters:

>>> import re
>>> re.sub("[^a-zA-Z]+", "", "ABC12abc345def")
'ABCabcdef'

re.sub(MATCH PATTERN, REPLACE STRING, STRING TO SEARCH)

"[^a-zA-Z]+" - look for any run of one or more characters that are NOT in a-zA-Z.
"" - Replace the matched characters with ""
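
One thing that is easy to mix up is the argument order: the module-level re.sub() takes the pattern first, while the sub() method of a compiled pattern starts with the replacement. A minimal sketch showing that both forms give the same result:

```python
import re

text = "ABC12abc345def"

# Module-level function: pattern, replacement, string to search.
via_function = re.sub(r"[^a-zA-Z]+", "", text)

# Compiled pattern: the pattern is already bound, so only
# the replacement and the string to search are passed.
pattern = re.compile(r"[^a-zA-Z]+")
via_method = pattern.sub("", text)

print(via_function, via_method)
```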

answered Mar 20 '14 at 00:43 by Kevin

Note that this will also remove accented letters: ãâàáéèçõ, etc. – Brad Ahrens Jun 15 '20 at 09:20

score 9 · Answer 5 · answered Apr 22 '18 at 11:49

In the benchmark below, the precompiled regex is the fastest method.

import timeit

#Try with regex first
t0 = timeit.timeit("""
s = r2.sub('', st)

""", setup = """
import re
r2 = re.compile(r'[^a-zA-Z0-9]', re.MULTILINE)
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
""", number = 1000000)
print(t0)

#Try with join method on filter
t0 = timeit.timeit("""
s = ''.join(filter(str.isalnum, st))

""", setup = """
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
""",
number = 1000000)
print(t0)

#Try with only join
t0 = timeit.timeit("""
s = ''.join(c for c in st if c.isalnum())

""", setup = """
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
""", number = 1000000)
print(t0)


Results (total seconds for 1,000,000 runs each):
2.6002226710006653 Method 1 Regex
5.739747313000407 Method 2 Filter + Join
6.540099570000166 Method 3 Join

Wiktor Stribiżew · Answer 6 · 2021-10-14T08:19:56.733

It is advisable to use the PyPI regex module if you plan to match specific Unicode property classes. This library has also proven to be more stable, especially when handling large texts, and yields consistent results across various Python versions. All you need to do is keep it up to date.

If you install it (using pip install regex or pip3 install regex), you may use

import regex
print(regex.sub(r'\P{L}+', '', 'ABCŁąć1-2!Абв3§4“5def”'))
# => ABCŁąćАбвdef

to remove all chunks of 1 or more characters other than Unicode letters from text. See an online Python demo. You may also use "".join(regex.findall(r'\p{L}+', 'ABCŁąć1-2!Абв3§4“5def”')) to get the same result.

In Python re, in order to match any Unicode letter, one may use the [^\W\d_] construct (Match any unicode letter?).

So, to remove all non-letter characters, you may either match all letters and join the results:

result = "".join(re.findall(r'[^\W\d_]', text))

Or, remove all chars matching the [\W\d_] pattern (opposite to [^\W\d_]):

result = re.sub(r'[\W\d_]+', '', text)

See the regex demo online. However, you may get inconsistent results across various Python versions because the Unicode standard is evolving, and the set of chars matched with \w will depend on the Python version. Using the PyPI regex library is highly recommended to get consistent results.
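
Put together as a runnable sketch with the built-in re module only, using the same sample text as above (the variable names are illustrative):

```python
import re

text = 'ABCŁąć1-2!Абв3§4“5def”'

# Remove every run of non-word characters, digits and underscores.
by_sub = re.sub(r'[\W\d_]+', '', text)

# Or collect the letters one by one and join them again.
by_findall = ''.join(re.findall(r'[^\W\d_]', text))

print(by_sub)
print(by_sub == by_findall)
```

Both approaches keep the Latin, Polish and Cyrillic letters and drop the digits, dashes and quotation marks.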

score 1 · Answer 7 · answered Dec 14 '22 at 00:45

Here's yet another function; it replaces every character that is not an ASCII letter, whitespace, or a period with a space:

import re
remove_non_english = lambda s: re.sub(r'[^a-zA-Z\s\n\.]', ' ', s)

Usage:

remove_non_english('a€bñcá`` something. 2323')
> 'a b c    something.     '
