Regex for removing all characters except A-z and deleting all words containing digits

Question

My goal is to write a function that takes a text, replaces every character outside the Latin alphabet (A-z) with a space, and deletes every word that contains a digit. It should then collapse runs of whitespace into a single space.

Example:

' hello, world! ho1hoho2ho, merry xmas!! ho1ho1 :))' -> 'hello world merry xmas'.

The Python function that implements this:

import re

def clean_text(text):
    text_valid = re.sub(u'[^A-z0-9]', ' ', text)
    return ' '.join(word for word in text_valid.split()
                    if not re.search(r'\d', word))

Is there a single regular expression that does this, so that I could just write something like:

return ' '.join(re.findall(enter_my_magical_regex_here))

Or is there another way to replace the code above with something faster (and, hopefully, shorter)?

FYI: [`[A-z]` also matches some non-letter chars.](https://stackoverflow.com/questions/29771901/why-is-this-regex-allowing-a-caret/29771926#29771926) — Wiktor Stribiżew, Aug 14 '18 at 12:27
That character set should be `[^0-9A-Za-z]` to match everything but ASCII digits and letters. (`\W` would let underscores through.) — Kevin J. Chase, Aug 14 '18 at 12:28
Do not use `[A-я]` to match Russian letters, you need to use `[А-Яа-яёЁ]`. English ones can be matched with `[a-zA-Z]`. — Wiktor Stribiżew, Aug 14 '18 at 12:34
If you want to match only letters in the current locale, use a predefined character class for that, don't try to write one yourself. — Charles Duffy, Aug 14 '18 at 12:36
Try `re.sub(ur'^\s*|\s*$|\s*(?:[^\W\d_]*\d[^\W\d_]*|[^A-Za-z\d\s]+)', '', text, flags=re.U)` — Wiktor Stribiżew, Aug 14 '18 at 12:42
...so, for instance, `[[:alpha:]]` will understand the Russian alphabet when you're currently in a locale where that's in your character collation order. — Charles Duffy, Aug 15 '18 at 03:30
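The character-class points raised in the comments can be checked directly. The snippet below is a minimal sketch, assuming plain ASCII input and Python 3; the variable names are only illustrative.

```python
import re

# ASCII code points 91-96 sit between 'Z' (90) and 'a' (97).
between = ''.join(chr(c) for c in range(ord('Z') + 1, ord('a')))
print(between)  # [\]^_`

# [A-z] spans that gap, so it accepts all six punctuation characters.
wide = re.findall(r'[A-z]', between)
# [A-Za-z] lists the two letter ranges separately and accepts none of them.
narrow = re.findall(r'[A-Za-z]', between)
print(wide, narrow)

# Effect on the substitution step from the question:
kept_by_wide = re.sub(r'[^A-z0-9]', ' ', 'a_b^c')       # 'a_b^c', nothing replaced
kept_by_narrow = re.sub(r'[^A-Za-z0-9]', ' ', 'a_b^c')  # 'a b c'
kept_by_W = re.sub(r'\W', ' ', 'a_b^c')                 # 'a_b c', underscore survives
print(kept_by_wide, kept_by_narrow, kept_by_W, sep=' | ')
```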

Shanavas M · Accepted Answer · 2018-08-14T12:47:02.580

2

You may use

' '.join(re.sub('([^A-Za-z0-9 ]|[^ ]*[0-9][^ ]*)', '', text).split())
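For the `re.findall` form the question asks about, one possible sketch is a pattern that matches only maximal alphanumeric runs made entirely of letters. This is an assumption-laden alternative rather than part of the answer above: the names `LETTER_RUN`, `clean_text_findall` and `clean_text_sub` are hypothetical, and only ASCII letters are considered.

```python
import re

# A run of letters that is neither preceded nor followed by a letter or digit,
# i.e. a whole alphanumeric "word" that contains no digit.
LETTER_RUN = re.compile(r'(?<![A-Za-z0-9])[A-Za-z]+(?![A-Za-z0-9])')

def clean_text_findall(text):
    return ' '.join(LETTER_RUN.findall(text))

def clean_text_sub(text):
    # The one-liner from the answer above, wrapped in a function for comparison.
    return ' '.join(re.sub('([^A-Za-z0-9 ]|[^ ]*[0-9][^ ]*)', '', text).split())

sample = ' hello, world! ho1hoho2ho, merry xmas!! ho1ho1 :))'
print(clean_text_findall(sample))  # hello world merry xmas
print(clean_text_sub(sample))      # hello world merry xmas

# The two differ when punctuation separates words without a space:
print(clean_text_findall('foo,bar'))  # foo bar
print(clean_text_sub('foo,bar'))      # foobar
```

The difference in the last two lines arises because the `re.sub` version deletes punctuation instead of replacing it with a space, so which behaviour is preferable depends on the input.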

edited Aug 14 '18 at 12:47

answered Aug 14 '18 at 12:45


What's the use of `u`? – Sushant Aug 14 '18 at 12:46

score 1 · Answer 2 · answered Aug 14 '18 at 12:40

This produces your desired output:

import re
x = ' hello, world! ho1hoho2ho, merry xmas!! ho1ho1 :))'
# Drop tokens containing a digit or one of the listed symbols, then strip '!' and ','.
re.sub('[!,]', '', ' '.join(i for i in x.split() if not re.search(r'[\d+:?"<>*/|]', i)))

But you might have to tweak the character lists here and there for other inputs.
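As an illustration of the kind of tweak that may be needed: the approach above strips only `!` and `,`, so other punctuation survives. The sketch below wraps it in a hypothetical helper, `clean_tokens`, and shows one possible stricter variant, `clean_tokens_strict`, which is an assumption and not part of the original answer.

```python
import re

def clean_tokens(text):
    # Same filter-then-strip approach as above.
    kept = (w for w in text.split() if not re.search(r'[\d+:?"<>*/|]', w))
    return re.sub('[!,]', '', ' '.join(kept))

def clean_tokens_strict(text):
    # Drop tokens containing a digit, then delete every non-letter
    # from the remaining tokens and discard tokens that end up empty.
    kept = (w for w in text.split() if not re.search(r'\d', w))
    stripped = (re.sub(r'[^A-Za-z]', '', w) for w in kept)
    return ' '.join(w for w in stripped if w)

print(clean_tokens('done. next; 3rd'))         # done. next;
print(clean_tokens_strict('done. next; 3rd'))  # done next
```

Note that the strict variant deletes punctuation rather than replacing it with a space, so a token such as `foo,bar` becomes `foobar`.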

Sushant
