
My goal is to write a function that takes a text and replaces every character except Latin letters (A-Z, a-z) with whitespace, and also deletes every word that contains a digit. It then collapses each run of whitespace into a single space.

Example:

' hello, world! ho1hoho2ho, merry xmas!! ho1ho1 :))' -> 'hello world merry xmas'. 

The Python function that implements this:

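import re

def clean_text(text):
    # Spell out both ranges: [A-z] would also match [ \ ] ^ _ and `.
    text_valid = re.sub(r'[^A-Za-z0-9]', ' ', text)
    # Drop any word that contains a digit, then rejoin with single spaces.
    return ' '.join(word for word in text_valid.split()
                    if not re.search(r'\d', word))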
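
A note on character ranges, since it affects the first step: a class written as `[A-z]` is not the same as `[A-Za-z]`, because six ASCII punctuation characters sit between `Z` and `a`. A minimal sketch that checks this (the variable names are only illustrative):

```python
import re

# The ASCII characters strictly between 'Z' and 'a'.
extras = [chr(c) for c in range(ord('Z') + 1, ord('a'))]

# Which of them does each character class accept?
leaked = [ch for ch in extras if re.fullmatch(r'[A-z]', ch)]
strict = [ch for ch in extras if re.fullmatch(r'[A-Za-z]', ch)]

print(leaked)  # ['[', '\\', ']', '^', '_', '`']
print(strict)  # []
```

So with `[^A-z0-9]`, a token such as `foo_bar` would survive as a single word instead of being split at the underscore.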

Now I wonder whether a single regular expression could do this, so that I could just write something like:

return ' '.join(re.findall(enter_my_magical_regex_here, text))

Or is there another way to replace the code above with something faster (and, hopefully, shorter)?

Ramil

2 Answers


You may use:

' '.join(re.sub('([^A-Za-z0-9 ]|[^ ]*[0-9][^ ]*)', '', text).split())
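
One difference worth checking against the question's function: the `re.sub` pattern deletes punctuation rather than replacing it with a space, so `foo,bar` becomes `foobar` instead of `foo bar`. The sketch below is one possible way to keep the two-step semantics with a single `re.findall`, using lookarounds to accept only runs of letters that are not adjacent to another letter or digit. The names `LETTER_WORD`, `clean_text_findall` and `clean_text_sub` are hypothetical and not from the original posts.

```python
import re

# A run of letters that is neither preceded nor followed by a letter or digit,
# i.e. a whole alphanumeric token that contains no digits.
LETTER_WORD = re.compile(r'(?<![A-Za-z0-9])[A-Za-z]+(?![A-Za-z0-9])')

def clean_text_findall(text):
    # Single-regex variant (assumption: ASCII letters only, as in the question).
    return ' '.join(LETTER_WORD.findall(text))

def clean_text_sub(text):
    # The re.sub one-liner from this answer, wrapped for comparison.
    return ' '.join(re.sub('([^A-Za-z0-9 ]|[^ ]*[0-9][^ ]*)', '', text).split())

sample = ' hello, world! ho1hoho2ho, merry xmas!! ho1ho1 :))'
print(clean_text_findall(sample))     # hello world merry xmas
print(clean_text_sub(sample))         # hello world merry xmas
print(clean_text_findall('foo,bar'))  # foo bar
print(clean_text_sub('foo,bar'))      # foobar
```

Both variants agree on the example from the question; they differ only when punctuation separates two words without a space.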
Shanavas M

This will get you your desired output:

import re
x = ' hello, world! ho1hoho2ho, merry xmas!! ho1ho1 :))'
# Drop tokens containing a digit or one of the listed symbols, then strip '!' and ','.
re.sub('[!,]', '', ' '.join([i for i in x.split() if not re.findall(r'[\d+:?"<>*/|]', i)]))

But you might have to tweak the character classes for other inputs, since only the punctuation listed in the two patterns is handled.
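
To illustrate the kind of tweak that may be needed, here is a hedged sketch. `clean_listed` wraps the expression above, and `clean_general` is a hypothetical generalisation (not part of the original answer) that keeps the same split-then-filter approach but strips every non-letter instead of a fixed list. Like the expression above, it drops whole whitespace-separated tokens that contain a digit.

```python
import re

def clean_listed(x):
    # The expression from this answer: only the listed symbols are handled.
    kept = [i for i in x.split() if not re.findall(r'[\d+:?"<>*/|]', i)]
    return re.sub('[!,]', '', ' '.join(kept))

def clean_general(x):
    # Hypothetical variant: drop tokens containing a digit, then turn every
    # run of non-letters into a space and collapse the whitespace.
    kept = (tok for tok in x.split() if not re.search(r'\d', tok))
    return ' '.join(re.sub(r'[^A-Za-z]+', ' ', ' '.join(kept)).split())

sample = ' hello, world! ho1hoho2ho, merry xmas!! ho1ho1 :))'
print(clean_listed(sample))             # hello world merry xmas
print(clean_general(sample))            # hello world merry xmas
print(clean_listed('hi; there (ok)'))   # hi; there (ok)
print(clean_general('hi; there (ok)'))  # hi there ok
```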

Sushant