I am tokenizing a string into words and then want to remove any word which contains a number.

tokens = ['hello', 'world', '12', '1-3', '23']

As you can see, the numbers come in various forms. The above three are just examples. I can loop through the string items and see if there is a digit and remove that string. However, that doesn't seem right.

The isdigit() function doesn't work on such number-strings: it returns True only when every character is a digit, so a token like '1-3' slips through. How can I achieve this?

Goal: any token that contains a digit should be removed. My current code looks something like this, which doesn't handle the types above:

relevant_tokens = [token for token in tokens if not token.isdigit()]
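To make the gap concrete, here is a minimal sketch that runs the filter above on the sample list (with the stray quote in the last item fixed). It only illustrates the behaviour of `str.isdigit()`:

```python
tokens = ['hello', 'world', '12', '1-3', '23']

# str.isdigit() is True only when every character in the string is a digit,
# so '12' and '23' are dropped but '1-3' is kept.
relevant_tokens = [token for token in tokens if not token.isdigit()]

print(relevant_tokens)  # ['hello', 'world', '1-3']
```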
utengr
  • 6
    [`relevant_tokens = [token for token in tokens if not any(c.isdigit() for c in token)]`](https://ideone.com/WYIxED)? – Wiktor Stribiżew Oct 16 '17 at 10:37
  • This can help you : https://stackoverflow.com/q/30141233/5596800 – xssChauhan Oct 16 '17 at 10:37
  • import re; result = [token for token in tokens if len(re.findall("\d+", token))==0] – Kinght 金 Oct 16 '17 at 10:43
  • @WiktorStribiżew that works and I mentioned that approach in the question when I said: "I can loop through the string item". However, it makes my filter statement too complex. I was more looking for a single function. – utengr Oct 16 '17 at 10:50
  • 1
    Ok, the first thread linked actually contains the right regex solution, `re.search(r'\d', inputString)`. Do not use the `re.match('.*\d+', token)` solution below, it will cause unnecessary backtracking and slow down. – Wiktor Stribiżew Oct 16 '17 at 10:51
  • I'll use that option. – utengr Oct 16 '17 at 10:59
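The `re.search(r'\d', ...)` suggestion from the comments can be wrapped in a single function, which keeps the filter expression short. This is only a sketch: the names `_DIGIT` and `contains_digit` are made up for illustration and do not come from the thread.

```python
import re

# Hypothetical helper names, chosen for this example only.
_DIGIT = re.compile(r'\d')  # matches a single digit anywhere in the string


def contains_digit(token):
    """Return True if the token has at least one digit in it."""
    # search() scans the whole string and stops at the first digit found
    return _DIGIT.search(token) is not None


tokens = ['hello', 'world', '12', '1-3', '23']
relevant_tokens = [token for token in tokens if not contains_digit(token)]

print(relevant_tokens)  # ['hello', 'world']
```

Unlike `re.match('.*\d+', token)`, the pattern has no leading `.*`, so there is nothing to backtrack over.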

1 Answer

import re
# Keep only the tokens that contain no digit (raw string avoids an invalid-escape warning)
tokens = [token for token in tokens if not re.match(r'.*\d+', token)]
MohitC