I need to match a string of words such as "Miguel Tiago", but I do not want strings that contain digits. For example, I do not want strings such as "Miguel10 Tiago". Also, any of the strings may contain Unicode characters.

I can't simply do:

re.match(ur'^[a-zA-Z ]+',string,re.UNICODE)

because in that case words containing non-ASCII letters such as 'ç' won't be matched.

How can I use the rule [\w ]+ and exclude the digits?

Miguel

1 Answer

I think the solution you are looking for is: `(([^\W\d]+(?![\w\d])\s?)+)`
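To make the character class less cryptic, here is a minimal sketch in Python 3 (an assumption on my part; the question uses Python 2 syntax). In Python 3, `str` patterns are Unicode-aware by default, so no `re.UNICODE` flag is needed. The names `NON_DIGIT_WORD_CHAR`, `ANSWER_PATTERN` and `matched_prefix` are made up for illustration.

```python
import re

# [^\W\d] is a double negative: "not a non-word character and not a digit",
# i.e. a word character that is not a decimal digit (letters, plus "_").
NON_DIGIT_WORD_CHAR = re.compile(r"[^\W\d]")

# The pattern from this answer, unchanged.
ANSWER_PATTERN = re.compile(r"(([^\W\d]+(?![\w\d])\s?)+)")


def matched_prefix(text):
    """Return what the pattern matches at the start of `text`, or None."""
    m = ANSWER_PATTERN.match(text)
    return m.group() if m else None


print(bool(NON_DIGIT_WORD_CHAR.match("ç")))  # accented letter
print(bool(NON_DIGIT_WORD_CHAR.match("7")))  # decimal digit
print(matched_prefix("Miguel Tiago"))
print(matched_prefix("Miguel10 Tiago"))
```

Note that `re.match` only anchors at the start of the string, so this checks a prefix rather than the whole input.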

Roars
  • Thanks, and sorry, but I need to edit the question to be more clear. – Miguel Nov 14 '17 at 11:40
  • OK, I edited the question. I also added a detail I forgot to include: the need to recognize more than one word. – Miguel Nov 14 '17 at 11:45
  • Surely in this case you could just do something with the `\D` (not digit) character class. Just to be clear: you don't want it to include symbols, just Unicode letters? – Roars Nov 14 '17 at 11:46
  • I believe this should work: `([^\W\d]+(?![\w\d]))`. Let me know and I will write it up properly with an explanation. This for multiple words: `(([^\W\d]+(?![\w\d]))\s)*` – Roars Nov 14 '17 at 12:01
  • Yes, I think that's it. The only thing now is that the multi-word version is not working: `re.match(ur'(([^\W\d]+(?![\w\d]))\s)*',u'Miguel Tiago',re.UNICODE).group()` gives `Miguel ` – Miguel Nov 14 '17 at 12:05
  • Try `(([^\W\d]+(?![\w\d])\s?)+)`. I realised that you need to specify that there may or may not be a space after each word. – Roars Nov 14 '17 at 12:40
  • I've realised there is a flaw in this: if digits appear within a word, it won't work as expected. – Roars Nov 14 '17 at 12:49
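
One way the flaw shows up: because the pattern is not anchored at the end, `re.match` on `Miguel Tiago10` still succeeds with the partial match `Miguel `, and `re.search` on `Mig10uel` finds `uel`. A possible workaround, sketched below in Python 3, is to require the whole string to match. The pattern `NAME_PATTERN` and the helper `is_name` are hypothetical and not from the thread. The sketch assumes words are separated by exactly one space, and it uses `[^\W\d_]` so that underscores are rejected too. Some non-decimal numeric characters (superscripts, for instance) may still count as word characters, so treat it as an approximation of "letters only".

```python
import re

# Hypothetical pattern, not from the thread: one or more digit-free words
# separated by single spaces. [^\W\d_] is a word character that is neither
# a decimal digit nor an underscore.
NAME_PATTERN = re.compile(r"[^\W\d_]+(?: [^\W\d_]+)*")


def is_name(text):
    """Return True only if the entire string consists of digit-free words."""
    return NAME_PATTERN.fullmatch(text) is not None


candidates = ["Miguel Tiago", "João Conceição", "Miguel10 Tiago",
              "Miguel Tiago10", "Mig10uel"]
for candidate in candidates:
    print(candidate, is_name(candidate))
```

`re.fullmatch` exists only in Python 3.4 and later; on Python 2 the same idea would need explicit `^` and `$` anchors plus the `re.UNICODE` flag.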