Regex to match capital/special/unicode/vietnamese characters

Question

I work with Vietnamese text and want to find every word that contains at least one uppercase (capital) letter. With the 're' module, my function (temp) does not catch words like "Đà". The other way (temp2) checks one character at a time; it works, but it is slow because I first have to split the sentences into words.

Hence I would like to know whether the "re" module can match all of these special capital letters.

I have two approaches:

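import re
from nltk.tokenize import word_tokenize  # assuming word_tokenize is NLTK's tokenizer

def temp(sentence):
    # [A-Z] and [a-z] cover ASCII letters only
    return re.findall(r'[a-z]*[A-Z]+[a-z]*', sentence)


lis=word_tokenize(sentence)
def temp2(lis):
    proper_noun=[]
    for word in lis:
        for letter in word:
            if letter.isupper():
                proper_noun.append(word)
                break
    return proper_noun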

Input:

'nous avons 2 Đồng et 3 Euro'

Expected output :

['Đồng','Euro']

Thank you!

BTW, are you sure your question is not already answered at https://stackoverflow.com/questions/393843/python-and-regular-expression-with-unicode ? — Paulo Scardine, Jun 26 '18 at 05:02
This is a duplicate of: https://stackoverflow.com/questions/36187349/python-regex-for-unicode-capitalized-words — blhsing, Jun 26 '18 at 05:25
@blhsing No it's not, he only wants to find the capital letters not the word containing those letters. — huseyin39, Jun 26 '18 at 05:28
Whether you're trying to find a word or a letter isn't what the problem is here. The problem here is using regex to identify capital letters in unicode, which is the same as the question I linked to. The rest of the letters in a word aren't posing any issue for you, so they don't count as part of the problem. — blhsing, Jun 26 '18 at 05:36
@blhsing Yes it matters, of course. Issue of speed of execution... If I use the method pointed out by you, I will have to split the sentence. — huseyin39, Jun 26 '18 at 05:43
No. What I mean is that the rest of the regex to match a word is easy because it does not concern a non-ASCII cased letter. Your question is easily answered by the question I linked to if you simply apply the solution there into your `temp` function, by replacing `[A-Z]` in your `r'[a-z]*[A-Z]+[a-z]*'` with one of the two solutions provided by the linked question. A question does not have to be identical in every detail to be a duplicate of another question; it simply has to be the same in the core issue. — blhsing, Jun 26 '18 at 05:49
@blhsing My bad, I did not know what "re.compile" did. So both solutions work. Thanks — huseyin39, Jun 26 '18 at 06:07
@WiktorStribiżew The data with which I work does not have this kind of data but I agree that your solution is more complete. — huseyin39, Jun 26 '18 at 15:56
Possible duplicate of [Python regex for unicode capitalized words](https://stackoverflow.com/questions/36187349/python-regex-for-unicode-capitalized-words) — tripleee, Aug 17 '18 at 09:53

score 5 · Accepted Answer · answered Jun 26 '18 at 05:07

You may use this regex:

\b\S*[AĂÂÁẮẤÀẰẦẢẲẨÃẴẪẠẶẬĐEÊÉẾÈỀẺỂẼỄẸỆIÍÌỈĨỊOÔƠÓỐỚÒỒỜỎỔỞÕỖỠỌỘỢUƯÚỨÙỪỦỬŨỮỤỰYÝỲỶỸỴA-Z]+\S*\b

Regex Demo

As pointed out by Wiktor Stribiżew, you should change '\S' to '[^\W\d_]' so that characters such as '+' are not matched. — huseyin39, Jun 26 '18 at 16:13
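
A minimal sketch of how the pattern above could be applied from Python, with the '\S' to '[^\W\d_]' change from the comment already made. The names VIET_UPPER, CAPITALIZED and find_capitalized are made up for this illustration, and the NFC normalization is an added assumption so that the input uses the same precomposed letters as the character class.

```python
import re
import unicodedata

# Vietnamese uppercase letters from the answer's character class, listed once.
# Normalized to NFC so that every entry is a single precomposed code point.
VIET_UPPER = unicodedata.normalize(
    "NFC",
    "AĂÂÁẮẤÀẰẦẢẲẨÃẴẪẠẶẬĐEÊÉẾÈỀẺỂẼỄẸỆIÍÌỈĨỊOÔƠÓỐỚÒỒỜỎỔỞÕỖỠỌỘỢUƯÚỨÙỪỦỬŨỮỤỰYÝỲỶỸỴ",
)

# [^\W\d_] matches any Unicode letter, so '+' or digits cannot be part of a word.
CAPITALIZED = re.compile(r"\b[^\W\d_]*[" + VIET_UPPER + r"A-Z]+[^\W\d_]*\b")


def find_capitalized(sentence):
    """Return every word that contains at least one listed uppercase letter."""
    # Assumption: normalizing the input to NFC is acceptable for the caller.
    return CAPITALIZED.findall(unicodedata.normalize("NFC", sentence))


print(find_capitalized("nous avons 2 Đồng et 3 Euro"))
# ['Đồng', 'Euro']
```

Compiling the pattern once at module level avoids rebuilding it on every call, which matters when the function runs over many sentences.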

score 1 · Answer 2 · answered Jun 26 '18 at 05:55

The answer of @Rizwan M.Tuman is correct. Here are the execution times of the three functions over 100,000 runs.

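import re
import time
from nltk.tokenize import word_tokenize  # assuming word_tokenize is NLTK's tokenizer

lis=word_tokenize(sentence)
def temp(lis):
    proper_noun=[]
    for word in lis:
        for letter in word:
            if letter.isupper():
                proper_noun.append(word)
                break
    return proper_noun

def temp2(sentence):
    return re.findall(r'[a-z]*[A-Z]+[a-z]*', sentence)

# capital_letter is the pattern from the accepted answer (its definition is not shown here)
def temp3(sentence):
    return re.findall(capital_letter,sentence)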

Each function was timed this way:

start_time = time.time()
for k in range(100000):
    temp2(sentence)
print("%s seconds" % (time.time() - start_time))

Here are the results:

>>Check each character of a list of words if it is a capital letter (.isupper())
(sentence has already been split into words)
0.4416656494140625 seconds

>>Function with re module which finds normal capital letters [A-Z] :
0.9373950958251953 seconds

>>Function with re module which finds all kinds of capital letters :
1.0783331394195557 seconds
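
For anyone who wants to repeat the comparison, here is a sketch using the standard timeit module instead of manual time.time() bookkeeping. It is not the exact script used above: the function names are made up, str.split() stands in for word_tokenize so the example has no third-party dependency, and the patterns are compiled once up front. Absolute numbers will differ from the ones listed.

```python
import re
import timeit

sentence = "nous avons 2 Đồng et 3 Euro"
words = sentence.split()  # assumption: a plain split replaces word_tokenize

# Vietnamese uppercase letters from the accepted answer, listed once.
VIET_UPPER = "AĂÂÁẮẤÀẰẦẢẲẨÃẴẪẠẶẬĐEÊÉẾÈỀẺỂẼỄẸỆIÍÌỈĨỊOÔƠÓỐỚÒỒỜỎỔỞÕỖỠỌỘỢUƯÚỨÙỪỦỬŨỮỤỰYÝỲỶỸỴ"
ASCII_PATTERN = re.compile(r"[a-z]*[A-Z]+[a-z]*")
VIET_PATTERN = re.compile(r"\b\S*[" + VIET_UPPER + r"A-Z]+\S*\b")


def by_isupper(word_list):
    # Character-by-character check on words that are already split.
    return [w for w in word_list if any(c.isupper() for c in w)]


def by_ascii_regex(text):
    # ASCII-only pattern from the question.
    return ASCII_PATTERN.findall(text)


def by_viet_regex(text):
    # Pattern from the accepted answer.
    return VIET_PATTERN.findall(text)


def benchmark(number=100_000):
    """Return the total seconds spent on `number` calls of each approach."""
    return {
        "isupper": timeit.timeit(lambda: by_isupper(words), number=number),
        "ascii regex": timeit.timeit(lambda: by_ascii_regex(sentence), number=number),
        "vietnamese regex": timeit.timeit(lambda: by_viet_regex(sentence), number=number),
    }


if __name__ == "__main__":
    for name, seconds in benchmark().items():
        print("%-18s %.4f seconds" % (name, seconds))
```

Note that the isupper variant is timed on a pre-split list, as in the measurements above, so the cost of tokenizing is not included in its figure.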

score 1 · Answer 3 · answered Jun 26 '18 at 10:29

To match only 1+ letter chunks that contain at least 1 uppercase Unicode letter you may use

import re, sys

pLu = '[{}]'.format("".join([chr(i) for i in range(sys.maxunicode) if chr(i).isupper()]))
p = re.compile(r"[^\W\d_]*{Lu}[^\W\d_]*".format(Lu=pLu))

sentence = 'nous avons 2 Đồng et 3 Ęułro.+++++++++++++++Next line'
print(p.findall(sentence))
# => ['Đồng', 'Ęułro', 'Next']

The pLu is a character class pattern covering every uppercase character, built dynamically by testing each code point with str.isupper(). It depends on the Python version, so use the latest to include as many Unicode uppercase letters as possible (see this answer for more details, too). The [^\W\d_] is a construct matching any Unicode letter. So, the pattern matches any 0+ Unicode letters, followed by at least 1 Unicode uppercase letter, and then any 0+ Unicode letters.

Note that your original r'[a-z]*[A-Z]+[a-z]*' will only find Next in this input:

print(re.findall(r'[a-z]*[A-Z]+[a-z]*', sentence)) # => ['Next']

See the Python demo

To match the words as whole words, use \b word boundary:

p = re.compile(r"\b[^\W\d_]*{Lu}[^\W\d_]*\b".format(Lu=pLu))

If you use Python 2.x, do not forget the re.U flag to make \W, \d and \b Unicode aware. However, it is recommended to use the latest PyPI regex library and its [[:upper:]] / \p{Lu} constructs to match uppercase letters, since it supports an up-to-date list of Unicode letters.
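
A possible sketch of the regex-library variant, assuming the third-party regex package is installed (pip install regex). The function name find_capitalized is hypothetical, and the import guard is only there so the snippet still loads when the package is missing.

```python
try:
    import regex  # third-party module, not part of the standard library
except ImportError:
    regex = None


def find_capitalized(sentence):
    """Return whole words containing at least one uppercase Unicode letter."""
    if regex is None:
        raise RuntimeError("the third-party 'regex' module is required")
    # \p{L} is any Unicode letter, \p{Lu} any uppercase Unicode letter.
    return regex.findall(r"\b\p{L}*\p{Lu}\p{L}*\b", sentence)


if regex is not None:
    print(find_capitalized("nous avons 2 Đồng et 3 Ęułro.+++++++++++++++Next line"))
```

With this approach there is no need to build the large pLu character class by hand, since the library supplies the uppercase property itself.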

Firstly thank you. Your answer is more convenient for all kinds of language. However my question is focused on Vietnamese. In addition, this method is slow, 20 times slower than the other one. I have taken into account the detection of Unicode letter. — huseyin39, Jun 26 '18 at 16:19
@huseyin39 I see. Still, there is something we all missed about Vietnamese: if the letters are multibyte chars, the `[...]` based regex won't work as those letters are made by base (Latin) letters and diacritics. You would need to "cook" a more complex pattern if you want to match them. — Wiktor Stribiżew, Jun 26 '18 at 16:22
You are right about the size of vietnamese character ('ẵ' is 76 bytes when 'a' is 50). But, does it matter? — huseyin39, Jun 27 '18 at 11:11
@huseyin39 You will need to add them as alternatives in a grouping construct: `([AĂÂÁẮẤÀ.. and the rest of precomposed letters]|first_multibyte_letter|...|nth_multibyte_letter)`. — Wiktor Stribiżew, Jun 27 '18 at 11:18
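
On the decomposed-letter concern raised in these comments: one possible workaround, not taken from the answers above, is to normalize the text to NFC first, so that a base letter followed by combining diacritics becomes a single precomposed character wherever one exists. The sketch below also replaces the large character class with a letter-run regex followed by a str.isupper() filter; LETTER_RUN and words_with_uppercase are hypothetical names.

```python
import re
import unicodedata

# [^\W\d_]+ matches a run of Unicode letters; combining marks are not letters
# for this purpose, so a decomposed word is cut at its first diacritic.
LETTER_RUN = re.compile(r"[^\W\d_]+")


def words_with_uppercase(text):
    """Return letter runs that contain at least one uppercase character."""
    # Assumption: converting the input to NFC is acceptable for the caller.
    composed = unicodedata.normalize("NFC", text)
    return [w for w in LETTER_RUN.findall(composed) if any(c.isupper() for c in w)]


# Simulate input in which diacritics are stored as separate combining marks.
decomposed = unicodedata.normalize("NFD", "nous avons 2 Đồng et 3 Euro")

print(LETTER_RUN.findall(decomposed))    # words are split at the combining marks
print(words_with_uppercase(decomposed))  # whole words again after NFC
```

NFC only helps for sequences that have a precomposed form in Unicode, which is assumed here to be the case for the Vietnamese letters discussed in this thread.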
