I'm facing an issue. Indeed, I work with vietnamese texts and I want to find every word containing uppercase(s) (capital letter). When I use the 're' module, my function (temp) does not catch word like "Đà". The other way (temp2) is to check each character at a time, it works but it is slow since I have to split the sentences into words.
Hence I would like to know if there is a way of the "re" module to catch all the special capital letter.
I have 2 ways :
def temp(sentence):
return re.findall(r'[a-z]*[A-Z]+[a-z]*', sentence)
lis=word_tokenize(sentence)
def temp2(lis):
proper_noun=[]
for word in lis:
for letter in word:
if letter.isupper():
proper_noun.append(word)
break
return proper_noun
Input:
'nous avons 2 Đồng et 3 Euro'
Expected output :
['Đồng','Euro']
Thank you!