You'll notice that freø̯̯nt has two inverted breves below the ø. I would like to convert especially that word into its literal form, such that I can use REGEX to remove the extra breve.
You don't need codecs.encode(unicode_string, 'unicode-escape') in this case. There are no string literals in memory, only string objects; a literal is just the notation used to write a string in source code.
A Unicode string in Python is a sequence of Unicode code points. The same user-perceived character can be written using different code points, e.g., 'Ç' could be written as u'\u00c7' or as u'\u0043\u0327'.
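As a small illustration of that point (a sketch assuming Python 3; the variable names are only for this example), the two spellings of 'Ç' compare unequal as stored, yet normalize to the same sequence:

```python
import unicodedata

composed = u'\u00c7'          # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = u'\u0043\u0327'  # 'C' followed by COMBINING CEDILLA

# Different code point sequences, so the strings are not equal as-is.
print(composed == decomposed)   # False
print(len(composed))            # 1
print(len(decomposed))          # 2

# Normalization maps both spellings onto one canonical form.
print(unicodedata.normalize('NFC', decomposed) == composed)   # True
print(unicodedata.normalize('NFD', composed) == decomposed)   # True
```

This is why normalizing before applying a regex helps: the pattern only has to match one spelling.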
You could use the NFKD Unicode normalization form to make sure the "breves" are separate code points, so that you don't miss them when they are duplicated:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import unicodedata
s = u"freø̯̯nt"
# remove consecutive duplicate "breves"
print(re.sub(u'\u032f+', u'\u032f', unicodedata.normalize('NFKD', s)))
Could you explain why your re.sub command does not have any +1 for ensuring that the breves are consecutive characters? (like @Paulo Freitas's answer)
re.sub('c+', 'c', text) makes sure that there are no 'cc', 'ccc', 'cccc', etc. in the text. Sometimes the regex does unnecessary work by replacing 'c' with 'c'. But the result is the same: no consecutive duplicate 'c' in the text.
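A minimal sketch of that behaviour, assuming Python 3 (collapse_runs is a hypothetical helper name used only here, not part of any library):

```python
import re

def collapse_runs(text, char):
    """Replace every run of one or more `char` with a single `char`."""
    # re.escape keeps `char` literal in the pattern; the lambda keeps it
    # literal in the replacement as well.
    return re.sub(re.escape(char) + u'+', lambda match: char, text)

# A run of four 'c' collapses to one; the lone 'c' is "replaced" by itself.
print(collapse_runs(u'accccb c', u'c'))  # acb c

# The same idea applied to U+032F, the combining inverted breve below.
word = u'fre\u00f8\u032f\u032fnt'
print(collapse_runs(word, u'\u032f') == u'fre\u00f8\u032fnt')  # True
```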
The regex from @Paulo Freitas's answer should also work:
no_duplicates = re.sub(u'(\u032f)\\1+', r'\1', unicodedata.normalize('NFKD', s))
It performs the replacement only for duplicates. If this is a bottleneck in your application, you can time both and see which regex runs faster.
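One way to make that comparison is with timeit. The following is only a sketch assuming Python 3; the test text (the word repeated 1000 times), the function names, and the repetition count are arbitrary choices for illustration, and the timings will vary by machine and input:

```python
import re
import timeit
import unicodedata

BREVE = u'\u032f'
# Arbitrary sample input: the word with a doubled breve, repeated.
text = unicodedata.normalize('NFKD', u'fre\u00f8\u032f\u032fnt' * 1000)

run_pattern = re.compile(u'\u032f+')        # matches single breves too
dup_pattern = re.compile(u'(\u032f)\\1+')   # matches only duplicated breves

def with_plus():
    return run_pattern.sub(BREVE, text)

def with_backreference():
    return dup_pattern.sub(r'\1', text)

# Both approaches must produce the same result before timing them.
assert with_plus() == with_backreference()

for func in (with_plus, with_backreference):
    seconds = timeit.timeit(func, number=200)
    print(func.__name__, seconds)
```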