Checking for a substring in unicode value

Question

Suppose I have a variable that has a unicode value in a Python script.

 place_name = u'K\u016bla Mountain'

In this instance, 016b denotes that a macron accent mark is used over the u. I want to check for '016b' in the substring and if found, change place_name to u'Kula Mountain'. If it was just a string, I could use:

if '016b' in place_name:
    place_name = 'Kula Mountain'

But that won't work with the unicode value. What's the simplest way to check for '016b' and, if found, change place_name to the unicode value u'Kula Mountain'?

Note, I tried:

if '016b' in ord(alt_map_name):
    place_name = u'Kula Mountain'

as suggested by other posts on this issue, but got

Traceback (most recent call last):
  File "<string>", line 1, in <module>
TypeError: ord() expected a character, but string of length 16 found

EDIT: To be clear, I just want to check for the macron (0x016b), be it with a 'u' or any other letter.

`place_name.replace(u"\u016b", "u")` should work, too. (you do not want to replace _all_ place names that have that letter with "Kula Mountain", right?) — tobias_k, Aug 24 '23 at 14:11
Does this answer your question? [What is the best way to remove accents (normalize) in a Python unicode string?](https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-normalize-in-a-python-unicode-string) — Reinderien, Aug 24 '23 at 14:11
@tobias_k - Well, my example is simplified, but I need to check for the `016b` first before I replace anything. – gwydion93, Aug 24 '23 at 14:14
Kelly Bundy - I am not trying to destroy information. There are some complicated conversion issues involving GIS, fonts, and other things and a decision was made to simply remove the macron in certain instances (which is why I need to check for it first). I am not trying to create an ethical or philosophical discussion on accent marks, just need some helpful feedback on my question. — gwydion93, Aug 24 '23 at 14:29
Reinderien - as far as I could see, that post only covered removing diacritics, not checking for them. I need to check that it's there first. – gwydion93, Aug 24 '23 at 14:30
"In this instance, 016b denotes that a macron accent mark is used over the u" - that's not what it means. `\u016b` means the Unicode code point with hex value 016b. That character happens to be ū, but the fact that it's an accented "u" is a complete coincidence. — user2357112, Aug 24 '23 at 14:31
Well the information is there and you want to remove it. But ok, if GIS etc truly can't handle it, then that's decent reason. Isn't clear from your question, hence I asked. Plenty of people are confused about string representations and ask for things they don't actually need. — Kelly Bundy, Aug 24 '23 at 14:37
user2357112 - point taken and forgive my unfamiliarity with these diacritics. How would you propose checking for a macron in unicode then? Assume the macron could hypothetically be with a u, or any other letter. — gwydion93, Aug 24 '23 at 14:37
"I need to check for the `016b` first" Why? If it's not there, replace or normalize won't do any harm. Maybe show a more complete example for context? – tobias_k, Aug 24 '23 at 14:42
tobias_k - I tried to craft my question in a generic sense, but the main reason is because I don't want to remove ALL diacritic values from the unicode, just the macron, which is why I needed to check for 016b first. — gwydion93, Aug 24 '23 at 14:44
Noted. I updated my original question to clarify. Sorry for the confusion. — gwydion93, Aug 24 '23 at 14:50

user2357112 · Answer 1 · 2023-08-25T01:31:19.980

3

You've misinterpreted your input.

The 016b does not denote a macron over the "u". Instead, \u016b is an escape sequence, representing the Unicode code point with hex value 016b. That code point happens to be ū, U+016B LATIN SMALL LETTER U WITH MACRON, but the fact that it's an accented "u" has nothing to do with the "u" in the escape sequence.

Your string does not have a 0, 1, 6, or b in it. The string literal you wrote has those characters in it, but the string it evaluates to has a ū character in it. Searching for "016b" will not find a match.
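A quick check makes the distinction concrete. This is a minimal sketch that assumes Python 3 and reuses the place_name value from the question:

```python
place_name = 'K\u016bla Mountain'

# The escape sequence collapses to a single character, so the string is 13 characters long
print(len(place_name))           # 13

# The digits and letters of the escape sequence are not part of the string itself
print('016b' in place_name)      # False

# The character the escape sequence stands for is part of the string
print('\u016b' in place_name)    # True

# The second character has code point 0x16b
print(hex(ord(place_name[1])))   # 0x16b
```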

If you want to remove macrons from your input, you can apply canonical decomposition to transform the composed character into separate "u" and combining macron (U+0304 COMBINING MACRON) code points, then remove the combining macrons:

import unicodedata

# NFD normalization applies canonical decomposition, splitting apart composed characters
decomposed_place_name = unicodedata.normalize('NFD', place_name)

# \N escape sequences let you refer to a code point by name.
# Alternatively, you could use '\u0304' to refer to it by hex numeric value.
place_name_without_macrons = decomposed_place_name.replace('\N{COMBINING MACRON}', '')
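If the goal is to test for a macron on any letter before changing anything, the same decomposition can back the check. The following is only a sketch, assuming Python 3; has_macron and strip_macrons are hypothetical helper names, not part of unicodedata. It only detects the combining macron exposed by NFD, so standalone characters such as the spacing macron (U+00AF) are not covered.

```python
import unicodedata

MACRON = '\N{COMBINING MACRON}'  # U+0304

def has_macron(text):
    """Return True if any letter in text carries a macron after canonical decomposition."""
    return MACRON in unicodedata.normalize('NFD', text)

def strip_macrons(text):
    """Remove macrons only, leaving every other diacritic in place."""
    stripped = unicodedata.normalize('NFD', text).replace(MACRON, '')
    # Recompose so that any remaining accents return to their precomposed forms
    return unicodedata.normalize('NFC', stripped)

print(has_macron('K\u016bla Mountain'))     # True
print(has_macron('caf\u00e9'))              # False
print(strip_macrons('K\u016bla Mountain'))  # Kula Mountain
```

The final NFC step is a design choice: without it, characters such as é would stay in their decomposed two-code-point form, which some downstream tools may handle differently.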

edited Aug 25 '23 at 01:31

answered Aug 24 '23 at 14:47

user2357112


Nice, this seems to do exactly what OP wants, and I learned something new, too. – tobias_k Aug 24 '23 at 14:52
To make the code a bit more generic so it can handle other diacritics, `import regex as re` then normalise as above, then `place_name_without_diacritics = re.sub(r'\p{Mn}', '', decomposed_place_name)` – Andj Aug 25 '23 at 00:09
@Andj: Apparently they don't want to remove other diacritics, just macrons. That's useful for other people's use cases, though. – user2357112 Aug 25 '23 at 00:12
@user2357112, Thanks, that was the purpose of adding a generic solution. I didn't feel it was worth its own answer since your solution was sufficient for the question. I was aware the question specifically referred to macrons, although the presence of a macron is intriguing, not what I would have expected. – Andj Aug 25 '23 at 01:46
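The generic variant mentioned in the comments relies on the third-party regex module. A standard-library sketch of the same idea, assuming Python 3, filters on the Mn (nonspacing mark) general category instead; strip_marks is a hypothetical helper name. Note that this removes all combining diacritics, not just macrons.

```python
import unicodedata

def strip_marks(text):
    """Drop every nonspacing mark (general category Mn) after canonical decomposition."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')

print(strip_marks('K\u016bla Mountain'))  # Kula Mountain
print(strip_marks('caf\u00e9'))           # cafe
```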

vegan_meat · Accepted Answer · 2023-08-24T14:32:54.490

-3

place_name = u'K\u016bla Mountain'


if 0x016b in [ord(c) for c in place_name]:
    place_name = u'Kula Mountain'
print(place_name)

Output:

Kula Mountain

In your case, 0x016b is the Unicode code point of the character 'ū' (u with macron), and ord() takes a single character as its argument, so you can use a list comprehension to call it on each character of the string.
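To see what this check does and does not cover, here is a small sketch assuming Python 3. The second string is an illustrative value that is not from the question; it contains ā (U+0101), a different letter with a macron.

```python
place_name = 'K\u016bla Mountain'
other_name = 'M\u0101noa'  # illustrative: a macron over "a" rather than "u"

# Comparing code points one by one is equivalent to a plain membership test
by_ord = 0x016b in [ord(c) for c in place_name]
by_membership = '\u016b' in place_name
print(by_ord, by_membership)                   # True True

# The check is specific to U+016B, so a macron on another letter goes unnoticed
print(0x016b in [ord(c) for c in other_name])  # False
```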

edited Aug 24 '23 at 14:32

answered Aug 24 '23 at 14:27

vegan_meat



`0x016b in [ord(c) for c in place_name]` is an overly complicated way of saying `u'\u016b' in place_name` – Steven Rumbalski Aug 24 '23 at 14:37
Hey Steven - Your solution does work; however, I just need to check for a macron, period, and `u'\u016b'` is specific to the 'u'. I tried just using `u'016b'`, but that didn't quite work. – gwydion93 Aug 24 '23 at 14:42
This solution works nicely. Thanks for the feedback! – gwydion93 Aug 24 '23 at 14:45
Are you going to hardcode every possible string like that? And why would you change every place that happens to have a 'ū' to Kula Mountain? – Kelly Bundy Aug 24 '23 at 14:51

Sorry, but no, it does not work nicely, or does not do what you want or think it does, because `0x016b` is _not_ any macron, but specifically "u with macron". – tobias_k Aug 24 '23 at 14:52
