
I am trying to learn Python, so I thought I would start by querying IMDB to check my movie collection against it, which was going well.

What I am stuck on is how to handle special characters in names, and how to encode a name into something a URL will accept.

For example, I have the movie Brüno.

If I encode the string using urllib.parse.quote I get Bru%CC%88no, which means that when I query IMDB using the OMDBAPI it fails to find the movie. If I do the search via the OMDBAPI site, they encode the name as Br%C3%BCno, and that search works.
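For reference, this is roughly what I am doing (a minimal reproduction; I am assuming here that the ü in my file name is stored as a "u" followed by a combining diaeresis, since that matches the output I see):

>>> import urllib.parse
>>> urllib.parse.quote("Bru\u0308no")  # "u" + U+0308 COMBINING DIAERESIS
'Bru%CC%88no'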

I am assuming that the encoding is using a different standard, but I can’t work out what I need to do.

PhilC
  • This is actually a bug on IMDB’s side (this doesn’t immediately help *you*, of course). – Konrad Rudolph Mar 22 '19 at 14:36
  • @KonradRudolph - just out of curiosity, why would this be considered a bug on IMDB's side? Is there a standard or other reason that NFC form should not be used in url encoding? – benvc Mar 22 '19 at 14:51
  • 2
    @benvc The search API shouldn’t assume any one input normalisation, and instead perform its own, or otherwise ensure that relevant results are found regardless of how input is normalised: The Unicode standard is entirely clear on this: “Brüno” = “Brüno”, even if the first string uses a dedicated codepoint and the second uses a combining diacritic. Comparison must happen on the level of grapheme clusters, not on the level of bytes or codepoints. – Konrad Rudolph Mar 22 '19 at 15:07
  • 1
    @KonradRudolph - makes sense, thanks for taking the time to explain. – benvc Mar 22 '19 at 15:40

1 Answer


Both are using the same encoding (UTF-8), but different Unicode normalizations.

>>> import unicodedata
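>>> # note: the "Brüno" literal below is in decomposed form (u + U+0308 combining diaeresis)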
>>> "Brüno".encode("utf-8")
b'Bru\xcc\x88no'
>>> unicodedata.normalize("NFC", "Brüno").encode("utf-8")
b'Br\xc3\xbcno'

Some graphemes (things you see as one "character"), especially those with diacritics, can be made from different character sequences. An "ü" can either be a "u" with a combining diaeresis, or the character "ü" itself (the combined form). Combined forms don't exist for every combination of letter and diacritic, but they do for commonly used ones (i.e. those existing in common languages).
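To make that concrete, here is a small sketch (the escape sequences spell out the two forms explicitly):

>>> import unicodedata
>>> "Bru\u0308no" == "Br\u00fcno"  # decomposed vs. precomposed: compared codepoint by codepoint
False
>>> unicodedata.normalize("NFC", "Bru\u0308no") == "Br\u00fcno"
True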

Unicode normalization transforms all characters that form graphemes into either combined or separate characters. The normalization form "NFC", or Normalization Form Canonical Composition, combines characters as far as possible.
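For your use case, that suggests normalizing to NFC before quoting; a minimal sketch, which produces the form the OMDBAPI site uses:

>>> import unicodedata, urllib.parse
>>> urllib.parse.quote(unicodedata.normalize("NFC", "Brüno"))
'Br%C3%BCno'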

In comparison, the other main form, Normalization Form Canonical Decomposition ("NFD"), will produce your version:

>>> unicodedata.normalize("NFD", "Brüno").encode("utf-8")
b'Bru\xcc\x88no'
L3viathan
  • Technically the result of a normalisation *is* an encoding. “Encoding” in the context of Unicode (as in your answer) often refers to the different UTFs (i.e. the encoding of code points) but the meaning of encoding is more general. `Br%C3%BCno` and `Bru%CC%88no` are two different encoded forms of the information “Brüno”. – Konrad Rudolph Mar 22 '19 at 14:39
  • 3
    As a minor addition that may be useful to some, you can also explicitly get the comparable NFD output, `b'Bru\xcc\x88no'`, using `unicodedata.normalize("NFD", "Brüno").encode("utf-8")` to illustrate which normalization form is being used in each instance. – benvc Mar 22 '19 at 14:46
  • @KonradRudolph Yes and no: the result of normalisation is still an instance of `str`, not `bytes`; it only changes which Unicode codepoints are used. Nothing is declared about how these are represented yet. But yes, all strings in a computer are encoded somehow, even "Unicode strings". – L3viathan Mar 22 '19 at 14:47
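To illustrate the distinction (a quick sketch): normalization maps str to str, while encoding maps str to bytes.

>>> import unicodedata
>>> type(unicodedata.normalize("NFC", "Brüno"))  # normalization: str -> str
<class 'str'>
>>> type("Brüno".encode("utf-8"))  # encoding: str -> bytes
<class 'bytes'>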