Python 2.7: I would like to do something unusual. Most people want to convert string literals to more human-readable strings. I would like to go the other way and convert the following list of unicode strings into their literal forms:

  • hallöchen
  • Straße
  • Gemüse
  • freø̯̯nt

to their codepoint forms that look something like this:

\u3023\u2344

You'll notice that freø̯̯nt has two inverted breves below the ø. I would like to convert especially that word into its literal form, such that I can use REGEX to remove the extra breve.

I am not sure what the terminology is for these things—please correct me if I am mistaken.

Jonathan Komar

2 Answers


You can use the str.encode([encoding[, errors]]) method with the unicode_escape encoding:

>>> s = u'freø̯̯nt'
>>> print(s.encode('unicode_escape'))
fre\xf8\u032f\u032fnt
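
As a minimal sketch of the same idea that runs unchanged on Python 2.7 and Python 3: the word is written with escapes so the source file stays pure ASCII and needs no encoding declaration. The variable names `escaped` and `restored` are only illustrative.

```python
# The word is spelled with escapes, so this file contains only ASCII.
# U+00F8 is the o-with-stroke; U+032F is COMBINING INVERTED BREVE BELOW.
s = u'fre\u00f8\u032f\u032fnt'

# encode() returns a byte string; decoding it as ASCII gives text again.
escaped = s.encode('unicode_escape').decode('ascii')
print(escaped)

# Decoding with the same codec restores the original string.
restored = escaped.encode('ascii').decode('unicode_escape')
print(restored == s)
```

The escaped form is plain ASCII text, so it can be logged or pasted back into source code as the body of a `u'...'` literal.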
Anym
  • thanks but I get this error: `SyntaxError: Non-ASCII character '\xc3' in file ~/Python/unicode to raw literals.py on line 1, but no encoding declared;` – Jonathan Komar Dec 26 '13 at 10:44
  • @macmadness86: Well, which encoding is your editor using? In Python 2, ASCII is the default source code encoding - if you use something else, you need to explicitly declare it at the top of the script. – Tim Pietzcker Dec 26 '13 at 11:43
  • @TimPietzcker That was it! I just added `# -*- coding: utf-8 -*-` to the top of the file and it worked. Thanks! I assume this works because I have unicode characters in my script: `freø̯̯nt` – Jonathan Komar Dec 26 '13 at 11:56

> You'll notice that freø̯̯nt has two inverted breves below the ø. I would like to convert especially that word into its literal form, such that I can use REGEX to remove the extra breve.

You don't need codecs.encode(unicode_string, 'unicode-escape') in this case. A string literal is only source-code notation; in memory there are only string objects, and a regex can match their codepoints directly.

A Unicode string in Python is a sequence of Unicode codepoints. The same user-perceived character can be written using different codepoints, e.g., 'Ç' can be written as u'\u00c7' or as u'\u0043\u0327'.
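
To illustrate, here is a small sketch using the standard `unicodedata` module. The two spellings compare unequal as codepoint sequences, and normalization converts one into the other. The variable names are only illustrative.

```python
import unicodedata

precomposed = u'\u00c7'       # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = u'\u0043\u0327'  # "C" followed by COMBINING CEDILLA

# Same user-perceived character, different codepoint sequences.
print(precomposed == decomposed)
print('%d vs %d codepoints' % (len(precomposed), len(decomposed)))

# NFC composes the pair into one codepoint; NFD splits it apart again.
nfc = unicodedata.normalize('NFC', decomposed)
nfd = unicodedata.normalize('NFD', precomposed)
print(nfc == precomposed)
print(nfd == decomposed)
```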

You could use the NFKD Unicode normalization form to make sure the "breves" are separate codepoints, so that you don't miss them when they are duplicated:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import unicodedata

s = u"freø̯̯nt"
# remove consecutive duplicate "breves"
print(re.sub(u'\u032f+', u'\u032f', unicodedata.normalize('NFKD', s)))

> Could you explain why your re.sub command does not have any +1 for ensuring that the breves are consecutive characters? (like @Paulo Freitas's answer)

re.sub('c+', 'c', text) makes sure that there are no 'cc', 'ccc', 'cccc', etc. in the text. Sometimes the regex does unnecessary work by replacing a single 'c' with 'c', but the result is the same: no consecutive duplicate 'c' in the text.
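
A quick way to see this, as a sketch with a made-up sample string: `re.subn` returns the new string together with the number of replacements, so the no-op replacement of a lone 'c' shows up in the count.

```python
import re

# Hypothetical sample: one single 'c', one run of three, one run of five.
text = u'abc accc acccccd'

# subn() returns (new_string, number_of_replacements).
result, count = re.subn(u'c+', u'c', text)
print(result)
print(count)  # includes the lone 'c' in 'abc' being replaced by itself
```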

The regex from @Paulo Freitas's answer should also work:

no_duplicates = re.sub(u'(\u032f)\\1+', r'\1', unicodedata.normalize('NFKD', s))

It performs the replacement only for duplicates. If this is a bottleneck in your application, measure both and see which regex runs faster.
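
One possible way to compare the two, sketched with the standard `timeit` module. The input size, the repetition count, and the helper names `collapse_plus` and `collapse_backref` are arbitrary choices for illustration; timings depend on your data and Python version, so none are shown here.

```python
import re
import timeit
import unicodedata

# Illustrative input: the sample word repeated many times.
s = unicodedata.normalize('NFKD', u'fre\u00f8\u032f\u032fnt' * 1000)

plus_re = re.compile(u'\u032f+')
backref_re = re.compile(u'(\u032f)\\1+')

def collapse_plus(text):
    # Replaces every run of U+032F, including runs of length one.
    return plus_re.sub(u'\u032f', text)

def collapse_backref(text):
    # Replaces only runs of two or more U+032F.
    return backref_re.sub(u'\\1', text)

# Both approaches must agree before a timing comparison means anything.
same = collapse_plus(s) == collapse_backref(s)
print(same)

t_plus = timeit.timeit(lambda: collapse_plus(s), number=50)
t_backref = timeit.timeit(lambda: collapse_backref(s), number=50)
print('plus: %.4fs  backref: %.4fs' % (t_plus, t_backref))
```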

jfs
  • I was not aware of the term, `codepoint`. Then, this is what I wanted to do: convert a unicode string object into its equivalent string using codepoints. Is that correct? – Jonathan Komar Dec 26 '13 at 11:50
  • @macmadness86 if you're not familiar with codepoints, you probably want to read [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://www.joelonsoftware.com/articles/Unicode.html). – lvc Dec 26 '13 at 11:57
  • @macmadness86: no. It is not correct. You convert one sequence of (Unicode) codepoints into another. Compare `unicodedata.normalize('NFC', s)` and `unicodedata.normalize('NFKD', s)` e.g., where `s = u'Ç'`. Here's a [codepoint for you](http://codepoints.net/U+1F384) – jfs Dec 26 '13 at 12:07
  • @J.F.Sebastian Could you explain why your re.sub command does not have any `+1` for ensuring that the breves are consecutive characters? (like Paulo Freitas's answer here: http://stackoverflow.com/questions/11460855/python-how-to-remove-duplicates-only-if-consecutive-in-a-string) – Jonathan Komar Jan 05 '14 at 22:04
  • @macmadness86: I've explained how `re.sub('c+', 'c', text)` works – jfs Jan 06 '14 at 08:39