How do I convert a list of strings to a unicode value?

Question

I receive the following:

value = ['\', 'n']

and my regular routine of converting to unicode and calling ord throws the error:

TypeError: ord() expected a character, but string of length 2 found

It would seem that I need to join the characters within the list if len(value) > 1.

How do I go about doing this?

Can you show us what "my regular routine" looks like? Because doing what you describe, `unicode(value)`, gives you an 11-character string, not 2. (Actually, it doesn't even get that far, because you'll get a `SyntaxError` from trying to enter that `value = ['\', 'n']` line…) — abarnert, Sep 06 '13 at 23:40
@PhillipCloud: I don't think it is. Presumably his "regular routine" is something like one of the answers to that problem, and his problem is something beyond that which I haven't figured out yet. — abarnert, Sep 06 '13 at 23:42
In addition to showing us the code that doesn't work, please show us the actual contents of `value` (that is, copy and paste what you get if you `print` it), and the output you're hoping for. — abarnert, Sep 06 '13 at 23:54
You may want to read the Unicode HOWTO for Python [2.x](http://docs.python.org/2/howto/unicode.html) or [3.x](http://docs.python.org/3/howto/unicode.html) as appropriate. There are also a number of blog posts out there that try to make things clearer; I don't have a specific one to recommend, but Google turns up a bunch of options. — abarnert, Sep 07 '13 at 00:13

abarnert · Accepted Answer · 2013-09-07T00:08:29.763


If you're trying to figure out how to treat this as a single string '\\n' that can then be interpreted as the single character '\n' according to some set of rules, like Python's unicode-escape rules, you have to decide exactly what you want before you can code it.

First, to turn a list of two single-character strings into one two-character string, just use join:

>>> value = ['\\', 'n']
>>> escaped_character = ''.join(value)
>>> escaped_character
'\\n'

Next, to interpret a two-character escape sequence as a single character, you have to know which escape rules you're trying to undo. If it's Python's Unicode escape, there's a codec named unicode_escape that does that:

>>> character = escaped_character.decode('unicode_escape')
>>> character
u'\n'
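The session above is Python 2, where `str` has a `decode` method. In Python 3, `str` has no `decode`, so one possible equivalent is to go through bytes first. This is only a sketch, not part of the original answer: it assumes the escaped text is pure ASCII, and the name `parts` is illustrative.

```python
# Python 3 sketch; assumes the escaped text contains only ASCII characters.
parts = ['\\', 'n']            # two one-character strings: a backslash and 'n'
escaped = ''.join(parts)       # one two-character string: backslash + 'n'

# str has no decode() in Python 3, so encode to bytes and decode from there.
character = escaped.encode('ascii').decode('unicode_escape')

print(len(escaped), len(character))  # 2 1
print(ord(character))                # 10
```

If the input can contain non-ASCII characters, the `encode('ascii')` step raises `UnicodeEncodeError`, which is a reasonable signal that a different unescaping rule is needed.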

If, on the other hand, you're trying to undo UTF-8 encoding followed by Python string-escape, or C backslash escapes, or some other scheme, you have to write something different. And given what you've said about UTF-8, I think you probably do want something different. For example, u'é'.encode('UTF-8') is the two-byte sequence '\xc3\xa9'. Just calling decode('unicode_escape') on that will give you the two-character sequence u'\u00c3\u00a9' (that is, u'Ã©'), which is not what you want.
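The mismatch between UTF-8 and `unicode_escape` can be reproduced in a few lines. The following is a Python 3 sketch (the answer itself uses Python 2), and the variable names are illustrative.

```python
# Python 3 sketch of what happens when UTF-8 bytes meet unicode_escape.
encoded = '\u00e9'.encode('utf-8')     # 'é' as UTF-8: the two bytes C3 A9

# unicode_escape treats every non-escape byte as a Latin-1 character,
# so the two bytes come back as two separate characters (mojibake).
wrong = encoded.decode('unicode_escape')

# Decoding with the codec that was actually used restores one character.
right = encoded.decode('utf-8')

print([hex(ord(c)) for c in wrong])    # ['0xc3', '0xa9']
print([hex(ord(c)) for c in right])    # ['0xe9']
```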

Anyway, now that you've got a single character, just call ord:

>>> char_ord = ord(character)
>>> char_ord
10

I'm not sure what the convert-to-unicode bit is about. If this is Python 3.x, the strings are already Unicode. If it's 2.x, and the strings are ASCII, it's guaranteed that ord(s) == ord(unicode(s)). If it's 2.x, and the strings are in some other encoding, just calling unicode on them is going to give you a UnicodeError or mojibake; you need to pass an encoding in as well, in which case you might as well use the decode method.
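In Python 3 terms, the same point is that `bytes` must be decoded with an explicit, correct encoding. A minimal sketch, assuming the data is UTF-8 (the names `raw`, `mojibake`, and `correct` are illustrative):

```python
# Python 3 sketch; in Python 2 the same distinction is str versus unicode.
raw = b'\xc3\xa9'                      # UTF-8 bytes for U+00E9

# For pure ASCII data the byte value and the code point agree.
assert ord('n') == b'n'[0] == 110

# The wrong codec either fails outright ...
try:
    raw.decode('ascii')
    failed = False
except UnicodeDecodeError:
    failed = True

# ... or silently produces mojibake; only the right codec gives one character.
mojibake = raw.decode('latin-1')       # two characters
correct = raw.decode('utf-8')          # one character

print(failed, len(mojibake), len(correct), ord(correct))  # True 2 1 233
```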

edited Sep 07 '13 at 00:08

answered Sep 06 '13 at 23:46

abarnert


Sorry, my question probably isn't very clear (and my knowledge about encoding probably isn't bang on, either). My goal is to convert the character '\n' to its UTF-8 code point and from there convert it to binary/decimal. – sdasdadas Sep 06 '13 at 23:49
@sdasdadas: First, where does the character `'\n'` come from in your example? Second, a single character can be 1-4 bytes in UTF-8, so "its UTF-8 code point" is meaningless. And if you have a UTF-8 string, and you want to get the numerical value of each byte, just call `ord` on each byte; no need to convert to a Unicode string unless you want to get the numerical values of the Unicode characters that the bytes decode to. – abarnert Sep 06 '13 at 23:51
The character '\n' comes from a parser which returns the string `"\n"`. The code for `''.join(...)` solves what I initially meant to ask, so I thank you for that. I wanted to implement UTF-8 support within the parser and, for some reason, I foolishly assumed that the code points are padded with 0's. The term variable-width encoding makes a lot more sense to me now... – sdasdadas Sep 07 '13 at 00:01
`''.join(['\\', 'n'])` doesn't give you the single-character string `'\n'`, it gives you the two-character string `'\\n'`. If you want to parse that into `'\n'`, you need to do that explicitly (e.g., by using the `unicode_escape` quasi-codec… if that's the appropriate rule for your data). – abarnert Sep 07 '13 at 00:06
@sdasdadas: Meanwhile, is the parser giving you a list of single-byte UTF-8 possibly-partial-character strings, or single-character possibly-multiple-byte UTF-8 strings, or…? Until you know what you have, you can't know how to deal with it. – abarnert Sep 07 '13 at 00:10
Sorry, I missed this last comment. Your answer solved my issue (a while back) but to answer your question: the parser was giving me single-character multi-byte strings. – sdasdadas Sep 23 '13 at 16:28
