I receive the following:

value = ['\', 'n']

and my regular routine of converting to unicode and calling ord throws the error:

ord() expects a character, but string of length 2 found

It would seem that I need to join the characters within the list if len(value) > 1.

How do I go about doing this?

sdasdadas
  • see this answer: http://stackoverflow.com/a/7291240/564538 – Phillip Cloud Sep 06 '13 at 23:39
  • duplicate of: http://stackoverflow.com/q/7291120/564538 – Phillip Cloud Sep 06 '13 at 23:39
  • Can you show us what "my regular routine" looks like? Because doing what you describe, `unicode(value)`, gives you an 11-character string, not 2. (Actually, it doesn't even get that far, because you'll get a `SyntaxError` from trying to enter that `value = ['\', 'n']` line…) – abarnert Sep 06 '13 at 23:40
  • @PhillipCloud: I don't think it is. Presumably his "regular routine" is something like one of the answers to that problem, and his problem is something beyond that which I haven't figured out yet. – abarnert Sep 06 '13 at 23:42
  • In addition to showing us the code that doesn't work, please show us the actual contents of `value` (that is, copy and paste what you get if you `print` it), and the output you're hoping for. – abarnert Sep 06 '13 at 23:54
  • You may want to read the Unicode HOWTO for Python [2.x](http://docs.python.org/2/howto/unicode.html) or [3.x](http://docs.python.org/3/howto/unicode.html) as appropriate. There are also a number of blog posts out there that try to make things clearer; I don't have a specific one to recommend, but Google turns up a bunch of options. – abarnert Sep 07 '13 at 00:13
  • @abarnert You're right. I jumped the gun a bit. – Phillip Cloud Sep 07 '13 at 00:59

1 Answer

If you're trying to figure out how to treat this as a single string '\\n' that can then be interpreted as the single character '\n' according to some set of rules, like Python's unicode-escape rules, you have to decide exactly what you want before you can code it.

First, to turn a list of two single-character strings into one two-character string, just use join:

>>> value = ['\\', 'n']
>>> escaped_character = ''.join(value)
>>> escaped_character
'\\n'

Next, to interpret a two-character escape sequence as a single character, you have to know which escape rules you're trying to undo. If it's Python's Unicode escape, there's a codec named unicode_escape that does that:

>>> character = escaped_character.decode('unicode_escape')
>>> character
u'\n'

If, on the other hand, you're trying to undo UTF-8 encoding followed by Python string-escape, or C backslash escapes, or something different, you have to write something different. And given what you've said about UTF-8, I think you probably do want something different. For example, u'é'.encode('UTF-8') is the two-byte sequence '\xc3\xa9'. Just calling decode('unicode_escape') on that will give you the two-character sequence u'\u00c3\u00a9' (that is, u'Ã©'), which is not what you want.
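To make the difference concrete, here is a minimal sketch. It assumes Python 3 (the sessions in this answer are Python 2), and the variable names are made up for illustration. The last step is only one possible way to undo "UTF-8 encoding followed by backslash escaping", and it assumes the data contains no \u escapes above U+00FF.

```python
# 'é' (U+00E9) encoded as UTF-8 is two bytes.
utf8_bytes = '\u00e9'.encode('utf-8')            # b'\xc3\xa9'

# unicode_escape treats every non-escape byte as Latin-1,
# so each UTF-8 byte becomes its own character (mojibake).
wrong = utf8_bytes.decode('unicode_escape')      # two characters: U+00C3, U+00A9

# Decoding with the encoding that was actually used gives one character.
right = utf8_bytes.decode('utf-8')               # one character: U+00E9

# Assumed input: UTF-8 bytes that also contain a literal backslash + 'n'.
escaped_utf8 = '\u00e9'.encode('utf-8') + b'\\n'

# Undo the escapes first, map the resulting code points back to bytes
# via Latin-1, then decode those bytes as UTF-8.
unescaped = (escaped_utf8
             .decode('unicode_escape')
             .encode('latin-1')
             .decode('utf-8'))                   # 'é' followed by a real newline
```

The order matters: the escapes are undone on the byte level, and only then are the bytes interpreted as UTF-8.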

Anyway, now that you've got a single character, just call ord:

>>> char_ord = ord(character)
>>> char_ord
10
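For reference, the three steps above might look like this on Python 3, where str has no decode method. This is a sketch, not the only way to do it: the function name escaped_chars_to_ordinal is hypothetical, and it assumes the escaped text is pure ASCII and follows Python's unicode-escape rules.

```python
def escaped_chars_to_ordinal(chars):
    """Join single-character strings, undo Python-style escapes, return the ordinal."""
    # Step 1: ['\\', 'n'] becomes the two-character string backslash + 'n'.
    escaped = ''.join(chars)
    # Step 2: str has no .decode() on Python 3, so go through ASCII bytes first.
    # (Assumption: the escaped text contains only ASCII characters.)
    character = escaped.encode('ascii').decode('unicode_escape')
    # Step 3: ord() needs exactly one character, so fail clearly otherwise.
    if len(character) != 1:
        raise ValueError('expected one character, got %r' % character)
    return ord(character)


newline_ord = escaped_chars_to_ordinal(['\\', 'n'])   # 10
```

A list that already holds a single unescaped character, such as ['a'], passes through unchanged and yields 97.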

I'm not sure what the convert-to-unicode bit is about. If this is Python 3.x, the strings are already Unicode. If it's 2.x, and the strings are ASCII, it's guaranteed that ord(s) == ord(unicode(s)). If it's 2.x, and the strings are in some other encoding, just calling unicode on them is going to give you a UnicodeError or mojibake; you need to pass an encoding in as well, in which case you might as well use the decode method.
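The distinction between byte values and character values can be sketched as follows. This assumes Python 3, where bytes and str are separate types; the helper names code_points and byte_values are invented for this example, and UTF-8 is only the assumed default encoding.

```python
def code_points(raw, encoding='utf-8'):
    """Decode bytes with an explicit encoding, then return one integer per character."""
    return [ord(ch) for ch in raw.decode(encoding)]


def byte_values(raw):
    """Return one integer per byte, with no decoding at all."""
    # Iterating over a bytes object on Python 3 already yields integers.
    return list(raw)


raw = '\u00e9\n'.encode('utf-8')   # 'é' plus a newline: b'\xc3\xa9\n'

chars = code_points(raw)           # [233, 10]: two characters
octets = byte_values(raw)          # [195, 169, 10]: three bytes
```

For pure ASCII data the two lists are identical, which is why the conversion step makes no difference there.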

abarnert
  • Sorry, my question probably isn't very clear (and my knowledge about encoding probably isn't bang on, either). My goal is to convert the character '\n' to its UTF-8 code point and from there convert it to binary/decimal. – sdasdadas Sep 06 '13 at 23:49
  • @sdasdadas: First, where does the character `'\n'` come from in your example? Second, a single character can be 1-4 bytes in UTF-8 (up to 6 in the original, pre-RFC 3629 definition), so "its UTF-8 code point" is meaningless. And if you have a UTF-8 string, and you want to get the numerical value of each byte, just call `ord` on each byte; no need to convert to a Unicode string unless you want to get the numerical values of the Unicode characters that the bytes decode to. – abarnert Sep 06 '13 at 23:51
  • The character '\n' comes from a parser which returns the string `"\n"`. The code for `''.join(...)` solves what I initially meant to ask, so I thank you for that. I wanted to implement UTF-8 support within the parser and, for some reason, I foolishly assumed that the code points are padded with 0's. The term variable-width encoding makes a lot more sense to me now... – sdasdadas Sep 07 '13 at 00:01
  • `''.join(['\\', 'n'])` doesn't give you the single-character string `'\n'`, it gives you the two-character string `'\\n'`. If you want to parse that into `'\n'`, you need to do that explicitly (e.g., by using the `unicode_escape` quasi-codec… if that's the appropriate rule for your data). – abarnert Sep 07 '13 at 00:06
  • @sdasdadas: Meanwhile, is the parser giving you a list of single-byte UTF-8 possibly-partial-character strings, or single-character possibly-multiple-byte UTF-8 strings, or…? Until you know what you have, you can't know how to deal with it. – abarnert Sep 07 '13 at 00:10
  • Sorry, I missed this last comment. Your answer solved my issue (a while back) but to answer your question: the parser was giving me single-character multi-byte strings. – sdasdadas Sep 23 '13 at 16:28