
I have a sequence of strings that are Unicode code points written as hex digits, without the leading \u. For example, 00330034 is equivalent to \u0033\u0034, which yields 34.

What is the best way to convert sequences like 003300340035... to their proper values in Python?

Thanks in advance.

Reza Torkaman Ahmadi

2 Answers

Here is a one-line version of Green Cloak Guy's answer:

>>> s = '00330034'
>>> print(''.join(chr(int(x, 16)) for x in map(''.join, zip(*[iter(s)]*4))))
34
Sunitha

# function to split an iterable into evenly-sized chunks
def chunk(iterable, size):
    idx = 0
    while idx < len(iterable):
        yield iterable[idx:idx+size]
        idx += size

# define the original string
orig_string = "003300340035"
# convert to string of codepoints
unicode_str = "".join(chr(int(codepoint, 16)) for codepoint in chunk(orig_string, 4))

print(unicode_str)
# 345

That last line has several steps going on. To clarify:

  1. Separate the original string into chunks of 4 characters and iterate over them (`for codepoint in chunk(orig_string, 4)`)
  2. Convert each four-character string into an integer, assuming it's in base 16 (`int(codepoint, 16)`)
  3. Get the Unicode character with the given integer code point (`chr()`)
  4. Join all the individual Unicode characters back into a string (`"".join()`)

This approach only works if the input consists exclusively of four-hex-digit Unicode code points. Detecting other forms, if they're mixed in, is a separate problem for a separate question.
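
To make the four steps and the fixed-width assumption easier to follow, here is a minimal sketch that spells out each intermediate value and rejects input whose length is not a multiple of the chunk width. The name `decode_codepoints` and the `width` parameter are made up for illustration and are not part of the code above.

```python
def decode_codepoints(hex_string, width=4):
    """Decode a string of fixed-width hex code points (hypothetical helper)."""
    # Assumption: every code point is written with exactly `width` hex digits.
    if len(hex_string) % width != 0:
        raise ValueError(
            f"expected a multiple of {width} hex digits, got {len(hex_string)}"
        )
    # Step 1: split into fixed-width chunks, e.g. ['0033', '0034', '0035']
    chunks = [hex_string[i:i + width] for i in range(0, len(hex_string), width)]
    # Step 2: parse each chunk as a base-16 integer, e.g. [51, 52, 53]
    codepoints = [int(chunk, 16) for chunk in chunks]
    # Step 3: map each integer to its character, e.g. ['3', '4', '5']
    chars = [chr(cp) for cp in codepoints]
    # Step 4: join the characters back into one string
    return "".join(chars)


print(decode_codepoints("003300340035"))
```

Raising `ValueError` on a bad length is just one possible choice; silently ignoring or padding a trailing partial chunk would hide malformed input.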

Green Cloak Guy
  • this is a bit more complicated than you think, try it with e.g. `D83DDE00` – georg Jul 31 '19 at 16:23
  • This will break for any code points above 65535 (U+FFFF) – Brad Solomon Jul 31 '19 at 16:23
  • @georg Dealing with invalid Unicode should be a simple matter of catching the exception from any calling code. But in this case, Python appears to cope just fine - it's just garbage in, garbage out. – tripleee Jul 31 '19 at 16:52
  • @tripleee: the problem is, `D83DDE00` is perfectly valid unicode (UTF-16 to be precise). – georg Jul 31 '19 at 16:58
  • @georg This is a valid concern, but I think it's out-of-scope for the question as asked. I dunno how to detect whether a particular byte is the first half of a UTF-16 codepoint, but so long as there's any pattern to it you could do a workaround by expanding the list comprehension into a proper `for` loop and using a carry variable or something (i.e. "assume the codepoint is 4 characters unless that would be invalid, in which case append the next codepoint to it and use *that*"). – Green Cloak Guy Jul 31 '19 at 17:06
  • It's invalid if the encoding is not UTF-16, which obviously this is not. There's a separate question about handling surrogates https://stackoverflow.com/questions/38147259/how-to-work-with-surrogate-pairs-in-python – tripleee Jul 31 '19 at 17:07
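
For the `D83DDE00` case raised in the comments, one possible approach is to treat the hex digits as big-endian UTF-16 bytes and let Python's codec combine surrogate pairs. This is only a sketch, and it assumes the input really is a sequence of UTF-16 code units, which the question does not confirm. The function name `decode_hex_utf16` is made up for illustration.

```python
def decode_hex_utf16(hex_string):
    """Decode hex digits as UTF-16 big-endian code units (hypothetical helper)."""
    # Assumption: each group of 4 hex digits is one UTF-16 code unit.
    # bytes.fromhex raises ValueError on non-hex characters or an odd digit count.
    raw = bytes.fromhex(hex_string)
    # The utf-16-be codec merges a high/low surrogate pair into one code point
    # and raises UnicodeDecodeError on a lone surrogate or a truncated unit.
    return raw.decode("utf-16-be")


print(decode_hex_utf16("003300340035"))  # 345
print(decode_hex_utf16("D83DDE00") == "\U0001F600")  # True
```

For input that contains only code points below U+D800, this gives the same result as the `chr(int(..., 16))` approach.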