Zipping together unicode strings in Python

Question

I have the strings:

a = "ÀÁÂÃÈÉÊËÌÍÎÏÒÓÔÕÖÙÚÛÜ"
b = "àáâãäèéçêëìíîïòóôõöùúûüÿ"

and I want to create the string

"ÀàÁáÂâ..."

I.e., interleave the two strings character by character.

I tried the naive zip(a, b) but this didn't work. I think this is due to a problem with unicode.

Does anyone know how I can get the result I want?

@Nick, When I tried to zip then join the strings the output was `�À��Á��Â��Ã��È��É��Ê��Ë��Ì��Í��Î��Ï��Ò��Ó��Ô��Õ��Ö��Ù��Ú��Û��Ü�` — Ben Page, May 18 '12 at 09:40
@BenPage In Python 2.7, be sure you declare your strings as unicode: either prefix strings with a `u`, or use `from __future__ import unicode_literals` – Zeugma, May 18 '12 at 09:41

Zeugma · Accepted Answer · 2012-05-18T15:11:03.420

In Python 2.x, strings are not unicode by default. When dealing with unicode data, you have to do the following:

prefix string literals with the u character: a = u'ÀÁÂÃÈÉÊËÌÍÎÏÒÓÔÕÖÙÚÛÜ', or
if you want to avoid the u prefix and the modules you are working with are compatible enough, use the from __future__ import unicode_literals import so that string literals are interpreted as unicode by default
if you write unicode string literals directly in your Python code, save your .py file in utf-8 format so that the literals are correctly interpreted. Python 2.3+ will interpret the utf-8 BOM; it is also good practice to add a comment line at the beginning of the file that declares the encoding, like # -*- coding: utf-8 -*-, or
you can also keep saving the .py file in ascii, but you will need to escape the unicode characters in the literals, which can be less readable: u'ÀÁÂÃ' should become u'\xc0\xc1\xc2\xc3'
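
To see why these conditions matter, here is a minimal sketch (not part of the original answer) of what is probably happening when plain byte strings are zipped. It assumes the source text is UTF-8; the names a_text, a_bytes and mixed are only illustrative. Each accented letter takes two bytes in UTF-8, so pairing byte by byte tears every letter apart, and decoding the result yields replacement characters much like the garbled output quoted in the comments.

```python
# -*- coding: utf-8 -*-
# Sketch only: shows why interleaving undecoded UTF-8 bytes garbles the text.
a_text = u"ÀÁ"
b_text = u"àá"

# Simulate Python 2 byte strings that hold UTF-8 encoded data.
a_bytes = a_text.encode("utf-8")
b_bytes = b_text.encode("utf-8")
print(len(a_bytes))  # 4: two bytes per accented letter

# Interleave byte by byte. Slicing with [i:i+1] gives one-byte strings
# on both Python 2 and Python 3.
mixed = b"".join(a_bytes[i:i + 1] + b_bytes[i:i + 1]
                 for i in range(len(a_bytes)))

# The two-byte sequences are now split up, so decoding has to substitute
# U+FFFD replacement characters for the stray bytes.
print(repr(mixed.decode("utf-8", "replace")))

# Interleaving the decoded (unicode) strings pairs whole characters instead.
good = u"".join(x + y for x, y in zip(a_text, b_text))
print(good == u"ÀàÁá")  # True
```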

Once you fulfill those conditions, the rest is about applying algorithms on those unicode strings the same way you would work with the str version. Here is one possible solution for your problem with the __future__ import:

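# -*- coding: utf-8 -*-
from __future__ import unicode_literals

from itertools import chain

a = "ÀÁÂÃÈÉÊËÌÍÎÏÒÓÔÕÖÙÚÛÜ"
b = "àáâãäèéçêëìíîïòóôõöùúûüÿ"

# zip pairs the characters up; chain flattens the pairs back into one sequence
print ''.join(chain(*zip(a,b)))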

ÀàÁáÂâÃãÈäÉèÊéËçÌêÍëÎìÏíÒîÓïÔòÕóÖôÙõÚöÛùÜú
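
A possible variation, as a sketch rather than the answer's own code: chain.from_iterable flattens the pairs without unpacking them into arguments. The shortened strings here are only for illustration, and the u prefix is used so the snippet should behave the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from itertools import chain

a = u"ÀÁÂ"
b = u"àáâã"  # one character longer than a

# zip stops at the shorter input, so the trailing "ã" has no partner
# and does not appear in the result.
result = u"".join(chain.from_iterable(zip(a, b)))
print(result == u"ÀàÁáÂâ")  # True
```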

Further references:

PEP 263 defines the source-encoding declaration comments for non-ascii source files
PEP 3120 defines utf-8 as the default source encoding in Python 3

Great, elaborate answer. I'd add a word about `#coding=` to the bullet #3. — georg, May 18 '12 at 13:43

Josiah · Answer 2 · 2012-05-18T10:00:29.207


You have to join them up after you zip them, and also you need to define them as unicode strings:

>>> import itertools
>>> a = u"ÀÁÂÃÈÉÊËÌÍÎÏÒÓÔÕÖÙÚÛÜ"
>>> b = u"àáâãäèéçêëìíîïòóôõöùúûüÿ"
>>> zipped = itertools.izip_longest(a, b, fillvalue="")
>>> print "".join(["".join(x) for x in zipped])

ÀàÁáÂâÃãÈäÉèÊéËçÌêÍëÎìÏíÒîÓïÔòÕóÖôÙõÚöÛùÜúûüÿ

>>> zipped = itertools.izip_longest(a, b, fillvalue="")
>>> print "".join(map("".join, zipped))

ÀàÁáÂâÃãÈäÉèÊéËçÌêÍëÎìÏíÒîÓïÔòÕóÖôÙõÚöÛùÜúûüÿ
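
The same idea can be wrapped in a small helper. This is only a sketch: interleave is a made-up name, and the try/except import is an assumption that lets it run on both Python 2 (izip_longest) and Python 3 (where the function is called zip_longest).

```python
try:
    from itertools import izip_longest as zip_longest  # Python 2
except ImportError:
    from itertools import zip_longest  # Python 3


def interleave(first, second):
    """Alternate characters from both strings, keeping any unpaired tail."""
    # fillvalue pads the shorter string with empty strings, so the extra
    # characters of the longer one are appended unchanged.
    pairs = zip_longest(first, second, fillvalue=u"")
    return u"".join(x + y for x, y in pairs)


print(interleave(u"ABC", u"abcde") == u"AaBbCcde")  # True
```

Both arguments should already be unicode strings; passing undecoded byte strings would pair up bytes rather than characters.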

edited May 18 '12 at 10:00

answered May 18 '12 at 09:35


I think I made the newbie mistake of not defining them as unicode strings, and itertools.izip_longest will solve the lists being different sizes :) – Ben Page May 18 '12 at 09:45

Gandi · Answer 3 · 2012-05-18T09:40:24.617

Maybe not beautiful, but it works.

>>> n = min(len(a), len(b))  # number of characters that can be paired
>>> new_string = u""
>>> for i in range(n):
...     new_string += a[i] + b[i]
... 
>>> new_string += a[n:] + b[n:]  # append the unpaired tail; one of the two slices is always empty
>>> print new_string
ÀàÁáÂâÃãÈäÉèÊéËçÌêÍëÎìÏíÒîÓïÔòÕóÖôÙõÚöÛùÜúûüÿ

Or, using zip:

>>> a = u'ÀÁÂÃÈÉÊËÌÍÎÏÒÓÔÕÖÙÚÛÜ'
>>> b = u'àáâãäèéçêëìíîïòóôõöùúûüÿ'
>>> c = zip(a, b)
>>> new_string = u"".join([x + y for x, y in c])  # x, y so the loop does not overwrite a and b
>>> print new_string
ÀàÁáÂâÃãÈäÉèÊéËçÌêÍëÎìÏíÒîÓïÔòÕóÖôÙõÚöÛùÜú

But watch out: zip stops at the end of the shorter string, so the rest of the 'b' string is dropped because it has no pair in the 'a' string.

itertools.izip_longest will solve the lists being different sizes — Ben Page, May 18 '12 at 09:44

math · Answer 4 · 2012-05-18T09:42:38.030


This is working on my side (Python 2.x):

>>> a = unicode('ÀÁÂÃÈÉÊËÌÍÎÏÒÓÔÕÖÙÚÛÜ', 'utf-8')
>>> b = unicode('àáâãäèéçêëìíîïòóôõöùúûüÿ', 'utf-8')
>>> print ''.join([ ''.join(c) for c in zip(a, b)])
ÀàÁáÂâÃãÈäÉèÊéËçÌêÍëÎìÏíÒîÓïÔòÕóÖôÙõÚöÛùÜú

What error do you have?

edited May 18 '12 at 09:42

answered May 18 '12 at 09:37


When I tried to use the unicode function (I wasn't adding the 'utf-8' param), I got `'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)`, but if I define the string as u"..." it works. :) – Ben Page May 18 '12 at 09:45
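
The error quoted in that comment can be reproduced with a short sketch. It assumes the bytes are UTF-8 and uses bytes.decode, which exists on both Python 2 and Python 3; in Python 2, unicode(raw) without an encoding argument falls back to the ascii codec in the same way. The name raw is only illustrative.

```python
raw = b"\xc3\x80\xc3\x81"  # the UTF-8 bytes for two accented capital letters

# Decoding as ascii fails on the first byte, because 0xc3 is outside range(128).
try:
    raw.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)
print(message)

# Naming the real encoding turns the four bytes into two characters.
text = raw.decode("utf-8")
print(len(text))  # 2
```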
