Extract the integer and Unicode characters from each item in a list

Question

I am working on a linguistics project (the language is Malayalam).

My list is

x= [u'1\u0d30\u0d3e\u0d2e\u0d28\u0d4d\u200d', u'5\u0d05\u0d35\u0d28\u0d4d\u200d']

I want to extract the integer and the Unicode escape sequences from each item in the list.

The expected output is

1 \u0d30\u0d3e\u0d2e\u0d28\u0d4d\u200d
5 \u0d05\u0d35\u0d28\u0d4d\u200d

First I tried to convert the first item, x[0], to ASCII:

import unicodedata
print unicodedata.normalize('NFKD',x[0]).encode('ascii','ignore')

The output is 1.

I think this happens because the Unicode characters in the list are Malayalam.

Then I tried to find the first index of "\u" like this:

x[0].index("\u")

This raised an error.

Look here for more information on the python `repr` function: http://stackoverflow.com/questions/7784148/understanding-repr-function-in-python — jayelm, Feb 25 '14 at 06:38

isedev · Accepted Answer · 2014-02-25T06:35:02.440

Each \uXXXX escape sequence represents a single Unicode character, not a sequence of separate characters in the string.
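To illustrate the point, here is a small sketch. It is written for Python 3, where every string literal is Unicode, which is an assumption on my part since the question uses Python 2's u'...' literals. The name `item` is only illustrative. It shows that each escape is collapsed to one character when the literal is parsed, so the string holds no backslash to search for.

```python
# Python 3 sketch (assumption: the question itself targets Python 2).
# Each \uXXXX escape in the literal becomes ONE character in the string.
item = '1\u0d30\u0d3e\u0d2e\u0d28\u0d4d\u200d'

print(len(item))           # 7: one digit plus six code points
print(hex(ord(item[1])))   # 0xd30

# The string contains no literal backslash, so there is nothing to index.
print('\\' in item)        # False
```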

You can get the expected output as follows:

for i in x:
    # repr() returns u'...' in Python 2; [2:-1] strips the leading u' and the trailing '
    print int(i[0]), repr(i[1:])[2:-1]

(assuming the integer has only one digit)

For the more general case, one solution is to extract the integer using a regular expression:

import re
for i in x:
    # capture the leading run of digits (assumes every item starts with a digit)
    s = re.match(r'([0-9]+)', i).group(1)
    print int(s), repr(i[len(s):])[2:-1]
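The `repr(...)[2:-1]` slicing relies on Python 2's `u'...'` representation. As a hedged sketch for Python 3 (an assumption, since the question targets Python 2), a similar split can be done with a regular expression plus the `unicode_escape` codec, which renders non-ASCII characters as `\uXXXX` escapes. The helper name `split_number_and_text` is hypothetical and not part of any library.

```python
import re

def split_number_and_text(item):
    """Hypothetical helper: return (leading integer, escaped remainder)."""
    match = re.match(r'([0-9]+)(.*)', item, re.DOTALL)
    if match is None:
        raise ValueError('item does not start with a digit: %r' % item)
    number, text = match.groups()
    # unicode_escape turns each non-ASCII code point into a \uXXXX escape.
    escaped = text.encode('unicode_escape').decode('ascii')
    return int(number), escaped

x = ['1\u0d30\u0d3e\u0d2e\u0d28\u0d4d\u200d', '5\u0d05\u0d35\u0d28\u0d4d\u200d']
for item in x:
    number, escaped = split_number_and_text(item)
    print(number, escaped)
```

Unlike the slicing approach, this raises a clear error for an item with no leading digits instead of failing on a `None` match.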

Tanveer Alam · Answer 2 · 2014-02-25T07:01:01.403


>>> x= [u'1\u0d30\u0d3e\u0d2e\u0d28\u0d4d\u200d', u'5\u0d05\u0d35\u0d28\u0d4d\u200d']  
>>> res = [ (i[:1], i[1:]) for i in x ]
>>> res
[(u'1', u'\u0d30\u0d3e\u0d2e\u0d28\u0d4d\u200d'), (u'5', u'\u0d05\u0d35\u0d28\u0d4d\u200d')]

>>> for i in res:
...     print i[0], repr(i[1])
... 
1 u'\u0d30\u0d3e\u0d2e\u0d28\u0d4d\u200d'
5 u'\u0d05\u0d35\u0d28\u0d4d\u200d'

edited Feb 25 '14 at 07:01

answered Feb 25 '14 at 06:47


The representation of `res` in the interpreter is the output OP wants, but it's not what you get when you `print` it. You need to use the `repr` function to obtain the object representation. – jayelm Feb 25 '14 at 06:55
Yes, I get it: `print` outputs the actual Unicode object, so we need the `repr` function to get the escaped form. Thanks :) – Tanveer Alam Feb 25 '14 at 06:59
