
I am working on a linguistics project (the language is Malayalam).

My list is:

x = [u'1\u0d30\u0d3e\u0d2e\u0d28\u0d4d\u200d', u'5\u0d05\u0d35\u0d28\u0d4d\u200d']

I want to extract the integer and the Unicode escape sequences from each item in the list.

The expected output is:

1 \u0d30\u0d3e\u0d2e\u0d28\u0d4d\u200d
5 \u0d05\u0d35\u0d28\u0d4d\u200d

First I tried to convert the first item, x[0], to ASCII:

import unicodedata
print unicodedata.normalize('NFKD', x[0]).encode('ascii', 'ignore')

The output is 1.

I think this happens because the Unicode characters in the list are Malayalam, so nothing but the digit survives the conversion to ASCII.

Then I tried to find the first index of "\u" like this:

x[0].index("\u")

This raised an error.

isedev
user3251664
  • See here for more information on Python's `repr` function: http://stackoverflow.com/questions/7784148/understanding-repr-function-in-python – jayelm Feb 25 '14 at 06:38

2 Answers


Each \uXXXX escape sequence represents a single Unicode character; the backslash, the "u" and the hex digits are not separate characters in the string, so there is no "\u" substring to search for.
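As a quick check of this point, the following sketch inspects the first list item from the question. The variable names are only illustrative, and the code is written so that it should run on both Python 2 and Python 3.

```python
# First item from the question: one digit followed by six Unicode characters.
s = u'1\u0d30\u0d3e\u0d2e\u0d28\u0d4d\u200d'

length = len(s)                # each \uXXXX escape counts as one character
has_backslash = u'\\' in s     # no literal backslash is stored in the string
first_code_point = ord(s[1])   # the character right after the digit

print('%d %s %s' % (length, has_backslash, hex(first_code_point)))
```

The length is 7 rather than the 37 characters typed in the source literal, which is why searching the string for "\u" cannot succeed.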

You can get the expected output as follows:

for i in x:
    print int(i[0]), repr(i[1:])[2:-1]

(assuming the integer has only one digit)

For the more general case, one solution is to extract the integer using a regular expression:

import re
for i in x:
    s = re.match('([0-9]+)', i).group(1)
    print int(s), repr(i[len(s):])[2:-1]
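A variant of the regular-expression approach, as a sketch only: it swaps the `repr(...)[2:-1]` slice for the `unicode_escape` codec, on the assumption that you may also want the code to run on Python 3, where `repr` of a string has no `u` prefix and does not escape these characters. The function name `split_number_and_escapes` is hypothetical.

```python
import re

def split_number_and_escapes(item):
    """Return the leading integer and the rest of the item as escape text."""
    match = re.match(u'([0-9]+)(.*)', item, re.DOTALL)
    if match is None:
        raise ValueError('item does not start with a digit: %r' % (item,))
    number, rest = match.groups()
    # unicode_escape turns each non-ASCII character into its backslash-u form.
    escaped = rest.encode('unicode_escape').decode('ascii')
    return int(number), escaped

x = [u'1\u0d30\u0d3e\u0d2e\u0d28\u0d4d\u200d', u'5\u0d05\u0d35\u0d28\u0d4d\u200d']
for item in x:
    number, escaped = split_number_and_escapes(item)
    print('%d %s' % (number, escaped))
```

Note that `unicode_escape` also escapes backslashes and some control characters, so the result is meant for display rather than for round-tripping arbitrary text.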
isedev
>>> x= [u'1\u0d30\u0d3e\u0d2e\u0d28\u0d4d\u200d', u'5\u0d05\u0d35\u0d28\u0d4d\u200d']  
>>> res = [ (i[:1], i[1:]) for i in x ]
>>> res
[(u'1', u'\u0d30\u0d3e\u0d2e\u0d28\u0d4d\u200d'), (u'5', u'\u0d05\u0d35\u0d28\u0d4d\u200d')]

>>> for i in res:
...     print i[0], repr(i[1])
... 
1 u'\u0d30\u0d3e\u0d2e\u0d28\u0d4d\u200d'
5 u'\u0d05\u0d35\u0d28\u0d4d\u200d'
Tanveer Alam
  • The representation of `res` in the interpreter is the output OP wants, but it's not what you get when you `print` it. You need to use the `repr` function to obtain the object representation. – jayelm Feb 25 '14 at 06:55
  • Yes, I get it: if we use `print`, it prints the actual Unicode object, so we need to use the `repr` function for that. Thanks :) – Tanveer Alam Feb 25 '14 at 06:59