Get proper list from list of unicode list

Question

I have a list containing a single unicode string that itself looks like a list.

my_list = [u'[James, Williams, Kevin, Parker, Alex, Emma, Katie\xa0, Annie]']

I want a list that I can iterate over, such as:

name_list = [James, Williams, Kevin, Parker, Alex, Emma, Katie, Annie]

I have tried several possible solutions given here, but none of them worked in my case.

# Tried
name_list =  name_list.encode('ascii', 'ignore').decode('utf-8')

#Gives unicode return type

# Tried
ast.literal_eval(name_list)

#Gives me invalid token error

If you are only just learning Python, you should ignore Python 2. The currently supported and recommended version of the language is Python 3. — tripleee, Nov 27 '18 at 07:35
`name_list = [James, Williams...` is illegal in Python. What you want is `name_list = ['James', 'Williams'...` — DYZ, Nov 27 '18 at 07:39
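A minimal sketch of why the `ast.literal_eval` attempt fails, assuming Python 3: `literal_eval` only accepts literals (quoted strings, numbers, lists of those, and so on), so bare names like `James` are rejected, and the non-breaking space is not valid source text either. The helper `try_literal_eval` is a hypothetical name used only for this illustration.

```python
import ast

raw = u'[James, Williams, Kevin, Parker, Alex, Emma, Katie\xa0, Annie]'

def try_literal_eval(text):
    """Return the parsed value, or the exception that literal_eval raised."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError) as exc:
        return exc

# Bare names such as James are not literals, so this is rejected.
bare_names = try_literal_eval(u'[James, Williams]')

# The string from the question is rejected as well.
original = try_literal_eval(raw)

# Quoted names are literals and parse without error.
quoted = try_literal_eval(u"['James', 'Williams']")

print(type(bare_names).__name__)
print(type(original).__name__)
print(quoted)
```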

score 2 · Accepted Answer · answered Nov 27 '18 at 07:42

First, a list does not have an encode method; string methods have to be applied to an item in the list, not to the list itself.

Second, if you want to normalize the string, you can use the normalize function from Python's unicodedata module. With the NFKD form it converts the non-breaking space '\xa0' into an ordinary space, which strip() then removes, and it normalizes other compatibility characters in the same way.

Then, instead of using eval, which is generally unsafe, use a list comprehension to build the list:

import unicodedata

li = [u'[James, Williams, Kevin, Parker, Alex, Emma, Katie\xa0, Annie]']
inner_li = unicodedata.normalize("NFKD", li[0])  # normalize the string inside the list, not the list itself

# drop the surrounding brackets, split on commas, and trim whitespace from each name
new_li = [i.strip() for i in inner_li[1:-1].split(',')]
new_li
>> ['James', 'Williams', 'Kevin', 'Parker', 'Alex', 'Emma', 'Katie', 'Annie']

In your expected output, the names are written as bare identifiers, which Python treats as variables; unless they were defined beforehand, that line raises a NameError.
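The steps in this answer can be wrapped in a small reusable function. This is only a sketch under the assumption that the input always has the shape `[name, name, ...]`; `parse_name_list` is a hypothetical helper name, not something from the question.

```python
import unicodedata

def parse_name_list(text):
    """Split a string like u'[A, B, C]' into a list of cleaned names."""
    # NFKD maps compatibility characters such as U+00A0 to plain equivalents.
    normalized = unicodedata.normalize("NFKD", text)
    # Drop surrounding whitespace and the enclosing brackets, if present.
    body = normalized.strip().strip("[]")
    # Split on commas, trim each piece, and skip pieces that are empty.
    return [part.strip() for part in body.split(",") if part.strip()]

li = [u'[James, Williams, Kevin, Parker, Alex, Emma, Katie\xa0, Annie]']
names = parse_name_list(li[0])

# The result is an ordinary list of strings that can be iterated.
for name in names:
    print(name)
```

Skipping empty pieces means an input of `[]` gives an empty list rather than a list holding one empty string.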

score 0 · Answer 2 · bunbun · 2018-11-27T08:38:49.930

import unicodedata

lst = [u'[James, Williams, Kevin, Parker, Alex, Emma, Katie\xa0, Annie]']
lst = unicodedata.normalize("NFKD", lst[0])  # lst is now the normalized string; \xa0 has become a regular space
lst2 = lst[1:-1].split(", ")  # drop the opening and closing brackets, then split on ", "
print(lst2)

The output will be (note the trailing space in "Katie "):

["James", "Williams", "Kevin", "Parker", "Alex", "Emma", "Katie ", "Annie"]

If you want to remove leading and trailing whitespace from every name:

lst3 = [i.strip() for i in lst2]
print(lst3)

The output will be:

["James", "Williams", "Kevin", "Parker", "Alex", "Emma", "Katie", "Annie"]
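The trailing space in "Katie " comes from the normalization step. A short check, assuming Python 3, of what NFKD does to the non-breaking space and why the extra `strip()` is still needed:

```python
import unicodedata

raw = u'Katie\xa0'
normalized = unicodedata.normalize("NFKD", raw)

# NFKD replaces U+00A0 (no-break space) with an ordinary space U+0020,
# so the character is converted rather than deleted.
became_plain_space = (normalized == u'Katie ')

# strip() then removes that ordinary space.
stripped = normalized.strip()

print(repr(normalized))
print(became_plain_space)
print(repr(stripped))
```

Because the split uses the exact separator ", ", any item followed by a comma without a space would not be split; splitting on "," and stripping each item is more forgiving.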

edited Nov 27 '18 at 08:38

answered Nov 27 '18 at 07:34

bunbun

tried above lst2 = list[0][1:-1].split(", "), but it gives an error of "list index out of range". – Rachel Nov 27 '18 at 07:50
Yes i tried the same.. i have used proper variable name. – Rachel Nov 27 '18 at 07:52
sorry it worked now.!! thank you. Not sure why for first run it gave error. But it gives me a list of unicode names, [u'James', u'Williams', u'Kevin', u'Parker', u'Alex', u'Emma', u'Katie\xa0', u'Annie']. While printing when it comes to 'Katie\xa0' it throws UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' – Rachel Nov 27 '18 at 08:27
@Rachel you can use `unicodedata.normalize` as BernardL mentioned. `lst[0] = unicodedata.normalize("NFKD", lst[0])` – bunbun Nov 27 '18 at 08:33
I've added more explanation – bunbun Nov 27 '18 at 08:39

score 0 · Answer 3 · answered Nov 27 '18 at 07:42

This is a good application for regular expressions:

import re
# my_list is the list from the question
body = re.findall(r"\[\s*(.+)\s*]", my_list[0])[0]  # extract the text between the brackets
names = re.split(r"\s*,\s*", body)  # split on commas, discarding whitespace around them
#['James', 'Williams', 'Kevin', 'Parker', 'Alex', 'Emma', 'Katie', 'Annie']
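A slightly more defensive variant of the same idea, offered as a sketch rather than a drop-in replacement: `extract_names` is a hypothetical helper that returns an empty list when no bracketed section is found, instead of raising IndexError on `[0]`. The `re.UNICODE` flag is redundant for str patterns in Python 3; in Python 2 it is what makes `\s` match `\xa0`.

```python
import re

def extract_names(text):
    """Return the comma-separated items inside the first [...] in text."""
    match = re.search(r"\[(.*?)\]", text, flags=re.UNICODE | re.DOTALL)
    if match is None:
        return []  # no bracketed section found
    body = match.group(1).strip()
    if not body:
        return []  # brackets were empty
    # Split on commas and swallow any whitespace around them.
    return re.split(r"\s*,\s*", body, flags=re.UNICODE)

my_list = [u'[James, Williams, Kevin, Parker, Alex, Emma, Katie\xa0, Annie]']
names = extract_names(my_list[0])
print(names)
```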

DYZ

Um, why is this a good application for regular expressions? If the input is regular in the first place, just `names = value.strip('[]').split(', ')` – tripleee Nov 27 '18 at 07:43
@tripleee For starters, your solution does not remove '\xa0'. As a follow-up, regular expressions take care of all possible stray spaces before and after commas and brackets. Finally, you are welcome to post your answer separately. – DYZ Nov 27 '18 at 07:45
If the input *isn't* regular, why is `\xa0` something we specifically worry about? How about other junk and typos? Anyway, the question is just too unclear to deserve answering IMHO. – tripleee Nov 27 '18 at 07:47
@tripleee Firstly, because the OP explicitly requested to have it removed. Secondly, because `\xa0` _is_ a whitespace, and my solution does remove all white spaces. – DYZ Nov 27 '18 at 07:49
