I am trying to replace or strip certain characters from the strings in this list so that I can insert them into a database which does not allow those characters.

info=[[u'\xa0Buffalo\u2019s League of legends ...', '2012-09-05'], [u' \xa0RCKIN 0 - 1 WITHACK.nq\xa0  ', u'\xa0Buffalo\u2019s League of legends ...', '2012-09-05']]

I used this code

info = [[x.replace(u'\xa0', u'') for x in l] for l in info]
info = [[y.replace('\u2019s', '') for y in o] for o in info]

The first line worked but the second one did not. Any suggestions?

Russia Must Remove Putin
user3386406
  • I would also try to figure out why you are getting such a weird string mixing raw bytes and unicode codepoints. – Paulo Bu Mar 06 '14 at 15:08
  • What you should do is bite the bullet and learn how to handle unicode by decoding it when you read the string and then encoding it when you are ready to send it to your database. One place to start is here: http://stackoverflow.com/questions/2365411/python-convert-unicode-to-ascii-without-errors – PyNEwbie Mar 06 '14 at 15:09

2 Answers

Drop the second line and do:

info = [[x.encode('ascii', 'ignore')  for x in l] for l in info]

and see if the results are acceptable. This will attempt to convert all the unicode to ascii and drop any characters that fail to convert. You just want to be sure that if you lose an important unicode character, it's not a problem.

>>> info=[[u'\xa0Buffalo\u2019s League of legends ...', '2012-09-05'], [u' \xa0RCKIN 0 - 1 WITHACK.nq\xa0  ', u'\xa0Buffalo\u2019s League of legends ...', '2012-09-05']]
>>> info = [[x.encode('ascii', 'ignore')  for x in l] for l in info]
>>> info
[['Buffalos League of legends ...', '2012-09-05'], [' RCKIN 0 - 1 WITHACK.nq  ', 'Buffalos League of legends ...', '2012-09-05']]
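
For readers on Python 3, the same idea can be sketched as below. This is an assumption-laden translation, not part of the original session: in Python 3 `encode` returns `bytes`, so the hypothetical helper `to_ascii` decodes the result back to `str` before it is used.

```python
# Python 3 sketch: str is already Unicode, so the u'' prefix is optional.
info = [
    ['\xa0Buffalo\u2019s League of legends ...', '2012-09-05'],
    [' \xa0RCKIN 0 - 1 WITHACK.nq\xa0  ',
     '\xa0Buffalo\u2019s League of legends ...', '2012-09-05'],
]


def to_ascii(text):
    """Drop every non-ASCII character and return a str (not bytes).

    to_ascii is a hypothetical helper name used only for this example.
    """
    return text.encode('ascii', 'ignore').decode('ascii')


cleaned = [[to_ascii(x) for x in row] for row in info]
print(cleaned)
```

Note that this drops only the curly apostrophe and keeps the trailing "s", exactly as in the session above.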

What's going on:

You have data in your Python program that's Unicode (and that's good.)

>>> u = u'\u2019'

Best practice, for interoperability, is to encode Unicode strings as UTF-8 when writing them out. These are the bytes you should be storing in your database:

>>> u.encode('utf-8')
'\xe2\x80\x99'
>>> utf8 = u.encode('utf-8')
>>> print utf8
’

When you read those bytes back into your program, decode them:

>>> utf8.decode('utf8')
u'\u2019'
>>> print utf8.decode('utf8')
’

If your database can't handle utf-8 then I would consider getting a new database.
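
The encode-on-write, decode-on-read round trip can also be sketched for Python 3, where text (`str`) and bytes are distinct types. The variable names `stored` and `restored` are illustrative only; how the bytes actually reach your database depends on the driver, which is not shown here.

```python
# Python 3 sketch of the UTF-8 round trip described above.
text = 'Buffalo\u2019s League of legends ...'

# Encode when writing out: these are the bytes that would be stored.
stored = text.encode('utf-8')

# Decode when reading back in: the original Unicode text is recovered.
restored = stored.decode('utf-8')

print(repr(stored))
print(restored == text)
```

Because UTF-8 can represent every Unicode code point, nothing is lost on the way through, unlike the `'ascii', 'ignore'` approach.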

Russia Must Remove Putin

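In the second line, '\u2019s' has no u prefix, so Python 2 treats it as a byte string in which \u is not an escape sequence. It is a literal backslash followed by u2019s, which never matches the curly apostrophe. Prepend u to that argument in the replace call, like this: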

print [[y.replace(u'\u2019s', '') for y in o] for o in info]

Output

[[u'Buffalo League of legends ...', u'2012-09-05'],
 [u' RCKIN 0 - 1 WITHACK.nq  ',
  u'Buffalo League of legends ...',
  u'2012-09-05']]
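
To see why the unprefixed literal never matched, here is a small Python 3 sketch. It assumes that a Python 3 raw string `r'\u2019s'` is a fair stand-in for the Python 2 byte string `'\u2019s'`, since both contain a literal backslash rather than the escaped character.

```python
# Python 3 sketch: a raw string mimics the Python 2 literal without the u prefix.
literal = r'\u2019s'   # backslash, u, 2, 0, 1, 9, s
escaped = '\u2019s'    # RIGHT SINGLE QUOTATION MARK followed by s

title = 'Buffalo\u2019s League of legends ...'

# The seven-character literal does not occur in the title, so nothing changes.
unchanged = title.replace(literal, '')

# The two-character escaped form does occur, so it is removed.
fixed = title.replace(escaped, '')

print(len(literal), len(escaped))
print(fixed)
```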

In fact, you can chain the replace calls, like this:

[[x.replace(u'\xa0', '').replace(u'\u2019s', '') for x in l] for l in info]
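
If the list of unwanted substrings grows, the chained calls can be folded into a loop. The following is a minimal Python 3 sketch; `strip_all` and its `unwanted` parameter are hypothetical names introduced for illustration.

```python
# Python 3 sketch: apply several replacements in sequence.
info = [
    ['\xa0Buffalo\u2019s League of legends ...', '2012-09-05'],
    [' \xa0RCKIN 0 - 1 WITHACK.nq\xa0  ',
     '\xa0Buffalo\u2019s League of legends ...', '2012-09-05'],
]


def strip_all(text, unwanted=('\xa0', '\u2019s')):
    """Remove each substring in `unwanted` from `text`, in order."""
    for piece in unwanted:
        text = text.replace(piece, '')
    return text


cleaned = [[strip_all(x) for x in row] for row in info]
print(cleaned)
```

This is equivalent to the chained form for the two substrings above and only saves typing when there are more of them.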
thefourtheye