Invisible unicode characters loaded to DB in python

Question

There are many questions and fixes for this, but none seem to work for me. I am reading a file of strings and loading each line into a DB.

In the file it looks like normal text, while in the DB it is read as a Unicode space. I tried replacing it with a space and similar options, but none worked.

For example, in the text file the string looks like:

The abrupt departure

After it is inserted into the DB, it looks like:

The abruptÂ departure

When I query the data in the DB, it looks like:

"The abrupt\xc2\xa0departure"

I tried the following:

if "\xc2\xa0"  in str: 
     str.replace('\xa0', ' ')
     str.replace('\xc2', ' ')
     print str

The above code prints the string like:

The abrupt departure

But after inserting it back into the DB, it is still the same.

Any help is appreciated.

`str.replace()` doesn't do anything to the string. – Ignacio Vazquez-Abrams Sep 29 '16 at 07:58

Harsha Biyani · Accepted Answer · 2016-09-29T08:27:32.163

1

Try this. It removes the non-ASCII characters:

>>> s = "The abruptÂ departure"
>>> s = s.decode('unicode_escape').encode('ascii','ignore')
>>> s
'The abrupt departure'

Or you can use replace as you tried, but you forgot to reassign the result to the same variable.

>>> s = "The abruptÂ departure"
>>> s = s.replace('\xc2', '').replace('\xa0','')
>>> s
'The abrupt departure'

edited Sep 29 '16 at 08:27

answered Sep 29 '16 at 08:02

Harsha Biyani


This worked, but I got The abruptdeparture, without the space in between. – user168983 Sep 29 '16 at 08:08
Which Python version are you using? – Harsha Biyani Sep 29 '16 at 08:09
It is 2.7. Does it matter? – user168983 Sep 29 '16 at 08:15
The string is not read as "The abruptÂ departure" but as "The abrupt\xc2\xa0departure". – user168983 Sep 29 '16 at 08:19

score 1 · Answer 2 · answered Sep 29 '16 at 08:26


The point is that strings are immutable, so you need to assign the return value from replace:

 s = s.replace('\xa0', ' ')
 s = s.replace('\xc2', ' ')

Also, don't use str as a variable name.
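
A minimal sketch of that point, assuming the value read from the file is the UTF-8 byte string shown in the question's query output (the variable names are illustrative only). Replacing the two-byte sequence in a single call leaves one space, whereas replacing each byte with a space separately would leave two:

```python
# Assumption: `raw` holds the UTF-8 bytes shown in the question.
raw = b"The abrupt\xc2\xa0departure"

# replace() returns a new object; without assignment the result is discarded
# and `raw` keeps its original value.
raw.replace(b"\xc2\xa0", b" ")
unchanged = raw

# Assign the return value to keep the change. Treating \xc2\xa0 as one unit
# produces a single ordinary space.
cleaned = raw.replace(b"\xc2\xa0", b" ")

print(repr(unchanged))
print(repr(cleaned))
```

Byte literals are used so the sketch behaves the same way on Python 2.7 and Python 3.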


Daniel Roseman


score 1 · Answer 3 · edited May 23 '17 at 12:33

C2A0 is the UTF-8 encoding of a "NO-BREAK SPACE". 'Â ' is what you see if your CHARACTER SET settings are inconsistent.

Doing a replace() merely masks the problem and will not help when a different funny character comes into your table.
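
To illustrate the mismatch, here is a small sketch. It assumes the file is UTF-8 and that some layer between the file and the table interprets those bytes as Latin-1; the question does not say which layer that would be, so treat this as one plausible explanation. It also shows decoding first and then normalising the no-break space as text:

```python
# Assumption: these are the UTF-8 bytes read from the file.
raw = b"The abrupt\xc2\xa0departure"

# Decoding with the wrong charset splits the pair into two characters:
# U+00C2 (displayed as A-circumflex) followed by U+00A0 (NO-BREAK SPACE).
mojibake = raw.decode("latin-1")

# Decoding with the matching charset yields a single NO-BREAK SPACE.
text = raw.decode("utf-8")

# If an ordinary space is wanted, normalise after decoding, on text.
normalised = text.replace(u"\u00a0", u" ")

print(repr(mojibake))
print(repr(text))
print(repr(normalised))
```

If the bytes are decoded consistently at every step (file, connection, table), the stray character should not appear in the first place, and the final replace becomes a matter of preference.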

Since you have not provided enough info to say what you have done correctly versus incorrectly, let me point you at two references:
