3

I have an sqlite db that has some crazy ascii characters in it and I would like to remove them, but I have no idea how to go about doing it. I googled some stuff and found some people saying to use REGEXP with mysql, but that threw an error saying REGEXP wasn't recognized.

Here is the error I get:

sqlalchemy.exc.OperationalError: (OperationalError) Could not decode to UTF-8 column 'table_name' with text ...

Thanks for the help

imns
  • 4,996
  • 11
  • 57
  • 80
  • 2
    Are you sure you want to get rid of "crazy" characters? Learning how to deal with all unicode characters is actually kind of fun... – unutbu Aug 27 '10 at 18:18
  • so are they ASCII characters or UTF-8? Since your are using SQLAlchemy, it is already handling UTF-8 just fine, but you are probably confused about what to do with it once you get it. http://docs.python.org/howto/unicode.html – msw Aug 27 '10 at 18:20
  • 1
    ~unutbu: depends on your definition of "fun" ;) I'd certainly call it "useful" and "initially daunting", and "worthwhile" but "fun" never crossed my mind. – msw Aug 27 '10 at 18:22
  • I would love to learn how, and I hate to be douchy but I am under a tight deadline right now. Any help or advice would be much appreciated. – imns Aug 27 '10 at 18:36
  • @msw no they are non-ascii characters (i think). SQLAlchemy throws an error on the select statement. I'm not even doing anything with the characters yet. – imns Aug 27 '10 at 18:47
  • Maybe I'm being dense, but it's not even clear to me what you're asking for. Do you want to remove all non-ascii characters from the column's UTF-8 value, or all characters that are not in the printable ascii range? – PeterAllenWebb Aug 27 '10 at 19:02
  • 2
    You need to show more code. Don't tell us it threw an error, show us. – msw Aug 27 '10 at 19:24
  • The title refers to "non utf-8 characters". Is that the problem? The table is being treated as utf-8, but it is in some other encoding? – Janne Karila Dec 14 '11 at 07:56

2 Answers2

1

Well, if you really want to shoehorn a rich unicode string into a plain ascii string (and don't mind some goofs), you could use this:

import unicodedata as ud
def shoehorn_unicode_into_ascii(s):
    # This removes accents, but also other things, like ß‘’“”
    return ud.normalize('NFKD', s).encode('ascii','ignore')

For a more complete solution (with somewhat fewer goofs, but requiring a third-party module unidecode), see this answer.

Really, though, the best solution is to work with unicode data throughout your code as much as possible, and drop to an encoding only when necessary.

Community
  • 1
  • 1
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • I was actually looking for an sql statement I could use then I don't have to deal with it. I actually already tried this function: def removeNonAscii(s): return "".join(i for i in s if ord(i)<128) and I still get the error because it's happening when I do the select in SQLAlchemy I think. – imns Aug 27 '10 at 18:50
  • @bababa: Can you draw out the text with a simple SQL SELECT statement like `SELECT * from table` ? If so, it would not be hard to write a one-off script to read in the data, and use `UPDATE`s to write back ascii data... Very low-brow, but expedient. – unutbu Aug 27 '10 at 19:00
0

django.utils.encoding has a greate set of robust unicode encoding and decoding functions.

Piper Merriam
  • 2,774
  • 2
  • 24
  • 30