5

I am a bit stuck here. I have this code, which unescapes HTML entities in the text and encodes the result as UTF-8:

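import HTMLParser  # Python 2 module

def clean_text(text):
    # Remove newlines, collapse remaining whitespace runs into single spaces,
    # unescape HTML entities, swap ';' for ',' and return UTF-8 encoded bytes.
    htmlparser = HTMLParser.HTMLParser()
    return htmlparser.unescape(
        ' '.join(text.replace('\n', '').split())
    ).replace(';', ',').encode('utf-8').strip()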

I am using MySQL (God save me from it!).

This code runs in two projects. In the first project it works without problems. In the other project, the string is saved like this:

Die Verbindungen zwischen Dinosauriern und VÃ¶geln immer stÃ¤rker

It should be

Die Verbindungen zwischen Dinosauriern und Vögeln immer stärker

I am using Django 1.7 and Python 2.7.9 in both projects.

What am I missing? The MySQL collation is utf8_general_ci and the charset is utf8. Both MySQL databases have the same settings.

It would be a miracle to solve this issue... I'll give a warm hug and a kiss to anyone who can help me debug this thing.

doniyor

2 Answers

0

VÃ¶geln --> Vögeln is an example of Mojibake.

  • The bytes you have in the client are correctly encoded in utf8 (good).
  • You connected with SET NAMES latin1 (or set_charset('latin1') or ...), probably by default. (It should have been utf8.)
  • The column in the tables may or may not have been CHARACTER SET utf8, but it should have been that.
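The mechanism in the list above can be reproduced in a few lines. This is only a sketch of the general effect, not a trace of what the asker's stack actually did: it uses Python's `latin-1` codec as a stand-in for MySQL's `latin1` (which is closer to cp1252, though the two agree for the bytes involved here), and the variable names are made up for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

original = "Vögeln"

# Step 1: the client encodes the text as UTF-8 (this part is correct).
utf8_bytes = original.encode("utf-8")

# Step 2: the bytes are misread as latin1, one character per byte.
# Assumption: Python's latin-1 codec approximates MySQL's latin1 here.
mojibake = utf8_bytes.decode("latin-1")

# Undoing the misreading recovers the text, provided no bytes were lost.
repaired = mojibake.encode("latin-1").decode("utf-8")

print(repr(mojibake))
print(repaired == original)
```

If a stored value round-trips like this, it suggests the data was misinterpreted once on the way in, which points at the connection charset rather than at the column definition.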

Perhaps useful: "Django character latin1 mysql" and "Incorrect string value in python+django+Mysql"

Checklist for Python:

  • # -*- coding: utf-8 -*- -- (for literals in code)
  • charset='utf8' in connect() call -- Is that buried in bottle_mysql.Plugin? (Note: Try 'utf-8' and 'utf8')
  • Text encoded in utf8.
  • No need for encode() or decode() if you are willing to accept utf8 everywhere.
  • u'...' for literals
  • <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> near start of html page
  • Content-Type: text/html; charset=UTF-8 (in HTTP response header)
  • header('Content-Type: text/html; charset=UTF-8'); (in PHP to get that response header)
  • CHARACTER SET utf8 COLLATE utf8_general_ci on column (or table) definition in MySQL.
  • UTF-8 all the way through
  • Use MySQL Connector/Python instead of pyodbc and MySQL Connector/ODBC
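For a Django project, the `charset='utf8'` item from the checklist would normally go into the database settings rather than a direct `connect()` call. The following is a minimal sketch, assuming the stock `django.db.backends.mysql` backend, where the `OPTIONS` dict is handed on to the driver when connecting; the database name, user, password and host are placeholders, not values from the question.

```python
# settings.py (sketch; NAME, USER, PASSWORD and HOST are placeholders)
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "example_db",
        "USER": "example_user",
        "PASSWORD": "example_password",
        "HOST": "localhost",
        "OPTIONS": {
            # Ask the driver to talk to the server in utf8
            # instead of relying on whatever the default is.
            "charset": "utf8",
        },
    }
}
```

Comparing this block between the working and the broken project is a cheap first check, since the two databases are reported to be configured identically.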

(@DanielRoseman -- Have I stated anything incorrectly?)

Rick James
-1

At the top of your file, declare the encoding with `# coding: utf-8` and it will work like a charm.

Tarun Behal
  • Does it really read `# coding: utf-8`? I only know `# -*- coding: utf-8 -*-`, and I already have that in the file. – doniyor Dec 14 '15 at 11:08
  • I used your code and specified the coding and it worked. :) – Tarun Behal Dec 14 '15 at 11:09
  • I think this is an ambiguous answer to an ambiguous question. – bgusach Dec 14 '15 at 11:12
  • **Please** don't randomly recommend this. The coding declaration only affects literal text within the code itself; this question is asking about retrieving text from the database, where a coding declaration will have no effect at all. – Daniel Roseman Dec 14 '15 at 12:34