-3

When i try to enter data in a text field in Greek letters my wsgi script saves that data as jibberish in the MySQL database and i dont know why. Here is the relative code when the data is about to be posted via form method:

pdata = pdata + '''
<form methods="POST" enctype="multipart/form-data" action="%s">
    <tr>
            <td> <center>   <input type="text"  name="task"     size=50>    </td>
            <td> <center>   <input type="text"  name="price"    size=5>     </td>
            <td> <center>   <input type="text"  name="lastvisit">           </td>
        </table><br><br>
        <td>    <input type="image" src="/static/img/submit.gif" name="update" value="Ενημέρωση!">  </td>
    </tr>
</form>
''' % app.get_url( '/update/<name>', name=name )


pdata = pdata + "<meta http-equiv='REFRESH' content='200;%s'>" % app.get_url( '/' )
return pdata

And here is the relative callback function who tries to enter the posted-form data into MySQL database.

@app.route( '/update/<name>' )

def update( name ):

pdata = ''

task = request.query.get('task')
price = request.query.get('price')
lastvisit = request.query.get('lastvisit')


# check if date entered as intented, format it properly for MySQL
lastvisit = datetime.strptime(lastvisit, '%d %m %Y').strftime('%Y-%m-%d')


if( ( task and len(task) <= 200 ) and ( price and price.isdigit() and len(price) <= 3 ) and lastvisit != "error" ):
    # find the requested client based on its name
    cur.execute('''SELECT ID FROM clients WHERE name = %s''', name )
    clientID = cur.fetchone()[0]

    try:
        # found the client, save primary key and use it to issue hits & money UPDATE
        cur.execute('''UPDATE clients SET hits = hits + 1, money = money + %s WHERE ID = %s''', ( int(price), clientID ))

        # update client profile by adding a new record
        cur.execute('''INSERT INTO jobs (clientID, task, price, lastvisit) VALUES (%s, %s, %s, %s)''', ( clientID, task, price, lastvisit ))
    except:
        cur.rollback()

I cannot understand why the data is stored into database as jibberish instead of proper utf-8. Also trying to use utf-8 encoding type didn't work either.

<form methods="POST" enctype="utf-8" action="%s">
  • Dumb question, but is the column in the DB set up in a such a way as to accept UTF-8? It could also be a back end issue instead – SomeStudent May 28 '19 at 13:39
  • Yes MySQL Tables and Columns are configured to be utf8_general_ci. – Νικόλαος Βέργος May 28 '19 at 13:44
  • Can you show an example of the data in python, and the resulting corrupted data in MySQL? `repr` of both would be good. – snakecharmerb May 28 '19 at 14:01
  • Sure. The html form data to be posted is "αυτή είναι μια δοκιμή" and the end result inside database is "αÏÏή είναι μια δοκιμή" – Νικόλαος Βέργος May 28 '19 at 14:13
  • 1
    It sounds like MySQL is not properly configured. You can see how your text is being mis-coded by pasting it here, accepting the defaults and looking at the output: http://string-functions.com/encodedecode.aspx You can also glean some info from these other answers: https://stackoverflow.com/questions/6202726/writing-utf-8-string-to-mysql-with-python and https://stackoverflow.com/questions/4404768/cant-insert-russian-text-into-mysql-database – jimf May 28 '19 at 14:19
  • After visiting that link i tried to `utf8` the string "αυτή είναι μια δοκιμή" and decode as `iso-8859-1` produced "αÏÏή είναι μια δοκιμή" which makes me wonder why the string is converted in latin-iso when about to be stored into the db. MySQL Tables and Columns are configured to be `utf8_general_ci` so how latin-iso gets into way? – Νικόλαος Βέργος May 28 '19 at 14:26
  • After trying encoding/decoding i noticed that the only way for tha data to be passed correctly to the database is to first encode and then deocde it like so: `name = request.query.get('name').encode('iso-8859-1').decode('utf-8')` Can somebody tell me WHY all the html form fileds have to first follow the above process in order for that posted data to be passed correctly into the database? – Νικόλαος Βέργος May 28 '19 at 16:04
  • Looking at the docs for Bottle, it seems to do some unwarranted encoding fuckery when parsing query strings. Trying `request.query.getunicode` instead of `request.query.get` doesn't require of me to encode and then decode the html form fields prior to print them or inserting them to the database. So, i guess this is a Bottle bug?! – Νικόλαος Βέργος May 28 '19 at 16:40
  • 1
    Is the data saved correctly if you use the dot-notation suggested in the docs? For example `request.query.task` instead of `request.query.get('task')`. – snakecharmerb May 28 '19 at 17:28
  • Yes indeed! It appears as if i use `task= request.query.getunicode('task')` has the same correct functionality as `task= request.query.task` which is even simpler to write. – Νικόλαος Βέργος May 28 '19 at 17:50
  • Did it work out for you? – jimf May 29 '19 at 08:48

2 Answers2

1

According to wsgi_mod documentation, the WSGIDaemonProcess default encoding is ASCII. Greek characters are not included in the ASCII and your input is not decoded properly. If you want to allow Greek characters, you have to use either UTF-8 or iso-8859-1. Usually servers are deamons that are initiated by the init system and 99% of the times still uses ASCII as the default encoding. When developing or debugging you don't normally face these issues as the python scripts inherit the environment of the current user which typically uses UTF-8.

$env
.....
LANG=en_GB.UTF-8
.....

Quoting from the wsgi_mod for apache:

lang=locale Set the current language locale. This is the same as having set the LANG environment variable. You will need to set this on many Linux systems where Apache when started up from system init scripts uses the default C locale, meaning that the default system encoding is ASCII. Unless you need a special language locale, set this to en_US.UTF-8. Whether the lang or locale option works best can depend on the system being used. Set both if you aren’t sure which is appropriate.

locale=locale Set the current language locale. This is the same as having set the LC_ALL environment variable. You will need to set this on many Linux systems where Apache when started up from system init scripts uses the default C locale, meaning that the default system encoding is ASCII. Unless you need a special language locale, set this to en_US.UTF-8. Whether the lang or locale option works best can depend on the system being used. Set both if you aren’t sure which is appropriate.

georgeok
  • 5,321
  • 2
  • 39
  • 61
1

The html form data to be posted is "αυτή είναι μια δοκιμή" and the end result inside database is "αÏÏή είναι μια δοκιμή"

However, "αυτή είναι μια δοκιμή" is apparently Invalid UTF-8 because Byte at position 38 (ή) indicates it's a two-byte UTF-8 character but only 1 byte follows (reference).

If this is exactly the data that is passed to the code; then you need to check and confirm that the HTML form is submitting data in the correct UTF-8 format.

<form accept-charset='UTF-8'>

Assuming your input string is correctly UTF-8 encoded, your output string "αÏÏή είναι μια δοκιμή" is UTF-7 or more probably ISO-8859-1 encoded (reference).

Therefore the issue may be the transport mechanism (as defined above; in the HTML form submission) or the database storage encoding.

Yes MySQL Tables and Columns are configured to be utf8_general_ci.

This may also be a problem. MySQL utf8_ is NOT full UTF-8 (wat?!) as it is 3-byte instead of 4-byte; therefore if you have a 4-byte UTF-8 character stored, it will offset all later character bytes and make them look like rubbish.

Solution:

Update your MySQL columns and all collations to utf8mb4_unicode_ci

Also check to ensure your MySQL transport mechanism is using utf8mb4_ as well.

And Please Read This

Community
  • 1
  • 1
Martin
  • 22,212
  • 11
  • 70
  • 132