Issue with python encoding and json.dumps()

Question

I want to store heading tags in MySQL. The headings come from pages in different languages (e.g. English, Persian, Arabic, etc.). For example, my string should look like this:

{"h1": "زبان فارس - english"}

But when I store it in my database, the Unicode text changes to something like this:

{"h1": "\u0628\u0631\u062e\u0648\u0631\u062f"}

My Python 3 code is:

    data = {}
    if not soup.find('h1'):
        h1 = ""
    else:
        heading_flag = 1
        h1 = (soup.find('h1').text).strip()
        " \n \t".join(h1.split())
        data['h1']="{}".format(h1)
    if not soup.find('h2'):
        h2 = ""
    else:
        h2 = (soup.find('h2').text).strip()
        " \n \t".join(h2.split())
        data['h2']="{}".format(h2)

    if not soup.find('h3'):
        h3 = ""
    else:
        heading_flag = 1
        h3 = (soup.find('h3').text).strip()
        " \n \t".join(h3.split())
        data['h3']="{}".format(h3)

    if not soup.find('h4'):
        h4 = ""
    else:
        heading_flag = 1
        h4 = (soup.find('h4').text).strip()
        " \n \t".join(h4.split())
        data['h4']="{}".format(h4)

    if not soup.find('h5'):
        h5 = ""
    else:
        heading_flag = 1
        h5 = (soup.find('h5').text).strip()
        " \n \t".join(h5.split())
        data['h5']="{}".format(h5)

    if not soup.find('h6'):
        h6 = ""
    else:
        heading_flag = 1
        h6 = (soup.find('h6').text).strip()
        " \n \t".join(h6.split())
        data['h6']="{}".format(h6)

    if heading_flag ==1:
        page_heading = json.dumps(data)
    else:
        page_heading = ""

    page_content(initUrl[0], page_title, page_desc, page_heading)

My problem is related to the data variable: when I pass soup.find('h6').text directly as page_heading, it is stored with the correct encoding, and the string in the MySQL database looks like (زبان فارس - english) rather than (\u0628\u0631\u062e\u0648\u0631\u062f). I tried encode('utf8'), but it didn't help. I would appreciate any help.

Update: My function to save into db:

    def page_content(link_id, page_title, page_desc, page_heading):
        insQuery="INSERT IGNORE INTO ex_ctnt(cw_id, c_title, c_meta_desc, c_heading) VALUES(%s, %s, %s, %s)"
        if ((len(page_title)>0)):
            connection = pymysql.connect(host="localhost", user="root", passwd="kiuhddh87d83gfgfg", db="hiihh8y929g2")
            myquery = connection.cursor()
            myquery.execute(insQuery,(link_id, page_title, page_desc, page_heading))
            connection.commit()
            connection.close()
        else:
            print("problem with the length of page title or description (Not Inserted !)")
Are you saving JSON, because then `{"h1": "\u0628\u0631\u062e\u0648\u0631\u062f"}` is what you want. — juanpa.arrivillaga, Nov 02 '19 at 00:15
Dear @Barmar, I tried to decode it, but I don't know how to do that in this case. — William Johnson, Nov 02 '19 at 00:27
Why are you using `json.dumps()` in the first place? Just do `page_heading = data` and later store `page_heading` in the DB. — Barmar, Nov 02 '19 at 00:28
Show your code where you're saving in the DB and we can help you fix it. — Barmar, Nov 02 '19 at 00:28
Dear @juanpa.arrivillaga, is this the nature of JSON? Is there any way to decode before saving? I don't want to decode every time; I want to decode to the original Unicode before saving to the db. — William Johnson, Nov 02 '19 at 00:29
Dear @Barmar, I used json.dumps() because I want to build the JSON structure without writing more code. — William Johnson, Nov 02 '19 at 00:32
In that case, what's the problem? The escape sequences are the correct way to put Unicode characters in JSON. — Barmar, Nov 02 '19 at 00:34
I want to store it like a normal variable, without changing the data (e.g. as when I store just h4 = (soup.find('h4').text).strip()). — William Johnson, Nov 02 '19 at 00:45
I was worried about decoding it with PHP, but everything was okay. "; echo $json['h1']; ?> — William Johnson, Nov 02 '19 at 01:09
@WilliamJohnson I'm not sure I understand what you are asking. In any case, that is a valid representation of the correct Unicode string in JSON. If you have a valid JSON parser, it will create the correct data structure when you deserialize it, which is presumably what you want. What, **exactly**, is the problem? — juanpa.arrivillaga, Nov 02 '19 at 01:23

score 1 · Answer 1 · edited Jun 20 '20 at 09:12

By default, json.dumps() escapes all non-ASCII characters in the strings it outputs.

To keep them as-is, pass ensure_ascii=False when dumping JSON (the default is True):

    import json

    data = {"h1": "زبان فارس"}
    without_ascii_escape = json.dumps(data, ensure_ascii=False)
    print(without_ascii_escape)
    # Prints {"h1": "زبان فارس"}; the database should hold the same text (at least after reading it back)
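
As the comments on the question point out, the escaped and unescaped forms are two spellings of the same JSON document. A minimal sketch that checks this with the standard-library json module (the sample string is taken from the question; the variable names are only illustrative):

```python
import json

data = {"h1": "زبان فارس - english"}

escaped = json.dumps(data)                        # default: ensure_ascii=True
readable = json.dumps(data, ensure_ascii=False)   # non-ASCII characters kept as-is

# The escaped form is pure ASCII; the readable form is not.
print(escaped.isascii(), readable.isascii())

# Both forms parse back to the same Python dict.
print(json.loads(escaped) == json.loads(readable) == data)
```

So the choice only affects how the text looks when it is stored, not what a JSON parser reads back from it.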

From the Python 3.8 documentation:

If ensure_ascii is true (the default), the output is guaranteed to have all incoming non-ASCII characters escaped. If ensure_ascii is false, these characters will be output as-is.
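
Applied to the question's code, the flag could be used roughly as follows. This is only a sketch: headings_to_json is a hypothetical helper (it is not part of the original code), and it assumes the heading texts have already been extracted into a plain dict. It returns an empty string when no heading has any text, mirroring the question's heading_flag logic.

```python
import json

def headings_to_json(headings):
    """Serialise non-empty headings to JSON, keeping non-ASCII text readable.

    `headings` is a hypothetical dict such as {"h1": "...", "h2": ""}.
    Returns "" when every heading is empty.
    """
    # Collapse runs of whitespace, then drop headings that end up empty.
    cleaned = {tag: " ".join(text.split()) for tag, text in headings.items()}
    cleaned = {tag: text for tag, text in cleaned.items() if text}
    if not cleaned:
        return ""
    return json.dumps(cleaned, ensure_ascii=False)

page_heading = headings_to_json({"h1": "  زبان فارس -  english ", "h2": ""})
print(page_heading)
```

Whether the text then appears unescaped in MySQL presumably also depends on the column and connection using a Unicode character set such as utf8mb4; that part is not shown here.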
