python display unicode in html

Question

I'm writing a script to export my links and their titles from Chrome to HTML.
Chrome bookmarks are stored as JSON, in UTF-8 encoding.
Some titles are in Russian, so they are stored like this:
"name": "\u0425\u0430\u0431\u0440\ ..."

import codecs
f = codecs.open("chrome.json","r", "utf-8")
data = f.readlines()

urls = [] # for links
names = [] # for link titles

ind = 0

for i in data:
    if i.find('"url":') != -1:
        urls.append(i.split('"')[3])
        names.append(data[ind-2].split('"')[3])
    ind += 1

fw = codecs.open("chrome.html","w","utf-8")
fw.write("<html><body>\n")
for n in names:
    fw.write(n + '<br>')
    # print type(n) # this will return <type 'unicode'> for each url!
fw.write("</body></html>")

Now, in chrome.html these titles are displayed as \u0425\u0430\u0431...
How can I turn them back into Russian?
I'm using Python 2.5.

Edit: Solved!

s = '\u041f\u0440\u0438\u0432\u0435\u0442 world!'
type(s)
<type 'str'>

print s.decode('raw-unicode-escape').encode('utf-8')
Привет world!

That's what I needed: to convert a str containing \u041f... escapes into unicode.

f = open("chrome.json", "r")
data = f.readlines()
f.close()

urls = [] # for links
names = [] # for link titles

ind = 0

for i in data:
    if i.find('"url":') != -1:
        urls.append(i.split('"')[3])
        names.append(data[ind-2].split('"')[3])
    ind += 1

fw = open("chrome.html","w")
fw.write("<html><body>\n")
for n in names:
    fw.write(n.decode('raw-unicode-escape').encode('utf-8') + '<br>')
fw.write("</body></html>")

For Python 3 use: s.encode('utf-8').decode('raw-unicode-escape') — Demiurg, Feb 06 '18 at 11:28
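
For readers on Python 3, here is a small sketch of two ways to decode such escapes. The helper names `unescape_raw` and `unescape_json` are made up for illustration. The second one assumes the text is the body of a JSON string literal, which is what a line from a Chrome bookmarks file contains.

```python
import json

def unescape_raw(text):
    """Python 3 form of the raw-unicode-escape trick from the comment above."""
    return text.encode('utf-8').decode('raw-unicode-escape')

def unescape_json(text):
    """Let the JSON parser decode the escapes (hypothetical helper).

    Assumes `text` is the body of a JSON string literal, without the
    surrounding double quotes.
    """
    return json.loads('"' + text + '"')

# Literal backslash-u sequences, as read from the file as plain text.
escaped = '\\u041f\\u0440\\u0438\\u0432\\u0435\\u0442 world!'
via_codec = unescape_raw(escaped)
via_json = unescape_json(escaped)
print(via_codec == via_json)  # True: both give the Russian greeting

# Caveat: raw-unicode-escape reads every byte that is not part of an escape
# as Latin-1, so text that already contains non-ASCII characters is mangled.
mixed = 'caf\u00e9 \\u041f'
print(unescape_raw(mixed) == unescape_json(mixed))  # False
```

The codec route is fine when the input is pure ASCII plus \uXXXX escapes. The JSON route also copes with text that mixes real non-ASCII characters with escapes.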

John Machin · Accepted Answer · 2011-02-27T01:30:29.880

By the way, it's not just Russian; non-ASCII characters are quite common in page names. Example:

name=u'Python Programming Language \u2013 Official Website'
url=u'http://www.python.org/'

As an alternative to fragile code like

urls.append(i.split('"')[3])
names.append(data[ind-2].split('"')[3])
# (1) relies on name being 2 lines before url
# (2) fails if there is a `"` in the name
# example: "name": "The \"Fubar\" website",

you could process the input file using the json module. It is in the standard library from Python 2.6 onwards; for Python 2.5, you can install simplejson.
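
To make point (2) concrete, here is a short Python 3 sketch comparing the quote-splitting approach with a JSON parser on the example name. Wrapping a single line in braces is only an illustrative trick to keep the demo small; a real script would parse the whole file.

```python
import json

# One line as it might appear in the bookmarks file, with escaped quotes.
line = '"name": "The \\"Fubar\\" website",'

# Splitting on double quotes cuts the title at the first escaped quote.
naive = line.split('"')[3]

# Turn the line into a tiny JSON object so the parser can decode it.
parsed = json.loads('{' + line.rstrip(',') + '}')['name']

print(repr(naive))   # 'The \\'
print(repr(parsed))  # 'The "Fubar" website'
```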

Here's a script that emulates yours:

try:
    import json
except ImportError: 
    import simplejson as json
import sys

def convert_file(infname, outfname):

    def explore(folder_name, folder_info):
        for child_dict in folder_info['children']:
            ctype = child_dict.get('type')
            name = child_dict.get('name')
            if ctype == 'url':
                url = child_dict.get('url')
                # print "name=%r url=%r" % (name, url)
                fw.write(name.encode('utf-8') + '<br>\n')
            elif ctype == 'folder':
                explore(name, child_dict)
            else:
                print "*** Unexpected ctype=%r ***" % ctype

    f = open(infname, 'rb')
    bmarks = json.load(f)
    f.close()
    fw = open(outfname, 'w')
    fw.write("<html><body>\n")
    for folder_name, folder_info in bmarks['roots'].iteritems():
        explore(folder_name, folder_info)
    fw.write("</body></html>")
    fw.close()    

if __name__ == "__main__":
    convert_file(sys.argv[1], sys.argv[2])

Tested using Python 2.5.4 on Windows 7 Pro.
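
The script above is Python 2 only (print statement, `iteritems`). Below is a rough Python 3 sketch of the same recursive walk. The generator name `iter_bookmarks` is made up, the `isinstance` guard assumes that some files may hold non-folder values under "roots", and `html.escape` is an addition that the script above does not have.

```python
import html
import json

def iter_bookmarks(node):
    """Yield (name, url) for every bookmark below a folder node."""
    for child in node.get('children', []):
        ctype = child.get('type')
        if ctype == 'url':
            yield child.get('name', ''), child.get('url', '')
        elif ctype == 'folder':
            # Recurse into sub-folders, as explore() does above.
            yield from iter_bookmarks(child)

def convert_file(infname, outfname):
    # json.load decodes \uXXXX escapes, so names arrive as ordinary str.
    with open(infname, encoding='utf-8') as f:
        bmarks = json.load(f)
    with open(outfname, 'w', encoding='utf-8') as fw:
        fw.write('<html><body>\n')
        for root in bmarks['roots'].values():
            # Assumption: skip anything under "roots" that is not a folder.
            if isinstance(root, dict):
                for name, _url in iter_bookmarks(root):
                    fw.write(html.escape(name) + '<br>\n')
        fw.write('</body></html>')
```

It would be called the same way as the original, for example `convert_file('chrome.json', 'chrome.html')`.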

score 1 · Answer 2 · answered Feb 27 '11 at 15:09

It's a JSON file, so read it using a JSON parser. That will give you a Unicode string directly, without you having to unescape it. This is going to be much more reliable (as well as simpler), since JSON strings are not the same format as Python strings.

(They're pretty similar and both use the \u format, but your current code will fall over badly for other escaped characters, not to mention that it relies on the exact attribute order and whitespace settings of a JSON file, which makes it very fragile indeed.)

import json, cgi, codecs

with open('chrome.json') as fp:
    bookmarks = json.load(fp)

with codecs.open('chrome.html', 'w', 'utf-8') as fp:
    fp.write(u'<html><body>\n')
    for root in bookmarks[u'roots'].values():
        for child in root['children']:
            if child.get(u'type') != u'url':
                continue  # folders have no 'url' key; only top-level links are written
            fp.write(u'<a href="%s">%s</a>' % (
                cgi.escape(child[u'url'], True),  # True also escapes " inside the attribute
                cgi.escape(child[u'name'])
            ))
    fp.write(u'</body></html>')

Note also the use of cgi.escape to HTML-encode any < or & characters in the strings.
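
On current Python 3, `cgi.escape` is no longer available (it was removed in 3.8) and `html.escape` takes its place. A minimal sketch of the same escaping step follows; the helper `bookmark_link` and the sample values are made up for illustration.

```python
import html

def bookmark_link(name, url):
    """Build one anchor tag (hypothetical helper, not from the answer)."""
    # quote=True (the default) also escapes " and ', which matters inside
    # the double-quoted href attribute.
    return '<a href="%s">%s</a>' % (html.escape(url, quote=True),
                                    html.escape(name))

print(bookmark_link('Q&A <tips>', 'http://example.com/?a=1&b="2"'))
# <a href="http://example.com/?a=1&amp;b=&quot;2&quot;">Q&amp;A &lt;tips&gt;</a>
```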

score 0 · Answer 3 · answered Feb 26 '11 at 16:29

I'm not sure where you're trying to display the Russian text, but in the interpreter you can do the following to see it:

s = '\u0425\u0430\u0431'
l = s.split('\u')
l.remove('')
for x in l:
    print(unichr(int(x, 16))),

This will give the following output:

Х а б

If you're storing it in HTML, you're better off leaving it as '\u0425...' until you need to convert it.
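
One caution: a browser does not interpret \uXXXX sequences in HTML text, so escapes left in the page show up literally, which is the symptom in the question. If an ASCII-only HTML file is wanted, numeric character references are the HTML counterpart. Here is a small Python 3 sketch using the standard `xmlcharrefreplace` error handler; the variable names are illustrative.

```python
title = '\u0425\u0430\u0431'  # the three Cyrillic letters from the example above

# The 'xmlcharrefreplace' error handler replaces every character the target
# codec cannot represent with a decimal reference such as &#1061;.
ascii_html = title.encode('ascii', 'xmlcharrefreplace').decode('ascii')

print(ascii_html)  # &#1061;&#1072;&#1073;
```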

Hope this helps.

wisty · Answer 4 · score 0 · 2011-02-26T16:49:00.867

You could include the UTF-8 BOM, so Chrome knows to read the file as UTF-8, not ASCII:

fw = codecs.open("chrome.html","w","utf-8")
fw.write(codecs.BOM_UTF8.decode('utf-8'))
fw.write(u'你好')

Oh, but if you later read that file back in Python, remember to use 'utf-8-sig' to strip the BOM.

Maybe you need to encode the unicode into UTF-8, but I think codecs already does that, right?
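
Here is a small Python 3 sketch of the BOM round trip, written to a throwaway temporary file. Opening a file with an explicit encoding does the UTF-8 encoding for you, and 'utf-8-sig' adds the BOM on write and strips it on read. The file name and HTML content are placeholders.

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'chrome.html')  # throwaway location

# 'utf-8-sig' writes the BOM once at the start, then encodes as UTF-8.
with open(path, 'w', encoding='utf-8-sig') as fw:
    fw.write('<html><body>\u4f60\u597d</body></html>')

# Check the raw bytes for the BOM.
with open(path, 'rb') as f:
    raw = f.read()
starts_with_bom = raw.startswith(codecs.BOM_UTF8)

# Reading as plain 'utf-8' keeps the BOM as U+FEFF; 'utf-8-sig' strips it.
with open(path, encoding='utf-8') as f:
    kept = f.read()
with open(path, encoding='utf-8-sig') as f:
    stripped = f.read()

print(starts_with_bom, kept[0] == '\ufeff', stripped[0])  # True True <
```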

edited Feb 26 '11 at 16:49

answered Feb 26 '11 at 16:42
