Encoding Unicode in the Dictionary Key to Japanese

Question

I just started working on text clustering in Japanese with Python 2. However, when I create a dictionary from these Japanese words/terms, the dictionary keys are shown as Unicode escape sequences instead of Japanese. The code is as follows:

import numpy as np
import pandas as pd
import scipy.sparse as sp

# load data
allWrdMat10 = pd.read_csv("../../data/allWrdMat10.csv.gz", encoding='CP932')


## Set X as CSR Sparse Matrix
X = np.array(allWrdMat10)
X = sp.csr_matrix(X)

## create dictionary
dict_index = {t:i for i,t in enumerate(allWrdMat10.columns)}

freqrank = np.array(dict_index.values()).argsort()
X_transform = X[:, freqrank < 1000].transpose().toarray()

The output of allWrdMat10.columns still shows Japanese, as follows:

Index([u'?', u'．', u'・', u'％', u'０', u'１', u'１０月', u'１１月', u'１２月', u'１つ',
       ...
       u'瀋陽', u'疆', u'盧', u'籠', u'絆', u'胚', u'諫早', u'趙', u'鉉', u'鎔基'],
      dtype='object', length=8655)

However, dict_index.keys() is displayed as:

[u'\u77ed\u9283',
 u'\u5efa\u3066',
 u'\u4f0a',
 u'\u5e73\u5b89',
 u'\u6025\u9a30',
 u'\u897f\u65e5\u672c',
 u'\u5e03\u9663',
 ...]

Is there any way I can keep the Japanese words/terms in the dictionary keys? Or is there any way I can convert the Unicode escapes back to Japanese words/terms? Thanks.

score 1 · Answer 1 · answered Jun 05 '17 at 08:21

When you ask the interpreter for the value of an expression, it computes the value and then outputs its repr(). The print statement (Python 2) or function (Python 3) uses the str() of the value. So if I take one of the problematic keys and ask my interpreter what its value is, I get what you see. If I print it, however, I see the required Japanese characters:

>>> u'\u77ed\u9283'
u'\u77ed\u9283'
>>> print u'\u77ed\u9283'
短銃

So you do have the values you need. The interpreter was simply showing a different representation of them, one guaranteed to be representable in ASCII.
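The same point can be checked without looking at the screen at all. The following is a minimal sketch, assuming the source file is saved as UTF-8. The variable names are made up for illustration, and the unicode_escape codec is used here only to produce an ASCII-only view similar to what the Python 2 prompt echoes.

```python
# -*- coding: utf-8 -*-
# Runs on Python 2.7 and Python 3; the u prefix is accepted by both.

key = u'\u77ed\u9283'   # one of the keys shown in the question

# The escaped spelling and the Japanese spelling are the same string.
same_text = (key == u'短銃')

# ASCII-only view of the key, similar to what the Python 2 prompt echoes.
escaped = key.encode('unicode_escape').decode('ascii')

print(same_text)
print(len(key))
print(escaped)
```

On a terminal that can display Japanese, print(key) shows the characters directly. In Python 2, printing the whole list (print dict_index.keys()) shows escapes again, because a list prints the repr() of each element; loop over the keys and print them one by one instead.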

Thanks for this explanation. However, when I set `labels=dict_index.keys()` for the function `plot_dendrogram`, the plot cannot show the words. This is why I am trying to convert the unicode to Japanese terms or keep it unchanged when I create the dictionary. — tzu, Jun 05 '17 at 08:28

khelili miliana · Accepted Answer · 2017-06-06T11:37:22.643

0

You did not prefix the string with u, which is needed in Python 2. Even better, use `from __future__ import unicode_literals`.
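For reference, here is a minimal sketch of what that import changes, assuming a source file saved as UTF-8. The names word and index are hypothetical. The import affects only string literals written in the source file, not data loaded from elsewhere such as column names returned by pd.read_csv.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals  # must precede all other statements

word = '短銃'            # no u prefix, still a unicode string on Python 2
index = {word: 0}        # hypothetical dictionary keyed by that word

is_text = isinstance(word, type(u''))   # unicode on Python 2, str on Python 3
n_chars = len(word)                     # counts characters, not encoded bytes

print(is_text)
print(n_chars)
print(u'\u77ed\u9283' in index)
```

On Python 3 the import is accepted but has no effect, since all string literals are already text.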

edited Jun 06 '17 at 11:37

answered Jun 05 '17 at 10:01

khelili miliana


Thank you. @KHELILI Hamza, could you please provide more details regarding the process? – tzu Jun 05 '17 at 22:24
@tzu this may help: https://stackoverflow.com/questions/809796/any-gotchas-using-unicode-literals-in-python-2-6 – khelili miliana Jun 06 '17 at 11:36
@tzu if your code is working well, please don't forget to accept my answer – khelili miliana Jun 06 '17 at 11:38
