I just started working on clustering Japanese text in Python 2. However, when I create a dictionary from the Japanese words/terms, the dictionary keys are displayed as Unicode escape sequences instead of Japanese characters. The code is as follows:

import numpy as np
import pandas as pd
import scipy.sparse as sp

# load data
allWrdMat10 = pd.read_csv("../../data/allWrdMat10.csv.gz", encoding='CP932')


## Set X as CSR Sparse Matrix
X = np.array(allWrdMat10)
X = sp.csr_matrix(X)

## create dictionary
dict_index = {t:i for i,t in enumerate(allWrdMat10.columns)}

freqrank = np.array(dict_index.values()).argsort()
X_transform = X[:, freqrank < 1000].transpose().toarray()

The output of allWrdMat10.columns still shows the Japanese characters:

Index([u'?', u'.', u'・', u'%', u'0', u'1', u'10月', u'11月', u'12月', u'1つ',
       ...
       u'瀋陽', u'疆', u'盧', u'籠', u'絆', u'胚', u'諫早', u'趙', u'鉉', u'鎔基'],
      dtype='object', length=8655)

However, the output of dict_index.keys() looks like this:

[u'\u77ed\u9283',
 u'\u5efa\u3066',
 u'\u4f0a',
 u'\u5e73\u5b89',
 u'\u6025\u9a30',
 u'\u897f\u65e5\u672c',
 u'\u5e03\u9663',
 ...]

Is there any way to keep the Japanese words/terms as the dictionary keys? Or is there a way to convert the Unicode escapes back to Japanese words/terms? Thanks.

tzu

2 Answers

When you ask the interpreter for the value of an expression, it computes the value and then outputs its repr(). The print statement (Python 2) or function (Python 3) uses the str() of the value instead. So if I take one of the problematic keys and ask my interpreter for its value, I get what you see. If I print it, however, I see the expected Japanese characters:

>>> u'\u77ed\u9283'
u'\u77ed\u9283'
>>> print u'\u77ed\u9283'
短銃

So you do have the values you need. The interpreter was simply showing a different representation of them, one that is guaranteed to be representable in ASCII.
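Applied to the dictionary from the question, this can be sketched as follows. The `columns` list is a hypothetical stand-in for `allWrdMat10.columns` (only a few terms copied from the question), and `format_terms` is a helper name made up for illustration. In Python 2, displaying a list calls repr() on every element, which is why `dict_index.keys()` shows escapes; printing the strings themselves, or one joined string, should show the characters, assuming the terminal encoding can represent them.

```python
# -*- coding: utf-8 -*-

# Hypothetical stand-in for allWrdMat10.columns: a few terms from the question.
columns = [u'短銃', u'建て', u'伊', u'平安']

# Same construction as in the question: term -> column index.
dict_index = {t: i for i, t in enumerate(columns)}


def format_terms(index):
    """Join the keys, ordered by column index, into one unicode string."""
    ordered = sorted(index, key=index.get)
    return u', '.join(ordered)


# The escaped form and the Japanese literal denote the same key.
same = u'\u77ed\u9283' in dict_index

# print uses str() rather than repr(), so the characters themselves are shown.
print(format_terms(dict_index))
```

Nothing has to be converted: the keys already are the Japanese terms, and only the way they are displayed differs.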

holdenweb

Thanks for this explanation. However, when I set `labels=dict_index.keys()` for the function `plot_dendrogram`, the plot cannot show the words. This is why I am trying to convert the unicode to Japanese terms or keep it unchanged when I create the dictionary. – tzu Jun 05 '17 at 08:28

You did not prefix the string with u, which is needed in Python 2. Even better, use from __future__ import unicode_literals.
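A minimal sketch of what that import changes, assuming a UTF-8 encoded source file (the variable names are only for illustration). With the import, an unprefixed literal is a unicode string in Python 2; in Python 3 the import has no effect because all string literals are already text.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

# No u prefix needed: with the import this is a unicode string in Python 2.
term = '短銃'

# The literal equals its escaped form and is counted in characters, not bytes.
matches_escape = (term == '\u77ed\u9283')
length = len(term)
```

Note that this only affects literals written in your own source code. Strings read by pandas, such as the column names in the question, are already unicode.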

khelili miliana