How to Make Scipy Dendrogram Read Japanese Words/Terms

Question

I am trying to run hierarchical clustering on Japanese words/terms and to plot the result with scipy.cluster.hierarchy.dendrogram. However, the plot shows small rectangles instead of the Japanese words/terms. At first I thought this was because the keys of the dictionary I create were Unicode escapes rather than Japanese text (as in the question I asked here). I was then advised to use Python 3, and with it I finally got dictionary keys in Japanese (as in the question I asked here). However, even when I pass Japanese words/terms to the labels parameter of scipy.cluster.hierarchy.dendrogram, the plot still does not show them. I have checked several similar posts, but there still seems to be no clear solution. My code is as follows:

import pandas as pd
import numpy as np
from sklearn import decomposition
from sklearn.cluster import AgglomerativeClustering as hicluster
from scipy.spatial.distance import cdist, pdist
from scipy import sparse as sp ## Sparse Matrix
from scipy.cluster.hierarchy import dendrogram
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.style.use('ggplot')

## Import Data
allWrdMat10 = pd.read_csv("../../data/allWrdMat10.csv.gz", 
    encoding='CP932')

## Set X as CSR Sparse Matrix 
X = np.array(allWrdMat10)
X = sp.csr_matrix(X)

def plot_dendrogram(model, **kwargs):
    # Children of hierarchical clustering
    children = model.children_

    # Distances between each pair of children
    # Since we don't have this information, we can use a uniform one
    # for plotting
    distance = np.arange(children.shape[0])

    # The number of observations contained in each cluster level
    no_of_observations = np.arange(2, children.shape[0]+2)

    # Create linkage matrix and then plot the dendrogram
    linkage_matrix = np.column_stack([children, distance, 
        no_of_observations]).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)

dict_index = {t:i for i,t in enumerate(allWrdMat10.columns)}

dictlist = []
temp = []
akey = []
avalue = []

for key, value in dict_index.items():
    akey.append(key)
    avalue.append(value)
    temp = [key,value]
    dictlist.append(temp)

avalue = np.array(avalue)

X_transform = X[:, avalue < 1000].transpose().toarray()

freq1000terms = akey
freq1000terms = np.array(freq1000terms)[avalue < 1000]

hicl_ward = hicluster(n_clusters=40,linkage='ward', compute_full_tree = 
    False)
hiclwres = hicl_ward.fit(X_transform)

plt.rcParams["figure.figsize"] = (15,6)

model1 = hiclwres
plt.title('Hierarchical Clustering Dendrogram (Ward Linkage)')
plot_dendrogram(model1, p = 40, truncate_mode = 'lastp', orientation = 
    'top', labels=freq1000terms[model1.labels_], color_threshold = 991)
plt.ylim(959,1000)
plt.show()

umutto · Accepted Answer · 2017-06-06T07:46:49.307

You need to give matplotlib a font that can display Japanese characters. You can list the fonts available on your system with the following code:

import matplotlib.font_manager
matplotlib.font_manager.findSystemFonts(fontpaths=None)

It returns a list of paths to the system font files that matplotlib can use:

['c:\\windows\\fonts\\seguisli.ttf',
 'C:\\WINDOWS\\Fonts\\BOD_R.TTF',
 'C:\\WINDOWS\\Fonts\\GILC____.TTF',
 'c:\\windows\\fonts\\segoewp-light.ttf',
 'c:\\windows\\fonts\\glsnecb.ttf',
 ...
 ...
 'c:\\windows\\fonts\\elephnti.ttf',
 'C:\\WINDOWS\\Fonts\\COPRGTB.TTF']
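
The list above contains file paths, whereas `font.family` expects a family name, and the paths alone do not say which fonts contain Japanese glyphs. As a minimal sketch (the helper name `fonts_covering` is hypothetical and not part of matplotlib), one way to narrow the list down is to read each font's character map and keep the families that have a glyph for every character you want to draw:

```python
from matplotlib import font_manager


def fonts_covering(text):
    """Return sorted family names of fonts with a glyph for every character in text.

    Hypothetical helper: it relies on FT2Font.get_charmap(), which maps
    character codes to glyph indices for one font file.
    """
    needed = {ord(ch) for ch in text}
    families = set()
    for entry in font_manager.fontManager.ttflist:
        try:
            charmap = font_manager.get_font(entry.fname).get_charmap()
        except Exception:
            # Skip font files that cannot be opened.
            continue
        if needed.issubset(charmap):
            families.add(entry.name)
    return sorted(families)


# Families that can draw these sample characters (may be an empty list).
print(fonts_covering("日本語"))
```

Any name this prints should be usable as the value of `font.family`. An empty list would suggest that no Japanese-capable font is visible to matplotlib on that machine.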

Pick a font that includes Japanese glyphs and set it as matplotlib's font family at the beginning of your code, as follows:

import matplotlib.pyplot as plt
plt.rcParams["font.family"] = "Yu Gothic" # e.g. Yu Gothic, which includes Japanese glyphs

This is a global setting, so other plots in the same project will use the same font family too. If you want to change the font for a single piece of text, you can use the font properties of the matplotlib Text object.
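
A minimal sketch of the per-text route, assuming you already know the path of a suitable font file. The helper name `set_ticklabel_font` is hypothetical, and DejaVu Sans is only a stand-in so that the snippet runs anywhere; it has no Japanese glyphs, so replace the path with one from your own font list:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib import font_manager


def set_ticklabel_font(ax, font_path):
    """Apply the font file at font_path to the x tick labels of ax only."""
    prop = font_manager.FontProperties(fname=font_path)
    for label in ax.get_xticklabels():
        label.set_fontproperties(prop)
    return prop


# Stand-in path (assumption): swap in a Japanese-capable font file here.
font_path = font_manager.findfont("DejaVu Sans")

fig, ax = plt.subplots()
ax.set_xticks([0, 1, 2])
ax.set_xticklabels(["日本", "言葉", "用語"])
set_ticklabel_font(ax, font_path)
```

With a dendrogram drawn in the `'top'` orientation, the leaf labels are x tick labels, so calling the helper on `plt.gca()` after `dendrogram(...)` should change only those labels and leave the rest of the figure on the default font.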

Also: if you can't find a suitable font, you can download one such as Code2000, install it, and use it the same way. (For the font to show up in the list, you may need to clear matplotlib's font cache.)
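
If clearing the cache is inconvenient, or the font sits in a directory matplotlib does not scan, another option is to register the file at run time. The sketch below assumes matplotlib 3.2 or later, where `FontManager.addfont` is available; the helper name `register_font` is hypothetical, and DejaVu Sans again stands in for the font you downloaded:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib import font_manager


def register_font(font_path):
    """Register one font file with matplotlib and make it the default family."""
    # Adds the file to the in-memory font list for this session.
    font_manager.fontManager.addfont(font_path)
    # Read the family name stored inside the font file itself.
    name = font_manager.FontProperties(fname=font_path).get_name()
    plt.rcParams["font.family"] = name
    return name


# Stand-in path (assumption): use the path of your downloaded font instead.
demo_path = font_manager.findfont("DejaVu Sans")
family = register_font(demo_path)
print(family)
```

The registration lasts only for the current Python session, so the call has to be repeated in each script or notebook that needs the font.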

edited Jun 06 '17 at 07:46

answered Jun 06 '17 at 06:38


Thanks for your answer. @umutto, do I need to specify the path? – tzu Jun 06 '17 at 06:53
@tzu I believe matplotlib can automatically find the system font's path from its name but it should work either way. You can just copy the path from the list font_manager gave you just to be safe. – umutto Jun 06 '17 at 06:54
I found that matplotlib only finds fonts whose paths start with `/Library/Fonts/foo.ttf`. Fonts whose paths start with `/usr/X11R6/lib/X11/fonts/TTF/foo.ttf`, `/System/Library/Fonts/foo.ttf`, or `/Users/username/Library/Fonts/foo.ttf` are not found, and I get the warning `/usr/local/lib/python3.5/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family ['foo'] not found. Falling back to DejaVu Sans`. Could you help me with this? (I use Mac, Python3, and Jupyter) Thank you so much! – tzu Jun 06 '17 at 07:17
@tzu Sorry, I'm not very familiar with mac. But it seems weird that fonts found by `findSystemFonts()` is giving that error, maybe clear / check the font cache. (you can find the location using `matplotlib.get_cachedir()`) or check for other answers on how to set the font in matplotlib. – umutto Jun 06 '17 at 07:43
No worry, I will try to figure out. Or, can anybody help? – tzu Jun 06 '17 at 08:02
I think I have the answer. I went to the `matplotlib` folder, opened `font_manager.py`, and added the path of the fonts that support Japanese (`font_manager.py` has 4 default paths, but only one of them covers the paths on my laptop). – tzu Jun 06 '17 at 08:47
@tzu Good thinking, looking at the `font_manager.py` It seems like you can try different methods of setting the font path. I've tried `FontProperties.set_file` for an arbitrary path and it worked. I'm sure you can find plenty of other methods there as well, Good luck. – umutto Jun 06 '17 at 08:59
