
I am trying to run hierarchical clustering on Japanese words/terms and plot the result with scipy.cluster.hierarchy.dendrogram. However, the plot shows small rectangles instead of the Japanese words/terms. At first, I thought this was because the keys of the dictionary I created were unicode rather than Japanese (see the question I asked here). I was then advised to use Python 3 to solve that issue, and the dictionary keys are now Japanese words instead of unicode (see the question I asked here). However, even when I pass Japanese words/terms to the labels parameter of scipy.cluster.hierarchy.dendrogram, the plot still does not show those words. I have checked several similar posts, but there still seems to be no clear solution. My code is as follows:

import pandas as pd
import numpy as np
from sklearn import decomposition
from sklearn.cluster import AgglomerativeClustering as hicluster
from scipy.spatial.distance import cdist, pdist
from scipy import sparse as sp ## Sparse Matrix
from scipy.cluster.hierarchy import dendrogram
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.style.use('ggplot')

## Import Data
allWrdMat10 = pd.read_csv("../../data/allWrdMat10.csv.gz", 
    encoding='CP932')

## Set X as CSR Sparse Matrix 
X = np.array(allWrdMat10)
X = sp.csr_matrix(X)

def plot_dendrogram(model, **kwargs):
    # Children of hierarchical clustering
    children = model.children_

    # Distances between each pair of children
    # Since we don't have this information, we can use a uniform one
    # for plotting
    distance = np.arange(children.shape[0])

    # The number of observations contained in each cluster level
    no_of_observations = np.arange(2, children.shape[0]+2)

    # Create linkage matrix and then plot the dendrogram
    linkage_matrix = np.column_stack([children, distance, 
        no_of_observations]).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)

dict_index = {t:i for i,t in enumerate(allWrdMat10.columns)}

dictlist = []
temp = []
akey = []
avalue = []

for key, value in dict_index.items():
    akey.append(key)
    avalue.append(value)
    temp = [key,value]
    dictlist.append(temp)

avalue = np.array(avalue)

X_transform = X[:, avalue < 1000].transpose().toarray()

freq1000terms = akey
freq1000terms = np.array(freq1000terms)[avalue < 1000]

hicl_ward = hicluster(n_clusters=40, linkage='ward',
                      compute_full_tree=False)
hiclwres = hicl_ward.fit(X_transform)

plt.rcParams["figure.figsize"] = (15,6)

model1 = hiclwres
plt.title('Hierarchical Clustering Dendrogram (Ward Linkage)')
plot_dendrogram(model1, p=40, truncate_mode='lastp', orientation='top',
                labels=freq1000terms[model1.labels_], color_threshold=991)
plt.ylim(959,1000)
plt.show()
tzu

1 Answer


You need to give matplotlib a valid font to display Japanese characters with. You can find the available fonts from your system by using the following code:

import matplotlib.font_manager
matplotlib.font_manager.findSystemFonts(fontpaths=None)

It will give you a list of system fonts that matplotlib can use:

['c:\\windows\\fonts\\seguisli.ttf',
 'C:\\WINDOWS\\Fonts\\BOD_R.TTF',
 'C:\\WINDOWS\\Fonts\\GILC____.TTF',
 'c:\\windows\\fonts\\segoewp-light.ttf',
 'c:\\windows\\fonts\\glsnecb.ttf',
 ...
 ...
 'c:\\windows\\fonts\\elephnti.ttf',
 'C:\\WINDOWS\\Fonts\\COPRGTB.TTF']
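The list above contains file paths, while the font setting expects a family name. As a rough sketch, the names matplotlib has already indexed can be read from `font_manager.fontManager.ttflist` and filtered. The `keywords` tuple is a hypothetical list of name fragments, not something from matplotlib; a name match is only a heuristic and does not prove the font contains Japanese glyphs.

```python
from matplotlib import font_manager

# Family names of every font matplotlib has already indexed.
font_names = sorted({font.name for font in font_manager.fontManager.ttflist})

# Hypothetical name fragments; adjust them to the fonts on your system.
keywords = ("Gothic", "Mincho", "Hiragino", "Noto Sans CJK", "IPA")

# Heuristic filter on the family name only.
candidates = [name for name in font_names if any(k in name for k in keywords)]
print(candidates)
```

Any name printed here can be tried as the font family; if the labels still render as rectangles, that font lacks the glyphs.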

Pick a font that supports Japanese characters and set it as matplotlib's font family at the beginning of your code, as follows:

import matplotlib.pyplot as plt
plt.rcParams["font.family"] = "Yu Gothic"  # e.g. Yu Gothic, which includes Japanese glyphs

This is a global setting, so other plots in the same project will also use this font family. If you want to change the font for a single text only, you can use the font properties of the matplotlib text object.
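A minimal sketch of the per-text approach, assuming a font family named "Yu Gothic" is installed (substitute any Japanese-capable family on your system). The title text is made up for illustration; only the title and the tick labels get the font, and the rest of the figure keeps the default.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties

# Assumption: "Yu Gothic" is installed; replace with a family you have.
jp_font = FontProperties(family="Yu Gothic")

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [2, 1, 3])

# Apply the font to one text object only.
ax.set_title("階層的クラスタリング", fontproperties=jp_font)

# Tick labels (where dendrogram leaf labels end up) can be set one by one.
for tick_label in ax.get_xticklabels():
    tick_label.set_fontproperties(jp_font)
```

For a dendrogram, the same loop over `plt.gca().get_xticklabels()` after the `dendrogram(...)` call should reach the leaf labels.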


Also: if you can't find an appropriate font, you can download one such as code2000, install it, and use it the same way. (For the font to show up in the list, you may need to clear matplotlib's cache.)
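As an alternative to clearing the cache, here is a sketch that registers a font file by its path, assuming matplotlib 3.2 or later (where `FontManager.addfont` is available). The path below is a stand-in: it points at the DejaVu Sans file bundled with matplotlib so that the snippet runs anywhere, and it should be replaced with the path of the Japanese font file you installed.

```python
import matplotlib
from matplotlib import font_manager
from matplotlib.font_manager import FontProperties

# Stand-in path: DejaVu Sans ships with matplotlib. Replace this with
# the path of the Japanese font file you downloaded.
font_path = font_manager.findfont("DejaVu Sans")

# Register the file with the running font manager (matplotlib >= 3.2).
font_manager.fontManager.addfont(font_path)

# Read the family name stored in the file and use it like any other font.
family_name = FontProperties(fname=font_path).get_name()
matplotlib.rcParams["font.family"] = family_name
print(family_name)
```

The registration lasts for the current Python session only, so it has to run before plotting each time.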

umutto
  • Thanks for your answer. @umutto, do I need to specify the path? – tzu Jun 06 '17 at 06:53
  • @tzu I believe matplotlib can automatically find the system font's path from its name but it should work either way. You can just copy the path from the list font_manager gave you just to be safe. – umutto Jun 06 '17 at 06:54
  • I found that matplotlib only finds fonts located at paths like `/Library/Fonts/foo.ttf`. Fonts at `/usr/X11R6/lib/X11/fonts/TTF/foo.ttf`, `/System/Library/Fonts/foo.ttf`, or `/Users/username/Library/Fonts/foo.ttf` are not found, and I get the warning `/usr/local/lib/python3.5/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family ['foo'] not found. Falling back to DejaVu Sans`. Could you help me with this? (I use Mac, Python3, and Jupyter) Thank you so much! – tzu Jun 06 '17 at 07:17
  • @tzu Sorry, I'm not very familiar with Mac. But it seems weird that fonts found by `findSystemFonts()` give that error. Maybe clear or check the font cache (you can find its location using `matplotlib.get_cachedir()`), or check other answers on how to set the font in matplotlib. – umutto Jun 06 '17 at 07:43
  • No worries, I will try to figure it out. Or can anybody help? – tzu Jun 06 '17 at 08:02
  • I think I have the answer. I went to the `matplotlib` folder, opened `font_manager.py`, and added the path of the fonts that support Japanese (`font_manager.py` has 4 default paths, but only one of them covers a path on my laptop). – tzu Jun 06 '17 at 08:47
  • @tzu Good thinking. Looking at `font_manager.py`, it seems you can try different ways of setting the font path. I tried `FontProperties.set_file` with an arbitrary path and it worked. I'm sure you can find plenty of other methods there as well. Good luck. – umutto Jun 06 '17 at 08:59