Context: I'm trying to build a seaborn heatmap in order to map the following type of data (in a dataframe):

[Image: sample of the dataframe]

(This can be up to 50 fruits and 5500 stores)

My problem (I think) is that seaborn appears to want to use ascii but my data is in utf-8. When I read the csv file, I can't do the following:

df = pd.read_csv('data.csv', encoding = 'ascii')

without getting the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 3: ordinal not in range(128)

When I bring it in using utf-8, it will read and I can reshape it to heatmap-friendly form but then when trying to run:

sns.heatmap(df2)

I get a similar UnicodeDecodeError.

I do have simple special characters (colons, backspaces, etc.) in either my store or fruit fields so I'm wondering what the best approach is here.

  • Should I run something on my dataframe to remove the utf-8 characters and then encode in ascii?
  • Should I be doing something to my source .csv file to remove the utf-8 characters?
  • Can I run seaborn another way to let it accept the encoding I have?

If anybody has a preferred method, can they help me with the proper code to get it done?

Python version 2.7.12 :: Anaconda 4.1.1 (64-bit), Pandas 0.18.1, Seaborn 0.7.1

Anjan G
  • Are you using Python 2.7 or Python 3.x? Which versions of Seaborn and Pandas are you using? – Schmuddi Oct 07 '16 at 14:53
  • Also, to clarify: You say that if you run ``sns.heatmap(df)``, you get the same UnicodeDecodeError as for your ``read_csv('data.csv', encoding='ascii')`` command. How is that dataframe ``df`` created if the ``pd.read_csv()`` command failed? – Schmuddi Oct 07 '16 at 15:00
  • What happens when you do pd.read_csv('data.csv', encoding = 'utf8')? – Zeugma Oct 07 '16 at 15:01
  • @Schmuddi: Edited the question to include versions. Also, I should have clarified: the df from part 1 was brought in using the default encoding and then reshaped into a heatmap-friendly format. – Anjan G Oct 07 '16 at 15:06
  • @Boud - works fine but then I get an error after reshaping and trying to run it through the heatmap function. Editing question to clarify – Anjan G Oct 07 '16 at 15:07
  • Possible duplicate of [UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)](http://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20) – Kevin J. Chase Oct 07 '16 at 15:57

1 Answer

Your configuration (Python 2.7, Pandas 0.18.1, Seaborn 0.7.1) should certainly be able to handle utf-8. Even if the font used in the plot doesn't support these unicode characters, the heatmap should still be displayed. Here is a test case:

import pandas as pd
import seaborn as sns

df = pd.DataFrame(
        {'Fruit': ['Apple', 'Banana', 'Orange', 'Kiwi'] * 2,
        'Store': [u'Kr\xf6ger'] * 4 + [u'F\u0254\u0254d Li\u01ebn'] * 4,
        'Stock': [6, 1, 3, 4, 1, 7, 7, 9]})

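# Keyword arguments: newer pandas releases no longer accept these positionally.
sns.heatmap(df.pivot(index="Fruit", columns="Store", values="Stock"))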

The problem, therefore, is somewhere in your data frame df2. Your comment states that df2 is created by reshaping another data frame, probably also by something like pivot() or crosstab().

Let's assume that this original data frame contains the columns Store and Fruit, and that it was read from your file like so, i.e. with default encoding:

raw = pd.read_csv('data.csv')

For testing, this is the content of that file data.csv:

Store,Fruit,Stock
Kröger,Apple,6
Kröger,Banana,1
Kröger,Orange,3
Kröger,Kiwi,4
Fɔɔd Liǫn,Apple,1
Fɔɔd Liǫn,Banana,7
Fɔɔd Liǫn,Orange,7
Fɔɔd Liǫn,Kiwi,9
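
Before changing anything, it can help to check whether the labels are byte strings or Unicode strings, since mixing the two is a common trigger for implicit ASCII conversions in Python 2. The following is only a sketch: describe_labels is a hypothetical helper (not part of pandas or seaborn), and it assumes that every label is a string.

```python
def describe_labels(labels):
    """Return a (label, type name, is_ascii) tuple for each label.

    Hypothetical diagnostic helper; assumes every label is a byte
    string or a Unicode string.
    """
    report = []
    for label in labels:
        if isinstance(label, bytes):
            # bytearray() yields integers on both Python 2 and Python 3.
            is_ascii = all(byte < 128 for byte in bytearray(label))
        else:
            is_ascii = all(ord(char) < 128 for char in label)
        # The type name reads 'str'/'unicode' on Python 2
        # and 'bytes'/'str' on Python 3.
        report.append((label, type(label).__name__, is_ascii))
    return report
```

It could be applied to, for example, raw["Store"].unique(), or to the index and columns of df2. Byte-string labels with is_ascii set to False are the ones that would need decoding.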

When read with the default encoding under Python 2, the Store and Fruit columns hold byte strings. To turn them into valid Unicode strings, use the decode() string method, like so:

raw["Store"] = raw["Store"].apply(lambda x: x.decode("utf-8"))
raw["Fruit"] = raw["Fruit"].apply(lambda x: x.decode("utf-8"))
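
Note that decode() only makes sense for byte strings. In Python 2, calling it on a value that is already a Unicode string first encodes that value implicitly as ASCII, which raises a UnicodeEncodeError for characters such as ö. A more defensive variant, sketched here with a hypothetical helper named to_text, decodes only when the value really is a byte string:

```python
def to_text(value, encoding="utf-8"):
    """Return value as a Unicode string, decoding only byte strings.

    Hypothetical helper; the utf-8 default is an assumption about
    how the source file was encoded.
    """
    if isinstance(value, bytes):  # 'bytes' is an alias of 'str' on Python 2
        return value.decode(encoding)
    # Unicode strings (and non-string values) pass through unchanged.
    return value
```

With this helper, the two lines above could be written as raw["Store"] = raw["Store"].apply(to_text) and likewise for Fruit, and they would be safe to run whether or not the columns have been decoded already.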

Now, heatmap() should work happily with the data frame:

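# Keyword arguments: newer pandas releases no longer accept these positionally.
sns.heatmap(raw.pivot(index="Fruit", columns="Store", values="Stock"))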
Schmuddi
  • Is there a reason that doing `df['Store'] = df['Store'].apply(lambda x: x.decode('utf-8'))` gives me `'ascii' codec can't encode character u'\xf6' in position 2: ordinal not in range(128)`? – Anjan G Oct 07 '16 at 18:42
  • Also, you're right about the reshaping. Specifically, I followed the instructions here: http://stackoverflow.com/a/39899507/6210012 – Anjan G Oct 07 '16 at 18:44
  • @AnjanG: ``u'\xf6'`` is the Unicode escape for ``ö``. This means that if you get this error when trying to decode the strings in ``Store``, in all likelihood the values in that column are already valid Unicode strings. If so, what happens if you skip this command? Any success with the heatmap? I'm afraid that without a sample data file that produces your error, there's not much left that I can do to help you. – Schmuddi Oct 07 '16 at 21:31