
I have searched and searched. I can't exactly find an issue quite like mine. I did try.

I have read Parquet data into a Pandas DataFrame and used a .query statement to filter the data.

import pandas as pd
import fastparquet as fp

fieldsToInclude = ['ACCURACY', 'STATE', 'LOCATION', 'COUNTRY_CODE']

criteria = 'ACCURACY == 1.0 or COUNTRY_CODE == "AD"'

# Read the Parquet file into a DataFrame (inputPath is defined earlier in the script).
pandaParqFile = fp.ParquetFile(fn=inputPath + "World Zip Code.parquet")
newDF = pandaParqFile.to_pandas()

# Keep only the columns of interest, then filter the rows.
dataset = newDF[fieldsToInclude]
extraction = dataset.query(criteria)

with pd.option_context('display.max_rows', 100, 'display.max_columns', 10):
    print(extraction)

When it prints, I get UnicodeEncodeError: 'charmap' codec can't encode character '\u0310' in position 4174: character maps to <undefined>. This is in Geany. I get a different character and position if I print from the administrator console. I'm running Windows 7. The data does have characters that are Latin, German, etc.

I'm actually seeing some special characters when I print the data to the screen using other criteria for .query, so I guess it's only certain characters? I looked up '\u0310' and that's some sort of Latin i. But I can print other Latin characters.

I've tried some suggested fixes that involve specifying an encoding, but they didn't seem to apply because I'm working with a DataFrame. Other questions I came across were about this error occurring when opening CSV files, which isn't what I'm experiencing here.

The zip code data is just something to work with to learn Pandas. In the future, there's no telling what kind of data will be processed by this script. I'm really looking for a solution to this problem that will prevent it from happening regardless of what kinds of characters the data will have. Simply removing the LOCATION field, which is where all of these special characters are for this particular data, isn't viable.

Has anyone seen this before? Thanks in advance.

  • This is very likely a duplicate. Search SO for `[python] UnicodeEncodeError: 'charmap' codec can't encode error` and one gets 400 hits, starting with questions with similar titles. The fact that the data being displayed comes from a particular source is irrelevant. It is where it goes to. There is an encoding mismatch somewhere. – Terry Jan Reedy Mar 01 '18 at 20:33
  • In Python 3.6, the handling of non-ascii output to the terminal was improved. – Terry Jan Reedy Mar 01 '18 at 20:35
  • As @TerryJanReedy said, this is probably entirely unrelated to `pandas`, as the exception is caused by the `print()` expression. Try running Python from a commandline that supports UTF-8, or avoid `print()` expressions (and write to files opened with UTF-8 or UTF-16 encoding instead). – lenz Mar 01 '18 at 21:30
  • Yes, there are lots of results, and one gets sick of reading through them for hours and not finding solutions. I'll try updating Python. Failing that, I guess I'm going to have to try to find something besides print statements to check output. I'm using UTF-8, so I don't get why this is such a problem. –  Mar 03 '18 at 16:48
  • Though it's an old thread, I would refer to https://stackoverflow.com/a/43989185/282155, which suggests `export PYTHONIOENCODING=UTF-8` before executing the Python script in the console. – Kaushik Acharya Nov 14 '20 at 14:22

1 Answer


You need to specify UTF-8 as the display encoding.

Try:

with pd.option_context('display.encoding', 'UTF-8',
                       'display.max_rows', 100,
                       'display.max_columns', 10):
    print(extraction)
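
If that alone doesn't help, the comments above point at the console encoding itself rather than Pandas. A minimal sketch, assuming Python 3.7+ (on older versions you would instead set the PYTHONIOENCODING environment variable before launching the script, e.g. `set PYTHONIOENCODING=UTF-8` on Windows):

import sys

# Force stdout to UTF-8 so print() no longer falls back to the Windows
# 'charmap' codec; any character the terminal still can't show is replaced
# instead of raising UnicodeEncodeError.
sys.stdout.reconfigure(encoding='utf-8', errors='replace')

with pd.option_context('display.max_rows', 100, 'display.max_columns', 10):
    print(extraction)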