I have searched and searched. I can't exactly find an issue quite like mine. I did try.
I have read Parquet data into a Pandas DataFrame and used .query() to filter it:
import pandas as pd
import fastparquet as fp

# Columns to keep and the row filter to apply
fieldsToInclude = ['ACCURACY', 'STATE', 'LOCATION', 'COUNTRY_CODE']
criteria = 'ACCURACY == 1.0 or COUNTRY_CODE == "AD"'

# Read the Parquet file and convert it to a DataFrame
pandaParqFile = fp.ParquetFile(fn=inputPath + "World Zip Code.parquet")
newDF = pandaParqFile.to_pandas()

# Select the columns, filter the rows, and print the result
dataset = newDF[fieldsToInclude]
extraction = dataset.query(criteria)
with pd.option_context('display.max_rows', 100, 'display.max_columns', 10):
    print(extraction)
When it prints, I get UnicodeEncodeError: 'charmap' codec can't encode character '\u0310' in position 4174: character maps to <undefined>. This is in Geany. I get a different character and position if I print from the administrator console. I'm running Windows 7. The data does contain Latin, German, and other non-ASCII characters.
I do see some special characters when I print the data using other criteria for .query(), so I guess it's only certain characters? I looked up '\u0310' and it's some sort of Latin i, but I can print other Latin characters just fine.
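From what I can tell, the failure happens when print() encodes the output for the console, not in Pandas itself. A quick check (a minimal sketch; the actual encoding names depend on where the script runs) reproduces the same error without any DataFrame involved:

import sys
print(sys.stdout.encoding)            # e.g. 'cp1252' under Geany, 'cp437'/'cp850' in the console
'\u0310'.encode(sys.stdout.encoding)  # raises the same 'charmap' UnicodeEncodeError on such a stream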
I've tried some suggestions that involve specifying an encoding, but they didn't seem to help because I'm working with a DataFrame. The other questions I came across were about this error occurring while opening CSV files, which isn't what's happening here.
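The suggestions I found were along these lines (a minimal sketch, assuming Python 3; it re-wraps stdout so unencodable characters are replaced rather than raising), but I couldn't get them to help with the DataFrame output:

import io, sys
# Keep the console's own encoding but substitute '?' for anything it can't represent
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding=sys.stdout.encoding, errors='replace')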
The zip code data is just something to learn Pandas with. In the future there's no telling what kind of data this script will process, so I'm looking for a solution that prevents the error regardless of what characters the data contains. Simply dropping the LOCATION field, which is where all of the special characters are in this particular data set, isn't viable.
Has anyone seen this before? Thanks in advance.