0

I am using pandas to compare two large data frames that have been pulled from two Excel workbooks, and after merging and comparing differences, I noticed that the cells that didn't merge have unusual accents and trademark symbols that they did not have in the Excel workbooks. I surmise that these symbols are why the merges didn't occur, but I am not sure why they are showing up in the pandas dataframes or how to get rid of them/get them to be recognized.

I am not sure if this an encoding issue within Python, or an importing issue from Excel --> Python.

The symbols only occur in one column of the dataframe, called 'com_it'. I've tried to replace the symbols using the following, but it did not work:

com['com_it'].replace(u"\u00AE", '', inplace=True)
pano2['com_it'].replace(u"\u00AE", '', inplace=True)

Any help with how to get the symbols to disappear or otherwise get these cells merged would be greatly appreciated!

Sam
  • 209
  • 2
  • 11
  • Try to save as csv in excel then import as csv. Then can you share part of the excel? – Nurislom Rakhmatullaev Feb 25 '21 at 19:49
  • Thank you. I unfortunately can't share the Excel because it's proprietary information, but I saved in excel and then imported as csv. It still did not allow the cells to merge – Sam Feb 25 '21 at 20:08

2 Answers2

0

It is an issue with how pandas is handling and interpreting the character encoding of the excel file(s). Not sure what type of data is in ['com_it']. Try the two options below when you originally import the excel workbooks. I believe option 1 will solve your issue. If not, try option 2.

Option 1

Try defining the engine as openpyxl when creating the dataframe.

import pandas as pd 

com = pd.read_excel(excel_file_path, engine='openpyxl')
pano2 = pd.read_excel(excel_file_path, engine='openpyxl')

Make sure your pandas/numpy is up to date and that you have openpyxl installed.

pip install openpyxl

PyPi openpyxl

Option 2

Pass the character encoding when creating the dataframe file(s).

import pandas as pd 

com = pd.read_excel(excel_file_path, encoding='utf-8')
pano2 = pd.read_excel(excel_file_path, encoding='utf-8')
nahar
  • 41
  • 5
  • Thank you! I responded to your suggestions in an answer, because my response was too long to fit into a comment – Sam Feb 25 '21 at 21:46
  • hmm ok lets try a couple more things. Not sure if you have fixed yet. Try different encodings, ie encoding='cp1252', etc. If you don't want to keep guessing, use chardet to detect encoding: [link](https://stackoverflow.com/questions/15918314/how-to-detect-string-byte-encoding) Did pd.read_excel(file_path, engine='openpyxl') work? Last resort: loop through the com_it columns before merging: def changeencode(data, cols): for col in cols: data[col] = data[col].str.decode('iso-8859-1').str.encode('utf-8') return data – nahar Feb 26 '21 at 13:57
  • Thanks. 1. I have tried to encode the data frames differently, but it doesn't seem to work. In Option 2 above, the error message reads: read_excel() got an unexpected keyword argument 'encoding'. 2. The pd.read_excel(file_path, engine='openpyxl') did not produce any changes. 3. I'm relatively new to loops. This is what I wrote after loading the dataset, but it did not give any changes. I think I am missing something:# def changeencode(data, cols): # for col in cols: # data[col] = data[col].str.decode('iso-8859-1').str.encode('utf-8') # return data – Sam Feb 26 '21 at 15:01
0

Thank you so much for this! I have been using a tk interface to automatically pull files instead of input them directly into the code. When I ran Option 2 using this, it produced an error. I have pasted the code below. Is there another way to encode for UTF-8?

import pandas as pd
import numpy as np
import openpyxl
from openpyxl.workbook import Workbook
from openpyxl.utils.dataframe import dataframe_to_rows
from openpyxl import load_workbook
from openpyxl.worksheet.table import Table
import tkinter as tk
from tkinter import filedialog
from tkinter import messagebox
import xlrd
import test_functions as tf 
import string


## UI - Asking user for their input and output files
root = tk.Tk()
root.withdraw()
root.pano2 =  filedialog.askopenfilename(initialdir = "C:/",title = "Select the location of your Pano file", filetypes = (("Excel Files","*.xlsx"),("all files","*.*")))
root.com = filedialog.askopenfilename(initialdir = "C:/",title = "Select the location of your Com file", filetypes = (("Excel Files","*.xlsx"),("all files","*.*")))
root.save_location = filedialog.asksaveasfilename(initialdir = "C:/",title = "Input name of file to be saved", defaultextension = ".xlsx", filetypes = (("Excel Files","*.xlsx"),("all files","*.*")))

# ## Importing datasets
pano2 = pd.read_excel(root.pano2, encoding = 'utf-8') 
com = pd.read_excel(root.com, encoding = 'utf-8')
Sam
  • 209
  • 2
  • 11