1

[ 10-07-2022 - For anyone stopping by with the same issue. After much searching, I have yet to find a way, that isn't convoluted and complicated, to accurately pull mixed type data from excel using Pandas/Python. My solution is to convert the files using unoconv on the command line, which preserves the formatting, then read into pandas from there. ]

I have to concatenate 1000s of individual excel workbooks with a single sheet, into one master sheet. I use a for loop to read them into a data frame, then concatenate the data frame to a master data frame. There is one column in each that could represent currency, percentages, or just contain notes. Sometimes it has been filled out with explicit indicators in the cell, Eg., '$' - other times, someone has used cell formatting to indicate currency while leaving just a decimal in the cell. I've been using a formatting routine to catch some of this but have run into some edge cases.

Consider a case like the following:

In the actual spreadsheet, you see: $0.96

When read_excel siphons this in, it will be represented as 0.96. Because of the mixed-type nature of the column, there is no sure way to know whether this is 96% or $0.96

Is there a way to read excel files into a data frame for analysis and record what is visually represented in the cell, regardless of whether cell formatting was used or not?

I've tried using dtype="str", dtype="object" and have tried using both the default and openpyxl engines.

UPDATE Taking the comments below into consideration, I'm rewriting with openpyxl.

import openpyxl
from openpyxl import load_workbook

def excel_concat(df_source):
    df_master = pd.DataFrame()
        
    for index, row in df_source.iterrows():
        excel_file = Path(row['Test Path']) / Path(row['Original Filename'])
    
        wb = openpyxl.load_workbook(filename = excel_file)
        ws = wb.active
    
        df_data = pd.DataFrame(ws.values)
    
        df_master = pd.concat([df_master, df_data], ignore_index=True)
    
    return df_master

df_master1 = excel_concat(df_excel_files)

This appears to be nothing more than a "longcut" to just calling the openpyxl engine with pandas. What am I missing in order to capture the visible values in the excel files?

EA Bubnoff
  • 195
  • 2
  • 11

1 Answers1

0

looking here,https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html , noticed the following

dtypeType name or dict of column -> type, default None

Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32} Use object to preserve data as stored in Excel and not interpret dtype. **If converters are specified, they will be applied INSTEAD of dtype conversion.**

converters dict, default None

Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the Excel cell content, and return the transformed content.

Do you think that might work for you?

JMS
  • 1
  • 2
  • Afraid not. The issue is that `read_excel` does not seem to pick up cell formatting, just the cell content. So it will require some interpretation in cases like 0.96. Converters and `dtype` settings are limited by this as well. – EA Bubnoff Oct 05 '22 at 18:40