Force computation of calculated columns in .xlsx imported via pd.ExcelFile

Question

Suppose my data.xlsx's first sheet contains some computed columns.

I'm trying to pull out a pd.DataFrame of that sheet that holds the computed values.

But try as I may, I cannot achieve this.

Fails:

# > pip install openpyxl
import pandas as pd

pd.read_excel(f'data.xlsx', 'firstSheetName')
# NOTE: Adding `, engine='openpyxl'` makes no difference

df_nodal.head()

This gives NaN in all calculate fields.

xl = pd.ExcelFile(f'data.xlsx')
df = xl.parse('firstSheetName')
df.head()

Same.

how to read xlsx as pandas dataframe with formulas as strings

from openpyxl import load_workbook

wb = load_workbook(filename = f'data.xlsx')
ws = wb['mySheetName']
df = pd.DataFrame(ws.values)
df.head()

Now this is giving the formulae: =H2, =H3 etc. in the cells.

An attempt to 'type-convert' these colums failed:

df[12][2:].astype(float)
# ValueError: could not convert string to float: '=H3'

How to force pandas to evaluate formulas of xlsx and not read them as NaN? might offer a solution, which involves saving and reloading the .xlsx. However I can't get it working. That syntax appears invalid.

import pandas as pd, xlwings as xw
def df_from_excel(path):
    book = xw.Book(path)
    book.save()
    return pd.read_excel(path,header=0)
df = df_from_excel('nodal0.xlsx')

This gives XlwingsError: Make sure to have "appscript" and "psutil", dependencies of xlwings, installed.

And pip install appscript psutil says they're both already installed.

Note: Same idea here: Pandas read_excel with formulas and get values

I'm trying to find a way for it to render into a dataframe, which will then contain numeric values.

Is there any way to do it?

EDIT:

Here's what I'm dealing with:

The raw .xlsx is shown below. I've double-clicked a calculated cell revealing the underlying =H2.

Notice the corresponding cell of the dataframe (generated from this .xlsx) is showing NaN

Did you try using [`pandas.read_excel`](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html)? If not, see if [this](https://stackoverflow.com/a/64986290/7109869) is of help. — Gonçalo Peres, Oct 06 '22 at 12:30
What about `df = pd.read_excel(path, engine='openpyxl')`, where `path = r'C:\Users\Username\Desktop\nodal0.xlsx'` (change to your specific path)? — Gonçalo Peres, Oct 06 '22 at 13:21
In order to see if I can help would need to have access to the file. I've built one random file to test locally, and specifying the engine in `pd.read_excel` worked well. — Gonçalo Peres, Oct 06 '22 at 13:39
Alas the data is the IP of a client. Also it is a 100MB file. I've added a screenshot to the bottom of the question that shows what I'm dealing with. — P i, Oct 06 '22 at 15:59
@CharlieClark How to implement? `df_nodal[12][2:].astype(float)` gives `ValueError: could not convert string to float: '=H3'` — P i, Oct 07 '22 at 16:23

Force computation of calculated columns in .xlsx imported via pd.ExcelFile

0 Answers0