0

I have a large data set that I pull into pandas with read-excel. I use the data to create a new column then write the new column to Excel with openpyxl. The issue is that if I read that file again, the original data will be read as blank values. The only way around this I have found so far is to re-write the original array. Why does this happen?

Update 1: This is true even if I change the pd.read_excel(engine='openpyxl')

Update 2: To clarify here are examples:

test.xlsm

test.xlsm

import pandas as pd
import openpyxl as xl
from openpyxl.utils.dataframe import dataframe_to_rows

df = pd.read_excel('test.xlsm', engine='openpyxl')

print(df)

df['Col_4'] = df['Col_3'].apply(lambda x: x*2)

wb = xl.load_workbook('test.xlsm')
ws = wb.active
df = df['Col_4'].to_frame()
rowsdf = dataframe_to_rows(df, index = False, header = False)
for row_indexdf, rowdf in enumerate(rowsdf, 2):
    for column_indexdf, valuedf in enumerate(rowdf, 4):
        ws.cell(row=row_indexdf, column = column_indexdf, value = valuedf)

wb.save('test.xlsm')

First time running:

   Col_1  Col_2  Col_3
0      1      6     11
1      2      7     12
2      3      8     13

Second time running:

   Col_1  Col_2  Col_3  Unnamed: 3
0      1    NaN    NaN          22
1      2    NaN    NaN          24
2      3    NaN    NaN          26

Why are the second two columns now blank? I did not write over them...I do not think.

1 Answers1

0

Neither openpyxl nor pandas evaluate excel formulas. Opening a file in pandas is equivalent to opening the file with openpyxl in data_only=True mode.

So here, when you save the file that you've written, a bunch of NaNs appear because excel never had the chance to evaluate the changed sheet.

Interesting enough, if you open the file with openpyxl and further investigate one of the NaN cells, the expected formula is still there!

BeRT2me
  • 12,699
  • 2
  • 13
  • 31