0

Context: I am attempting to automate a report that is rather complicated (not conceptually, just in the sheer volume of things to keep track of). The method I settled on after a lot of investigation was to;

  1. Create a template xlsx file which has a couple summary pages containing formulas pointing at other (raw data) sheets within the file.
  2. Pull data from SQL Server and insert into template file, overwriting the raw data sheet with relevant data.
  3. Publish report (Most likely this will just be moving xlsx file to a new directory).

Obviously, I have spent a lot of time looking at other peoples solutions to this issue (as this topic has been discussed a lot). The issue I have found however is that (at least in my search) none of the methods purposed have worked for me, my belief is that the previously correct responses are no longer relevant in current versions of pandas etc. Rather than linking to the dozens of articles attempting to answer this question, I will explain the issues I have had with various solutions.

  1. Using openpyxl instead of xlsxwriter - This resulted in "BadZipFile: File is not a zip file" response. Which as I understand pertains to the pandas version, or rather the fix (mode='a') does not work due to the pandas version (I believe anything beyond 1.2 has this issue).
  2. Helper Function This does not work however, also throws the BadZipFile error.

Below is a heavily redacted version of the code which should give all the required detail.

#Imports
import os
import pyodbc
import numpy as np
import shutil
import pandas as pd
import datetime
from datetime import date
from openpyxl import load_workbook


# Set database connection variables.
cnxn = pyodbc.connect(*Credentials*)
cursor = cnxn.cursor()

df = pd.read_sql_query(script, cnxn)
df.to_excel(writer, sheet_name = 'Some Sheet',index=False)

writer.close()

Long story short here, I am finding it very frustrating that what should be very very simple is turning into a multiple day long exercise. Please if anyone has experience with this and could offer some insight I would be very grateful.

Finally, I have to admit that I am quite new to using python, though I have not found the transition too difficult until today. Most of the issues I have been having are easily solvable (for me), with the exception of this issue. If there is something I have somehow completely missed, put me on the track and I will not be a bother.

  • You start by saying this is a complicated task, then say it is simple... – Solar Mike Jul 28 '21 at 07:15
  • In my experience, I got `BadZipFile` error when the .xlsx file was encrypted with a password. But that certainly is not the only possible reason it took me a while to figure it out as I had 100s of files that were getting processed. Check the potential reasons here https://stackoverflow.com/questions/33873423/xlsx-and-xlsm-files-return-badzipfile-file-is-not-a-zip-file – shoaib30 Jul 28 '21 at 07:24
  • Also for the code snippet - https://stackoverflow.com/help/minimal-reproducible-example – shoaib30 Jul 28 '21 at 07:26
  • @SolarMike You make a good point. I had to laugh at myself when I read that. To clarify further, I mean the report itself complicated, but the process of inserting a sheet into and excel spreadsheet without deleting existing sheets should (I would have thought) have been simple. – max_settings Jul 28 '21 at 07:42
  • @shoaib30 I do not have the documents password protected. That said, my company does mark files as confidential automatically etc, so that may actually be part of the issue. I will check that out further and report back. I will update the code snippet a bit later, after reading through the link. I am showing my amateur status. – max_settings Jul 28 '21 at 07:49
  • So why not have the template with 4 spare sheets and use them as needed? That way you don't need an add function in the code, rename possibly... – Solar Mike Jul 28 '21 at 07:51
  • @SolarMike Sorry, I am not sure I follow completely. The template I am referring to is not created by Python, rather is an existing spreadsheet. The issue I have is when I try to make Python overwrite the raw data sheets (within the template), it seems to remove everything from the file and then write the raw data sheets. Rather than leaving the existing structure and only overwriting the specified sheets. Sorry if I am missing something obvious in your response. – max_settings Jul 28 '21 at 08:01
  • The code is obviously incomplete. – Charlie Clark Jul 28 '21 at 09:28
  • @CharlieClark The issue is reproducible with what I have posted. As SolarMike said (above), I had posted way too much code to really get at the issue. What is there should allow everyone to understand a lot quicker and reproduce the issue. – max_settings Jul 28 '21 at 23:49

1 Answers1

0

Okay, so I found that I was infact incorrect (big surprise). That is, my statement that the helper function does not work. It does work, the ZipFile issue was most likely caused by some form of protection on the workbook. Funny thing is, I was able to get it working with a new workbook, but when I changed the name of the new workbook it again started throwing the ZipFile error. After awhile of creating new files and trying different things I eventually got it to work.

Two things I would note about the helper function;

  1. It is not particularly efficient. At least not in the way I have set it up. I replaced all instances of 'to_excel' with 'append_df_to_excel' from the helper function. Doing this resulted in run time going from about 1-2 minutes to well over 10. I will do some more testing and see why this might be (I will post back if I find something intersting), but just something to watch for if using larger datasets.
  2. Not an issue as such, but for me to get this to work as expected, I had to alter the function slightly. Specifically, in order to use the truncate feature in my situation, I needed to move the 'truncate' section to be above the 'firstrow' section. In my situation it made more sense to do that, rather than to specify the start row prior to truncating the sheet.

Hope this helps anyone running into the same issue.

Lesson learned, as always the information is out there, its just a matter of actually paying close attention and trying things out rather than copy paste and scratching your head when things aren't working.

  • Now I have had time to test this more thoroughly I have found that the execution time has increased by an unacceptable amount. It has gone from 1-2 minutes to close to an hour. Which is an insane increase. There must be something at play here that needs investigation. The only real changes made where using the 'helper function' listed in the question. The only thing I am aware of that the function itself has done differently is using openpyxl instead of xlsxwriter. Which by itself I would not have expected a performance hit like this. I will post later if I find anything useful out. – max_settings Jul 29 '21 at 07:26