1

I need to extract an Excel worksheet from multiple workbooks and saving it to a dataframe and in turn saving that dataframe.

I have a spreadsheet that is generated at the end of each month (e.g.
June 2019.xlsx, May 2019.xlsx, April 2019.xlsx).
I need to grab a worksheet 'Sheet1'from each of these workbooks and convert these to a dataframe (df1).

I would like to have this dataframe saved.

As a nice to have, I would also like some way just to append the next month's data after the initial 'data grab'.

I'm relatively new to this, so I haven't made much progress.

import os
import glob
import pandas as pd 
import xlrd
import json
import io
import flatten_json

files = glob.glob('/Users/ngove/Documents/Python Scripts/2019/*.xlsx')
dfs={}
 for f in files: 
  dfs[os.path.splitext(os.path.basename(f))[0]] = pd.read_excel(f)
MBA Coder
  • 374
  • 1
  • 3
  • 13
NimG
  • 29
  • 4

2 Answers2

0

I interpreted your statement that you want to save the dataframe as that you want to save it as a combined Excel file. This will combine all files in the folder specified that end in xlsx.

import os
import pandas as pd
from pandas import ExcelWriter

os.chdir("H:/Python/Reports/") #edit this to be your path
path = os.getcwd()
files = os.listdir(path)
files_xlsx = [f for f in files if f[-4:] == 'xlsx']

df = pd.DataFrame()
for f in files_xlsx:
    data = pd.read_excel(f, 'Sheet1')
    df = df.append(data)

writer=ExcelWriter('Combined_Data.xlsx')
df.to_excel(writer,'Sheet1',index=False)
writer.save()

You could update the code to grab all 2019 files by changing the one line to this:

files_xlsx = [f for f in files if f[-9:] == '2019.xlsx']

I referenced this question for most of the code and updated for xlsx and added the file save portion of the code

MBA Coder
  • 374
  • 1
  • 3
  • 13
0

You can drop all of your files in a directory (e.g. current directory). Then append all of your excel files in a list (e.g. files_xls). Iterate over all your files and use pandas.read_excel to get the respective dataframes (e.g. list_frames).

Below, you can find an example:

import os
import pandas as pd

path = os.getcwd() # get cur dir
files = os.listdir(path) # get all the files in your cur dir 
# get only the xls or xlsm (this depends on you)
files_xls = [f for f in files if (f[-3:] == 'xls' or f[-4:] == 'xlsm')]

df = pd.DataFrame()
list_frames = []

for f in files_xls:
    print("Processing file: %s" %f)
    try:
        # the following will give you the dataframe
        # the fun params depends on your data format
        data = pd.read_excel(f, 'Sheet1', header=0, index_col=None,
                             sep='delimiter', error_bad_lines=False,
                             skip_blank_lines=True, comment=',,')

    except:
        pass
    list_frames.append(data)

# at the end you can concat your data if you want and remove any dublicate
df = pd.concat(list_frames, sort=False).fillna(0)
df = df.drop_duplicates()

# at the end you can save it
writer = pd.ExcelWriter("your_title" + ".xlsx", engine='xlsxwriter')
df.to_excel(writer, sheet_name="Sheets1", index=False)   
writer.save()

I hope this helps.

GioR
  • 596
  • 2
  • 8
  • 19