0

I have a lot of reports that I want to compile into a single dataframe in python.

This code works to loop through my directory and read all of the report files where the sheet name is the same in every file... I have many sheets in each workbook but only want to find the sheet_names that contain a specific string, 'Report'.

import pandas as pd
from pathlib import Path
import os
import glob

pathsting= 'path/to/working/directory'
rootdir = Path(pathsting)
onlydirs = [f for f in os.listdir(rootdir) if os.path.isdir(os.path.join(rootdir, f))]

df0 = pd.DataFrame()
for direct in onlydirs:
    print(direct)
    dirpathstring = pathsting + '\\' + direct
    dirpath = Path(dirpathstring)
    onlyfiles = [f for f in os.listdir(dirpath) if os.path.isfile(os.path.join(dirpath, f))]
    for f in dirpath.glob("*Report.xlsm"):
        print(f.name)
        temp = pd.read_excel(f, sheet_name='Report')
        df0 = pd.concat([df0, temp])
display(df0)

Now suppose that over time the report changes formatting and instead of sheet_name='Report' it becomes sheet_name='XYZ Report'. I have many reports and the name changes a few times. I do not want to hard code all possible report names in multiple different loops.

I was able to use glob to read all files that end in 'Report.xlsm', but is there a similar method to read sheet_names that contain the text 'Report' instead of the exact string?

MDR
  • 2,610
  • 1
  • 8
  • 18

2 Answers2

0

You need to write a function that reads the sheet name to see if it contains the word "Report". This may help you to obtain the sheet name:

How to obtain sheet names from XLS files without loading the whole file?

Brndn
  • 676
  • 1
  • 7
  • 21
0

Try:

import pandas as pd
import glob
import re

path = r'./files' # use your path
all_files = glob.glob(path + "/*.xlsm")

# case insensitive pattern for file names like blahReportblah or fooreportingss etc.  Modify as required if necessary.
pattern = r'(?i)(.*report.*)'

# create empty list to hold dataframes from sheets found
dfs = []

# for each file in the path above ending .xlsm
for file in all_files:
    #if the file name has the word 'report' or even 'rEpOrTs' in it
    if re.search(pattern, file):
        #open the file
        ex_file = pd.ExcelFile(file)
        #then for each sheet in that file
        for sheet in ex_file.sheet_names:
            #check if the sheet has 'RePORting' etc. in it
            if re.search(pattern, sheet):
                #if so create a dataframe (maybe parse_dates isn't required).  Tweak as required
                df = ex_file.parse(sheet, parse_dates=True)
                #add this new (temp during the looping) frame to the end of the list
                dfs.append(df)
            else:
                #if sheet doesn't have the word 'report' move on, nothing to see here
                continue
    else:
        #if file doesn't have the word 'report' move on, nothing to see here
        continue

#handle a list that is empty
if len(dfs) == 0:
    print('No file or sheets found.')
    #create a dummy frame
    df = pd.DataFrame()
#or have only one item/frame and get it out
elif len(dfs) == 1:
    df = dfs[0]
#or concatenate more than one frame together
else:
    df = pd.concat(dfs, ignore_index=True)
    df = df.reset_index(drop=True)

#check what you've got
print(df.head())
MDR
  • 2,610
  • 1
  • 8
  • 18