0

I want to loop through many Excel files in a single folder and pull information contained in certain tabs only if those tabs contain a certain string value. So for example, one spreadsheet may have 20 tabs, but I only want the tab (and it's information) that contains the string "Apples" somewhere in that tab. (It looks like it is always located in the first row). I then want to aggregate all of these tabs into one spreadsheet. This problem is unique from previous SO questions because my tabs are not uniformly named. Sometimes, the tab I want is called "Apple Sauce" and other times it's called "Apple Jacks". This is why I need to look in the tab itself for my string and I can't rely on just specifying the sheet name.

I have written the following code so far:

    import pandas as pd
    import os

    root = r"my_dir"

    agg_df = pd.DataFrame()

    for directory, subdirectory, files in os.walk(root):
        for file in files:
            if file.endswith('.xlsm'):
                filepath = os.path.join(directory, file)
# I want to do some kind of if statement here maybe to say if sheet_name.contains("Apples")
                df_temp = pd.read_excel(filepath)
                df_temp['Filepath'] = filepath
                agg_df = agg_df.append(df_temp)
thedatasleuth
  • 539
  • 4
  • 22
  • Possible duplicate of [Reading/parsing Excel (xls) files with Python](https://stackoverflow.com/questions/2942889/reading-parsing-excel-xls-files-with-python) – VanBantam Apr 19 '19 at 17:51

1 Answers1

1

You can use re module along with ExcelFile function to read the sheet_names and iterate through those to read every single sheet that meets certain regex criteria.

import pandas as pd
import os
import re

root = r"Directory_path"

agg_df = pd.DataFrame()

for directory, subdirectory, files in os.walk(root):
    for file in files:
        if file.endswith('.xlsm'):
            filepath = os.path.join(directory, file)
            xls = pd.ExcelFile(filepath)
            for i in xls.sheet_names:
                if re.search(r'Apple', i):
                    df_temp = pd.read_excel(filepath, sheet_names=i)
                    #Continue the data processing that you want to carry on here and append the result to a dataframe
amrtw09
  • 183
  • 1
  • 8
  • But won't this just iterate through the actual sheet names? I unfortunately need to look within the actual sheets since my sheet names are not uniformly named – thedatasleuth Apr 19 '19 at 18:36