
I've been wrestling with this for a number of days and researching without success. I am trying to write a function that

  1. loops through a directory
  2. opens an Excel file that matches a string pattern
  3. searches that file for a specific worksheet ('importer')
  4. copies the data into a CSV and keeps appending to that CSV until all files are done.
  5. I'd like the function to ignore files that do not include the 'importer' tab, or simply move on to the next file in the FOR loop without doing the remainder ('CSV FILE CREATION').
  6. File creation should only happen where both the file name matches the pattern and the 'importer' worksheet exists.

I feel like I am close but need just a little direction.
def append_all(input_directory):
    for file in os.listdir(input_directory):
        # Qualify if file exist
        if bool(re.search(pattern, file)) == True:
            # Join directory path name to file name
            in_fpath = os.path.join(input_directory, file)
            out_fpath = os.path.join(input_directory, 'history.csv')
            wrkbk = xlrd.open_workbook(in_fpath)
            if wrkbk.sheet_names() == 'importer':
                wrksht = wrkbk.sheet_by_name('importer')
                # Handling excel refresh date value to be used to populate csv file
                refresh_date_float = wrksht.cell_value(1, 4)
                refresh_date_value = xlrd.xldate_as_datetime(refresh_date_float, wrkbk.datemode).strftime(
                    '%Y/%m/%d %H:%M')
                # else:
                # continue

                # CSV FILE CREATION
                # Qualify if file exist. Default returns TRUE
                if os.path.isfile(out_fpath) == False:
                    # os.mkdir(output_directory)
                    # file will be created if it does not exist
                    with open(out_fpath, 'w', newline='') as csvfile:
                        wr = csv.writer(csvfile)
                        # start at row index 3 to skip unnecessary data
                        for rownum in range(3, wrksht.nrows):
                            # wr.writerow(wrksht.row_values(rownum) + list(refresh_date_value))
                            wr.writerow(list(wrksht.row_values(rownum)) + [refresh_date_value])
                            # Start append data
                else:
                    with open(out_fpath, 'a', newline='') as csvfile:
                        wr = csv.writer(csvfile)
                        # start row index 4 to skip header row
                        for rownum in range(4, wrksht.nrows):
                            # wr.writerow(wrksht.row_values(rownum)  + list(refresh_date_value))
                            wr.writerow(list(wrksht.row_values(rownum)) + [refresh_date_value])


csvfile.close()
print('process complete')
bbal20
  • One conventional way of returning to the top of a `for` block is `continue`. I see you have a `continue` commented out in your code. Did it not do what you wanted it to do? If not, what did it do, and what did you want it to do? – Kevin Jul 13 '20 at 18:46
  • [`.sheet_names()`](https://xlrd.readthedocs.io/en/latest/api.html#xlrd.book.Book.sheet_names) returns a list - `wrkbk.sheet_names() == 'importer'` will always *be* False. You should use [in or not in](https://docs.python.org/3/reference/expressions.html#membership-test-operations) – wwii Jul 13 '20 at 18:51
  • @Kevin When I added the 'else: continue' it just runs without error but no file is created in the directory. I think either I am not handling the false scenarios properly or my indentation is wrong? – bbal20 Jul 13 '20 at 20:23
  • @wwii Maybe this is why I am not seeing a file written to the directory. I'm still a little unclear how I use 'in' or 'not in' in a list object. Would it be written like '''if 'importer' in wrkbk.sheet_names():'''? – bbal20 Jul 13 '20 at 20:30
  • `if 'importer' in wrkbk.sheet_names():` - Yes. ... [Check if something is (not) in a list in Python](https://stackoverflow.com/questions/10406130/check-if-something-is-not-in-a-list-in-python) – wwii Jul 13 '20 at 21:52
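  • Following up on the thread, here is a minimal sketch of how the membership test and `continue` fit together. Plain lists stand in for real workbooks, so `keep_workbook` and its inputs are hypothetical, not part of the original code:

```python
# Sketch only: `names` stands in for the list that wrkbk.sheet_names() returns.
def keep_workbook(sheet_names):
    # sheet_names() returns a list, so test membership with `in`, not `==`
    return 'importer' in sheet_names

for names in (['Sheet1', 'importer'], ['Sheet1', 'Sheet2']):
    if not keep_workbook(names):
        continue  # skip workbooks without the 'importer' tab
    print('would process this workbook')
```

With the check written this way, the commented-out `else: continue` is no longer needed: the loop simply falls through to the next file whenever the sheet is missing.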

1 Answer

  • Use .rglob from the pathlib module to find the files with the specified pattern.
    • This is like calling Path.glob() with '**/' added in front of the given relative pattern.
    • The pathlib module offers classes representing filesystem paths with semantics appropriate for different operating systems.
  • It will be far easier to use pandas.read_excel, using the sheet_name parameter, inside a try-except block.
    • The try-except block will try to load the file with the given worksheet name. If the worksheet isn't present, an exception is raised and the script moves on to the next file.
  • Combine all the files into a single dataframe using pandas.concat, and then save it to a csv with .to_csv.
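As a quick check on the `.rglob` bullet above, a small self-contained sketch (temporary files only, so none of the paths refer to real data):

```python
import tempfile
from pathlib import Path

# Build a tiny directory tree to show that rglob recurses while glob does not.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'sub').mkdir()
    (root / 'top.xlsx').touch()
    (root / 'sub' / 'nested.xlsx').touch()

    flat = sorted(p.name for p in root.glob('*.xlsx'))   # top level only
    deep = sorted(p.name for p in root.rglob('*.xlsx'))  # recursive

print(flat)  # ['top.xlsx']
print(deep)  # ['nested.xlsx', 'top.xlsx']
```

`root.rglob('*.xlsx')` behaves the same as `root.glob('**/*.xlsx')`; either form will pick up workbooks in subdirectories as well as the top level.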
from pathlib import Path
import pandas as pd
from xlrd import XLRDError  # raised by pd.read_excel (xlrd engine) when the sheet is missing

p = Path('c:/.../path_to_files')  # path to files

files = list(p.rglob('*.xlsx'))  # get all xlsx files that match the pattern

list_of_dataframes = list()  # list to add dataframe to
for file in files:
    try:
        list_of_dataframes.append(pd.read_excel(file, sheet_name='importer'))  # add dataframe from Excel file to list
    except XLRDError:  # exception because there's no 'importer' worksheet
        print(f'{file} does not have the "importer" worksheet')
        
df = pd.concat(list_of_dataframes)  # combine the dataframes from all the files

df.to_csv('my_combined_files.csv', index=False)  # save to a csv

As a function

def create_csv_from_multiple_xlsx_files(path_to_files: str, filename_pattern: str, save_name: str):
    
    p = Path(path_to_files)  # convert to pathlib object

    files = list(p.rglob(filename_pattern))  # get all xlsx files that match the pattern

    list_of_dataframes = list()  # list to add dataframe to
    for file in files:
        try:
            list_of_dataframes.append(pd.read_excel(file, sheet_name='importer'))  # add dataframe from Excel file to list
        except XLRDError:  # exception because there's no 'importer' worksheet
            print(f'{file} does not have the "importer" worksheet')

    df = pd.concat(list_of_dataframes)  # combine the dataframes from all the files

    df.to_csv(f'{save_name}.csv', index=False)  # save to a csv
    
    
top_level_file_dir = 'c:/.../path_to_files'  # path to files
pattern = '*.xlsx'  # filename pattern
csv_file_name = 'my_combined_files'
create_csv_from_multiple_xlsx_files(top_level_file_dir, pattern, csv_file_name)  # call function
Trenton McKinney