1

Let's say we have many excel files with the multiple sheets as the following file data1.xlsx:

Sheet 1: 2021_q1_bj

   a  b   c  d
0  1  2  23  2
1  2  3  45  5

Sheet 2: 2021_q2_bj

   a  b   c  d
0  1  2  23  6
1  2  3  45  7

Sheet 3: 2019_q1_sh

   a  b   c
0  1  2  23
1  2  3  45

Sheet 4: 2019_q2_sh

   a  b   c
0  1  2  23
1  2  3  40

I need to obtain sheet name for each sheet, then split them by _, store the first part as year, the second part as quarter, and the last part as city.

Finaly I will save them back to excel file with multiple sheets.

ie., for the first sheet:

   a  b   c  d  year quarter city
0  1  2  23  2  2021      q1   bj
1  2  3  45  5  2021      q1   bj
2  1  2  23  6  2021      q1   bj
3  2  3  45  7  2021      q1   bj

How could I achive this in Python? Thanks.

To loop all the excel files:

base_dir = './'
file_list = os.listdir(base_dir)

for file in file_list:
    if '.xlsx' in file:
        file_path = os.path.join(file_path, )
        dfs = pd.read_excel()
ah bon
  • 9,293
  • 12
  • 65
  • 148
  • 1
    with `sheet_name=None`, `pandas.read_excel` reads all sheets from a file into a dict of dataframes, where the sheetnames are the dict keys. Given that, it should be easy enough to loop through all the files and do this in a loop. – Trenton McKinney Aug 17 '21 at 00:46

1 Answers1

1

You can use use f = pd.ExcelFile('data1.xlsx') to read the excel file in as an object, then loop through the list of sheet names by iterating through f.sheet_names, splitting each sheet name such as the "2019_q1_sh" string into the appropriate year, quarter, city and setting these as values of new columns in the DataFrame you are reading in from each sheet.

Then create a dictionary with sheet names as keys, and the corresponding modified DataFrame as the values. You can create a custom save_xls function that takes in such a dictionary and saves it, as described in this helpful answer.

Update: since you want to loop through all excel files in your current directory, you can use the glob library to get all of the files with extension .xlsx and loop through each of these files, read them in, and save a new file with the string new_ in front of the file name

import pandas as pd
from pandas import ExcelWriter
import glob

"""
Save a dictionary of dataframes to an excel file, with each dataframe as a separate page

Reference: https://stackoverflow.com/questions/14225676/save-list-of-dataframes-to-multisheet-excel-spreadsheet
"""
def save_xls(dict_df, path):
    writer = ExcelWriter(path)
    for key in dict_df:
        dict_df[key].to_excel(writer, key)
    writer.save()

## loop through all excel files
for filename in glob.glob("*.xlsx"):
    f = pd.ExcelFile(filename)
    dict_dfs = {}
    for sheet_name in f.sheet_names:
        df_new = f.parse(sheet_name = sheet_name)

        ## get the year and quarter from the sheet name 
        year, quarter, city = sheet_name.split("_")
        df_new["year"] = year
        df_new["quarter"] = quarter
        df_new["city"] = city

        ## populate dictionary 
        dict_dfs[sheet_name] = df_new 

    save_xls(dict_df = dict_dfs, path = "new_" + filename)
Derek O
  • 16,770
  • 4
  • 24
  • 43
  • Thanks, if I would like to loop all the excel file and apply method above as you posted to each one? – ah bon Aug 17 '21 at 03:15
  • 1
    Yes you can do that - create an outer loop before you define the pd.ExcelFile object `f` and end the loop after you run `save_xls`. you'll also want to make sure that you don't hardcode the filenames like I did – Derek O Aug 17 '21 at 14:16
  • 1
    @ahbon sure you can use a package like `glob` to read in all files with extension `.xlsx` – Derek O Aug 17 '21 at 14:56
  • 1
    Many thanks for your kind help, I'll test with my real data, have a good day. – ah bon Aug 17 '21 at 15:11
  • No problem! Glad my answer was helpful – Derek O Aug 17 '21 at 16:36