0

I am trying to extract the data from xlsx file from the multiple folders and also to get the folder name for each row to identify where the is extracted from. I am able to extract the data from all the folders however i am unable to get the folder names to the dataframe. Please help.

Folder structure -

Month-Year - 2020-02 Day folder - 2020-02-01

Under the day folder consist of xlsx files.

paths = []
arr = []

for root, dirs, files in os.walk(Full_Path):
    for file in files:
        if file.endswith(".xlsx"):
             #print(os.path.join(root, file))
             ab = os.path.join(root, file)
             print(ab)
             arr.append(paths)   
             paths.append(ab)


for lm in arr:
    print(lm)   


all_data = pd.DataFrame()
for f in paths:
    df = pd.read_excel(f, sheet_name='In Queue', usecols = fields)
    df['Date'] = lm 
    all_data = all_data.append(df,ignore_index=True)

I have also tried different ways but not getting the output.

Natesh
  • 3
  • 2
  • One of the stone tablet commandments of pandas: [Never call `DataFrame.append` or `pd.concat` inside a for-loop. It leads to quadratic copying.](https://stackoverflow.com/a/36489724/1422451) – Parfait Feb 12 '20 at 15:14

1 Answers1

1

Consider building a list of data frames to be concatenated once outside the loop. Also, use assign to generate the Date column during the loop:

df_list = []

for root, dirs, files in os.walk(Full_Path):
    for file in files:
        if file.endswith((".xls", ".xlsx", ".xlsb", ".xlsm", ".odf")):
             xl_file = os.path.join(root, file)
             df = (pd.read_excel(xl_file, sheet_name='In Queue', usecols = fields)
                     .assign(Date = xl_file))

             df_list.append(df)


final_df = pd.concat(df_list, ignore_index=True)
Parfait
  • 104,375
  • 17
  • 94
  • 125