Trying to make a program that takes multiple spreadsheets from multiple workbooks and merge them with there equivalent in the other workbooks

Question

import pandas as pd
 from os import listdir

 from os.path import isfile , join

folder = "c:/sheets"

excel_names = [f for f in listdir(folder) if isfile(join(folder, f))]
 print(excel_names)
 excel_files = []
for item in excel_names:
 item = folder + item excel_files.append(item)
 # read them in
excels = [pd.ExcelFile(name) for name in excel_files]
# turn them into dataframes
 frames = [x.parse(x.sheet_names[0], header=None, index_col=None) 
for x in excels]
# delete the first row for all frames except the first
#  i.e. remove the header row -- assumes it's the first

 frames[1:] = [df[1:] for df in frames[1:]]
# concatenate them..
 combined = pd.concat(frames) combined.to_excel(folder+"/combined.xlsx", header=False, index=False)

This is the code I wrote so far but it seems to work for one sheet only in multiple workbook. What should I do?

Hey! Are you trying to merge multiple sheets from multiple excels into one data frame? — the_good_pony, Oct 30 '19 at 19:25
What's your intended output? a dictionary of concatenated dataframes, split by worksheet? — Umar.H, Oct 30 '19 at 19:25
@Datanovice yes i'm trying to collect data from multiple sheets from multiple workbook and merge them into one work book (speared sheets) — Mohamed Hassan, Oct 30 '19 at 19:31
Hey @MohamedHassan are the sheets all the same shape? Like the fields and data types ? — the_good_pony, Oct 30 '19 at 19:31
@the_good_pony yes every sheet will be the same shape with his equivalent sheets in other workbooks but every sheet may have different shape in the same workbook — Mohamed Hassan, Oct 30 '19 at 19:34

score 1 · Answer 1 · answered Oct 30 '19 at 20:15

I think this might help.

I have 3 Excel files in a directory which have data which looks like this

Some of the spreadsheets have multiple sheets. All have the same shape of data, i.e. the same columns and data types.

Import the packages, i'm using glob, just a personal preference for getting file lists

import pandas as pd
import glob

Iterate through the directory grabbing each file, grabbing all the sheets in the spreadsheet and using this get a dataframe for each sheet. Each dataframe gets written to a list.

ls_df = []

for file in glob.glob('/location/to/file/*'):
    print(file)
    for sheet in pd.ExcelFile(file).sheet_names:
        ls_df.append(pd.read_excel(file,sheet))

Then use the list to concat to make a dataframe of all the spreadsheets and their sheets.

df = pd.concat(ls_df)

You end up with a dataframe which looks like this

Export to excel

df.to_excel('test_4.xlsx', index = 0 )

there is an error C:/Users/mohamed.h.mohamad/.spyder-py3/temp.py:9: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version of pandas will change to not sort by default. To accept the future behavior, pass 'sort=False'. To retain the current behavior and silence the warning, pass 'sort=True'. — Mohamed Hassan, Oct 30 '19 at 20:43
Hey @MohamedHassan sounds like your spreadsheets may not have exactly the same columns. This answer might help https://stackoverflow.com/questions/50501787/python-pandas-user-warning-sorting-because-non-concatenation-axis-is-not-aligne — the_good_pony, Oct 30 '19 at 21:01

Umar.H · Accepted Answer · 2019-10-31T10:06:14.500

0

IIUC,

This will only work if your dataframe has the same shape (column wise) for each Sheet else you'll need to modify and adapt the code to your use-case.

Step 1: list xlsx files.

from pathlib import Path
excels = [f for f in Path('c:/sheets').glob('*.xlsx')]

Step 2: Create a list of ExcelFile objects.

xls = [pd.ExcelFile(f) for f in excels]

Step 3: Create a dictionary of key,value pairs split by worksheets.

d = {}

for file in xls:
    for sheet in file.sheet_names:
        if sheet not in d:
            d[sheet] = [] # creates the key which is unique
        else:
            pass # ensures we don't overwrite 
        df = pd.read_excel(file,sheet_name=sheet)
        d[f'{sheet}'].append(df)

Finally, concat the sheets into a dataframe_dictionary:

df_dict = {}
for k,v in d.items():
    df_dict[f'{k}'] = pd.concat(v)

print(df_dict['Sheet1'])
      DatA   Vals
0     A  16577
1     B  37287
2     C  32761
0     A  38707
1     B  17301
..  ...    ...
1     B   6292
2     C  49310
0     A   4401
1     B  12170
2     C   8978

[150 rows x 2 columns]

Tested on 50 xlsx files with two sheets both of which had the same number of columns rows but differing data.

Can't see it not working on your sample case.

edited Oct 31 '19 at 10:06

answered Oct 30 '19 at 21:32

Umar.H

22,559
7
39
74

wrong syntax [f for f in Path.('c:/sheets').glob('*.xlsx')] ^ SyntaxError: invalid syntax – Mohamed Hassan Oct 31 '19 at 06:48
is there some thing u missed? – Mohamed Hassan Oct 31 '19 at 06:49
@Mohamed Hasan slight typo check now :) – Umar.H Oct 31 '19 at 10:06
this read them but doesn't collect them in one workbook is there more lines need to be done or something? – Mohamed Hassan Nov 02 '19 at 08:15
you request clearly said you wanted a collection of dataframes per sheet of your excel files, which is what the above gives you? what output are you getting because it works for me. – Umar.H Nov 02 '19 at 15:26
yeah I want to collect them and put them in another workbook as individual sheets or why I would bother to collect them in the first place – Mohamed Hassan Nov 02 '19 at 17:38

Trying to make a program that takes multiple spreadsheets from multiple workbooks and merge them with there equivalent in the other workbooks

2 Answers2

Step 1: list xlsx files.

Step 2: Create a list of ExcelFile objects.

Step 3: Create a dictionary of key,value pairs split by worksheets.

Finally, concat the sheets into a dataframe_dictionary: