11

Hello I would like to concatenate three excels files xlsx using python.

I have tried using openpyxl, but I don't know which function could help me to append three worksheet into one.

Do you have any ideas how to do that ?

Thanks a lot

Auré Vat
  • 139
  • 1
  • 1
  • 3
  • this typically does not work ... in my experience you must read in all 3 work xls files . then manually merge them (somehow) , then write out to a new xls file ... – Joran Beasley Apr 03 '13 at 16:54

6 Answers6

26

Here's a pandas-based approach. (It's using openpyxl behind the scenes.)

import pandas as pd

# filenames
excel_names = ["xlsx1.xlsx", "xlsx2.xlsx", "xlsx3.xlsx"]

# read them in
excels = [pd.ExcelFile(name) for name in excel_names]

# turn them into dataframes
frames = [x.parse(x.sheet_names[0], header=None,index_col=None) for x in excels]

# delete the first row for all frames except the first
# i.e. remove the header row -- assumes it's the first
frames[1:] = [df[1:] for df in frames[1:]]

# concatenate them..
combined = pd.concat(frames)

# write it out
combined.to_excel("c.xlsx", header=False, index=False)
DSM
  • 342,061
  • 65
  • 592
  • 494
  • 1
    thanks DSM, it looks much simplier, however panda needs numpy which can't run on Windows 64bits machine. Do you think I could try to do the same thing with openpyxl – Auré Vat Apr 03 '13 at 19:41
  • 2
    @AuréVat: Numpy needs 32-bit *Python*, not 32-bit Windows. Lots of people run 32-bit Python on 64-bit Windows. You can have multiple Python environments on one machine. – John Y Apr 03 '13 at 23:00
  • @AuréVat: Also, there is at least one good unofficial 64-bit numpy build: http://stackoverflow.com/questions/11200137/installing-numpy-on-64bit-windows-7-with-python-2-7-3 – John Y Apr 03 '13 at 23:01
  • This code will append the first file twice - once with header, and next without header. The line `frames[1:]= [df[1:] for df in frames]` needs to be changed to `frames_new=[df[1:] for df in frames]` and then concatenate this new frame by changing `combined = pd.concat(frames)` to `combined = pd.concat(frames_new)` – BajajG Jan 25 '16 at 07:07
  • @Neha: ah, or just use `frames[1:]` in the listcomp, which is what I'd intended. :-/ – DSM Jan 25 '16 at 13:53
  • @DSM: This works for small number of files but for a large number of files (number of files= 88 and total number of rows after concatenation < 2^20), it shows an exception in workbook destructor: `Exception Exception: Exception('Exception caught in workbook destructor. Explici t close() may be required for workbook.',) in > ignored` Any thoughts on that? – BajajG Jan 30 '16 at 07:09
  • 1
    Just a note, i tried this with about 60 files, it hung for a long time and then finally produced a broken result. i tried reducing it to only 2 files and then got a missing module warning. after installing openpyxl, and running it again with the 60 files, it was super fast. not sure why that happened. but interesting gotcha. – deweydb Mar 22 '16 at 00:32
  • When I use the above code the data from the second workbook is pasted on the first empty row so it ends up below and to the right of the data from the first workbook. I want to have it directly to the right of the data from the first workbook. How could I alter the code to do this? – DavidK11 Jul 24 '17 at 12:11
  • @DSM I know its old post , but need a help , how to tweak `frames[1:] = [df[1:] for df in frames[1:]]` so that for the second excel header is not removed. I want only header removed for the first excel.. – RonyA Aug 28 '18 at 19:58
  • My computer gets so slower when I try to concatenate 107 big excel files. To deal with this, I avoided the declaration of lists, and also I saved the concatenated DataFram as csv. This last modification is due to the final DF contains more than 3,000,000 lines. To avod declaring lists use: `combined = pd.concat((df[1:]for df in (x.parse(x.sheet_names[0], header=None,index_col=None)for x in (pd.ExcelFile(name)for name in tqdm(excel_names)))))`. To save the DF as tsv use `combined.to_csv(f"{out_dir}concatenado.csv", header=False, index=False)` – Antonio Ramírez Feb 15 '21 at 20:46
  • doesn't work normal for files with some not simple formatting – Demetry Pascal Mar 25 '22 at 10:29
9

I'd use xlrd and xlwt. Assuming you literally just need to append these files (rather than doing any real work on them), I'd do something like: Open up a file to write to with xlwt, and then for each of your other three files, loop over the data and add each row to the output file. To get you started:

import xlwt
import xlrd

wkbk = xlwt.Workbook()
outsheet = wkbk.add_sheet('Sheet1')

xlsfiles = [r'C:\foo.xlsx', r'C:\bar.xlsx', r'C:\baz.xlsx']

outrow_idx = 0
for f in xlsfiles:
    # This is all untested; essentially just pseudocode for concept!
    insheet = xlrd.open_workbook(f).sheets()[0]
    for row_idx in xrange(insheet.nrows):
        for col_idx in xrange(insheet.ncols):
            outsheet.write(outrow_idx, col_idx, 
                           insheet.cell_value(row_idx, col_idx))
        outrow_idx += 1
wkbk.save(r'C:\combined.xls')

If your files all have a header line, you probably don't want to repeat that, so you could modify the code above to look more like this:

firstfile = True # Is this the first sheet?
for f in xlsfiles:
    insheet = xlrd.open_workbook(f).sheets()[0]
    for row_idx in xrange(0 if firstfile else 1, insheet.nrows):
        pass # processing; etc
    firstfile = False # We're done with the first sheet.
Henry Keiter
  • 16,863
  • 7
  • 51
  • 80
  • 1
    I didn't think `xlrd` could handle `.xlsx` files, but `openpyxl` could. Am I mistaken? – DSM Apr 03 '13 at 17:09
  • 1
    @DSM It can handle `.xlsx`, at least I've never had trouble with it. – Henry Keiter Apr 03 '13 at 17:14
  • 1
    Well, I'll be -- looks like I'm behind the times! I only had 0.7.1 installed, and it was giving `XLRDError`s, but 0.9.0 worked just fine on them. Learn something new every day! – DSM Apr 03 '13 at 17:20
  • @AuréVat See my edit, but remember that the code I gave is only a starting point. – Henry Keiter Apr 03 '13 at 17:48
  • Thanks I will play with it, and get back if I have more questions thanks a lot – Auré Vat Apr 03 '13 at 17:57
  • After playing with it I added "wkbk.save('testall.xlsx')" but this file once created had issue when I opened it with Excel. "Error because the file format or extension is not valid. – Auré Vat Apr 03 '13 at 18:16
  • @AuréVat That sounds like a different problem that calls for different debugging. If you need help with it, check the [documentation](https://secure.simplistix.co.uk/svn/xlwt/trunk/xlwt/doc/xlwt.html), use some `print` statements, or open a different question here on SO, but the comments section is not the place for me to try to debug your code. A brief glance at the doc seems to suggest that `xlwt` can't write to `.xlsx` yet, so save it as `.xls` instead. – Henry Keiter Apr 03 '13 at 18:23
  • @HenryKeiter will that issue be with XLWT being incompatible with xlsx : -> http://stackoverflow.com/questions/15738904/does-xlwt-support-xlsx-format – Auré Vat Apr 03 '13 at 18:23
  • @AuréVat See my previous comment. – Henry Keiter Apr 03 '13 at 18:45
  • If you are going to use `xlrd` anyway, then your best bet for writing .xlsx files is `XlsxWriter`, which is faster, more full-featured, and in my opinion easier to use than `openpyxl`. (The advantage of `openpyxl` is that it is a single package that does both reading and writing.) – John Y Apr 03 '13 at 23:04
6

When I combine excel files (mydata1.xlsx, mydata2.xlsx, mydata3.xlsx) for data analysis, here is what I do:

import pandas as pd
import numpy as np
import glob

all_data = pd.DataFrame()
for f in glob.glob('myfolder/mydata*.xlsx'):
   df = pd.read_excel(f)
   all_data = all_data.append(df, ignore_index=True)

Then, when I want to save it as one file:

writer = pd.ExcelWriter('mycollected_data.xlsx', engine='xlsxwriter')
all_data.to_excel(writer, sheet_name='Sheet1')
writer.save()
3

Solution with openpyxl only (without a bunch of other dependencies).

This script should take care of merging together an arbitrary number of xlsx documents, whether they have one or multiple sheets. It will preserve the formatting.

There's a function to copy sheets in openpyxl, but it is only from/to the same file. There's also a function insert_rows somewhere, but by itself it won't insert any rows. So I'm afraid we are left to deal (tediously) with one cell at a time.

As much as I dislike using for loops and would rather use something compact and elegant like list comprehension, I don't see how to do that here as this is a side-effect show.

Credit to this answer on copying between workbooks.

#!/usr/bin/env python3

#USAGE
#mergeXLSX.py <a bunch of .xlsx files> ... output.xlsx
#
#where output.xlsx is the unified file

#This works FROM/TO the xlsx format. Libreoffice might help to convert from xls.
#localc --headless  --convert-to xlsx somefile.xls

import sys
from copy import copy

from openpyxl import load_workbook,Workbook

def createNewWorkbook(manyWb):
    for wb in manyWb:
        for sheetName in wb.sheetnames:
            o = theOne.create_sheet(sheetName)
            safeTitle = o.title
            copySheet(wb[sheetName],theOne[safeTitle])

def copySheet(sourceSheet,newSheet):
    for row in sourceSheet.rows:
        for cell in row:
            newCell = newSheet.cell(row=cell.row, column=cell.col_idx,
                    value= cell.value)
            if cell.has_style:
                newCell.font = copy(cell.font)
                newCell.border = copy(cell.border)
                newCell.fill = copy(cell.fill)
                newCell.number_format = copy(cell.number_format)
                newCell.protection = copy(cell.protection)
                newCell.alignment = copy(cell.alignment)

filesInput = sys.argv[1:]
theOneFile = filesInput.pop(-1)
myfriends = [ load_workbook(f) for f in filesInput ]

#try this if you are bored
#myfriends = [ openpyxl.load_workbook(f) for k in range(200) for f in filesInput ]

theOne = Workbook()
del theOne['Sheet'] #We want our new book to be empty. Thanks.
createNewWorkbook(myfriends)
theOne.save(theOneFile)

Tested with openpyxl 2.5.4, python 3.4.

pbarill
  • 640
  • 8
  • 16
1

You can simply use pandas and os library to do this.

import pandas as pd
import os
#create an empty dataframe which will have all the combined data
mergedData = pd.DataFrame()
for files in os.listdir():
    #make sure you are only reading excel files
    if files.endswith('.xlsx'):
        data = pd.read_excel(files, index_col=None)
        mergedData = mergedData.append(data)
        #move the files to other folder so that it does not process multiple times
        os.rename(files, 'path to some other folder')

mergedData DF will have all the combined data which you can export in a separate excel or csv file. Same code will work with csv files as well. just replace it in the IF condition

Dhruv Kadia
  • 71
  • 1
  • 12
0

Just to add to p_barill's answer, if you have custom column widths that you need to copy, you can add the following to the bottom of copySheet:

        for col in sourceSheet.column_dimensions:
            newSheet.column_dimensions[col] = sourceSheet.column_dimensions[col]

I would just post this in a comment on his or her answer but my reputation isn't high enough.

Scott Weaver
  • 51
  • 1
  • 2