Reading/parsing Excel (xls) files with Python

Question

What is the best way to read Excel (XLS) files with Python (not CSV files).

Is there a built-in package which is supported by default in Python to do this task?

I think we have a built-in-package in python `import openpyxl` I don't know in which version they have added — Azhar Uddin Sheikh, May 06 '22 at 09:47

score 110 · Answer 1 · edited Oct 31 '22 at 12:46

I highly recommend xlrd for reading .xls files. But there are some limitations(refer to xlrd github page):

Warning

This library will no longer read anything other than .xls files. For alternatives that read newer file formats, please see http://www.python-excel.org/.

The following are also not supported but will safely and reliably be ignored:
- Charts, Macros, Pictures, any other embedded object, including embedded worksheets.
- VBA modules
- Formulas, but results of formula calculations are extracted.
- Comments
- Hyperlinks
- Autofilters, advanced filters, pivot tables, conditional formatting, data validation
Password-protected files are not supported and cannot be read by this library.

voyager mentioned the use of COM automation. Having done this myself a few years ago, be warned that doing this is a real PITA. The number of caveats is huge and the documentation is lacking and annoying. I ran into many weird bugs and gotchas, some of which took many hours to figure out.

UPDATE:

For newer .xlsx files, the recommended library for reading and writing appears to be openpyxl (thanks, Ikar Pohorský).

For Excel 2007+ files (`.xlsx`) you would probably use [OpenPyXL](http://openpyxl.readthedocs.org/). — Ikar Pohorský, Aug 06 '15 at 12:01
A little late to the party, but do you have any suggestions for libraries to overwrite an .xls file and preserve macros/pictures? I created a solution using xlrd/xlwt/xlutils and didn't realize until the end that the macros/pictures were getting removed. I've used Openpyxl/XlsxWriter (for xlsx) in the past, but obviously none of these libraries are fitting the use case that I have. Does Pandas also do this since it uses the xlrd engine? — Michael Donahue, Feb 08 '22 at 17:15

score 64 · Answer 2 · edited Jul 19 '22 at 19:19

64

You can use pandas to do this, first install the required libraries:

$ pip install pandas openpyxl

See code below:

import pandas as pd

xls = pd.ExcelFile(r"yourfilename.xls") # use r before absolute file path 

sheetX = xls.parse(2) #2 is the sheet number+1 thus if the file has only 1 sheet write 0 in paranthesis

var1 = sheetX['ColumnName']

print(var1[1]) #1 is the row number...

edited Jul 19 '22 at 19:19

Skully

2,882
3
20
31

answered May 23 '17 at 04:04

borgomeister

996
6
13

4

pandas is using xlrd to do the reading; you will need to also install xlrd as a dependency – congusbongus Jul 16 '19 at 01:24
If xlrd is required as a dependency why not to use it directly? – Raz Apr 29 '21 at 12:48
1

As of 2022 it appears Pandas uses openpyxl, you'll need to pip install it: `ImportError: Missing optional dependency 'openpyxl'. Use pip or conda to install openpyxl.` – jlhasson Apr 22 '22 at 17:54

Somil · Answer 3 · 2019-08-08T13:31:29.993

35

You can choose any one of them http://www.python-excel.org/
I would recommended python xlrd library.

install it using

pip install xlrd

import using

import xlrd

to open a workbook

workbook = xlrd.open_workbook('your_file_name.xlsx')

open sheet by name

worksheet = workbook.sheet_by_name('Name of the Sheet')

open sheet by index

worksheet = workbook.sheet_by_index(0)

read cell value

worksheet.cell(0, 0).value

edited Aug 08 '19 at 13:31

answered Apr 06 '17 at 14:15

Somil

1,921
1
21
35

The "read cell value" does not work... it raises a TypeError: 'Sheet' object is not callable. All of the rest worked great. – Newbielp Jan 14 '20 at 09:01

Foad S. Farimani · Answer 4 · 2023-03-03T10:35:58.963

29

I think Pandas is the best way to go. There is already one answer here with Pandas using ExcelFile function, but it did not work properly for me. From here I found the read_excel function which works just fine:

import pandas as pd
dfs = pd.read_excel("your_file_name.xlsx", sheet_name="your_sheet_name")
print(dfs.head(10))

P.S. You need to have the xlrd installed for read_excel function to work

Update 21-03-2020: As you may see here, there are issues with the xlrd engine and it is going to be deprecated. The openpyxl is the best replacement. So as described here, the canonical syntax should be:

dfs = pd.read_excel("your_file_name.xlsx", sheet_name="your_sheet_name", engine="openpyxl")

Update 03-03-2023: There are now several other options available. For example the Polars library that is written in Rust:

import polars as pl
dfs = pl.read_excel("your_file_name.xlsx", sheet_name="your_sheet_name")

Feel free to also check the PyArrow and pyodbc libraries.

edited Mar 03 '23 at 10:35

answered Jun 12 '18 at 10:35

Foad S. Farimani

12,396
15
78
193

AttributeError: 'dict' object has no attribute 'head' – lopezdp Mar 21 '20 at 16:35
Question is clearly about xls files, so many upvotes on this answer make no sense to me rn. – Zircoz Sep 13 '22 at 08:51
@Zircoz how that should make a difference here? are any of my provided solutions not applicable to the `.xls` file format? – Foad S. Farimani Sep 14 '22 at 06:01

score 7 · Answer 5 · edited Jan 29 '20 at 18:52

For xlsx I like the solution posted earlier as https://web.archive.org/web/20180216070531/https://stackoverflow.com/questions/4371163/reading-xlsx-files-using-python. I uses modules from the standard library only.

def xlsx(fname):
    import zipfile
    from xml.etree.ElementTree import iterparse
    z = zipfile.ZipFile(fname)
    strings = [el.text for e, el in iterparse(z.open('xl/sharedStrings.xml')) if el.tag.endswith('}t')]
    rows = []
    row = {}
    value = ''
    for e, el in iterparse(z.open('xl/worksheets/sheet1.xml')):
        if el.tag.endswith('}v'):  # Example: <v>84</v>                            
            value = el.text
        if el.tag.endswith('}c'):  # Example: <c r="A3" t="s"><v>84</v></c>                                 
            if el.attrib.get('t') == 's':
                value = strings[int(value)]
            letter = el.attrib['r']  # Example: AZ22                         
            while letter[-1].isdigit():
                letter = letter[:-1]
            row[letter] = value
            value = ''
        if el.tag.endswith('}row'):
            rows.append(row)
            row = {}
    return rows

Improvements added are fetching content by sheet name, using re to get the column and checking if sharedstrings are used.

def xlsx(fname,sheet):
    import zipfile
    from xml.etree.ElementTree import iterparse
    import re
    z = zipfile.ZipFile(fname)
    if 'xl/sharedStrings.xml' in z.namelist():
        # Get shared strings
        strings = [element.text for event, element
                   in iterparse(z.open('xl/sharedStrings.xml')) 
                   if element.tag.endswith('}t')]
    sheetdict = { element.attrib['name']:element.attrib['sheetId'] for event,element in iterparse(z.open('xl/workbook.xml'))
                                      if element.tag.endswith('}sheet') }
    rows = []
    row = {}
    value = ''

    if sheet in sheets:
    sheetfile = 'xl/worksheets/sheet'+sheets[sheet]+'.xml'
    #print(sheet,sheetfile)
    for event, element in iterparse(z.open(sheetfile)):
        # get value or index to shared strings
        if element.tag.endswith('}v') or element.tag.endswith('}t'):
            value = element.text
        # If value is a shared string, use value as an index
        if element.tag.endswith('}c'):
            if element.attrib.get('t') == 's':
                value = strings[int(value)]
            # split the row/col information so that the row leter(s) can be separate
            letter = re.sub('\d','',element.attrib['r'])
            row[letter] = value
            value = ''
        if element.tag.endswith('}row'):
            rows.append(row)
            row = {}

    return rows

the second function has a mistake: it creates the 'sheetdict' dictionary, but later tries to read a 'sheets' dictionary instead. Also, add a tab after 'if sheet in sheets:' — Dmytro Gierman, Nov 25 '21 at 10:44

score 4 · Answer 6 · answered Nov 17 '19 at 05:15

If you need old XLS format. Below code for ansii 'cp1251'.

import xlrd

file=u'C:/Landau/task/6200.xlsx'

try:
    book = xlrd.open_workbook(file,encoding_override="cp1251")  
except:
    book = xlrd.open_workbook(file)
print("The number of worksheets is {0}".format(book.nsheets))
print("Worksheet name(s): {0}".format(book.sheet_names()))
sh = book.sheet_by_index(0)
print("{0} {1} {2}".format(sh.name, sh.nrows, sh.ncols))
print("Cell D30 is {0}".format(sh.cell_value(rowx=29, colx=3)))
for rx in range(sh.nrows):
   print(sh.row(rx))

on the last line you can `output values` with this line `print(sh.row_values(rx))` — oruchkin, Oct 17 '22 at 13:09

score 2 · Answer 7 · answered Aug 12 '20 at 06:06

2

For older .xls files, you can use xlrd

either you can use xlrd directly by importing it. Like below

import xlrd
wb = xlrd.open_workbook(file_name)

Or you can also use pandas pd.read_excel() method, but do not forget to specify the engine, though the default is xlrd, it has to be specified.

pd.read_excel(file_name, engine = xlrd)

Both of them work for older .xls file formats. Infact I came across this when I used OpenPyXL, i got the below error

InvalidFileException: openpyxl does not support the old .xls file format, please use xlrd to read this file, or convert it to the more recent .xlsx file format.

answered Aug 12 '20 at 06:06

Deepak Harish

123
2
7

what does `pd` mean? – oruchkin Oct 17 '22 at 13:00
@oruchkin `pd` here means importing `pandas`. so once u did `import pandas as pd`. I skipped it, as its unofficially std of importing pandas & understandable. you can use pandas library and its huge functions, here `read_excel` , is part of that library. – Deepak Harish Oct 26 '22 at 09:57

score 1 · Answer 8 · answered Nov 25 '12 at 21:43

1

You might also consider running the (non-python) program xls2csv. Feed it an xls file, and you should get back a csv.

answered Nov 25 '12 at 21:43

moi

53
1

3

But the poster says he needs to read into Python... Are you suggesting running `xls2csv`, then parsing the `csv` from Python? – hcarver Nov 25 '12 at 22:01
Python-excelerator contains an executable py_xls2csv wrapper around a python converter. – fatal_error Apr 08 '15 at 22:10

score 1 · Answer 9 · edited May 23 '17 at 12:10

1

You can use any of the libraries listed here (like Pyxlreader that is based on JExcelApi, or xlwt), plus COM automation to use Excel itself for the reading of the files, but for that you are introducing Office as a dependency of your software, which might not be always an option.

edited May 23 '17 at 12:10

Community

1
1

answered May 31 '10 at 10:46

Esteban Küber

36,388
15
79
97

7

(1) pyxlreader is the absolute pox. You must not have ever tried it. See my comments here: http://stackoverflow.com/questions/1243545/programmatically-extract-data-from-an-excel-spreadsheet (2) `xlwt` WriTes files; use `xlrd` to ReaD files. – John Machin May 31 '10 at 23:30

score 1 · Answer 10 · answered Apr 08 '15 at 22:11

1

Python Excelerator handles this task as well. http://ghantoos.org/2007/10/25/python-pyexcelerator-small-howto/

It's also available in Debian and Ubuntu:

 sudo apt-get install python-excelerator

answered Apr 08 '15 at 22:11

fatal_error

5,457
2
18
18

score 1 · Answer 11 · answered Jun 19 '20 at 12:26

    with open(csv_filename) as file:
        data = file.read()

    with open(xl_file_name, 'w') as file:
        file.write(data)

You can turn CSV to excel like above with inbuilt packages. CSV can be handled with an inbuilt package of dictreader and dictwriter which will work the same way as python dictionary works. which makes it a ton easy I am currently unaware of any inbuilt packages for excel but I had come across openpyxl. It was also pretty straight forward and simple You can see the code snippet below hope this helps

    import openpyxl
    book = openpyxl.load_workbook(filename)
    sheet = book.active 
    result =sheet['AP2']
    print(result.value)

score 0 · Answer 12 · answered Sep 18 '13 at 20:35

0

For older Excel files there is the OleFileIO_PL module that can read the OLE structured storage format used.

answered Sep 18 '13 at 20:35

Gavin Smith

3,076
1
19
25

score 0 · Answer 13 · answered Dec 14 '20 at 21:16

0

If the file is really an old .xls, this works for me on python3 just using base open() and pandas:

df = pandas.read_csv(open(f, encoding = 'UTF-8'), sep='\t')

Note that the file I'm using is tab delimited. less or a text editor should be able to read .xls so that you can sniff out the delimiter.

I did not have a lot of luck with xlrd because of – I think – UTF-8 issues.

answered Dec 14 '20 at 21:16

J. Lucas McKay

29
2

If the above works for you, you do not have an Excel file but a tab-separated text file, sometimes known as a TSV file. As such, xlrd will not open it. – Chris Withers Dec 14 '20 at 22:29

Reading/parsing Excel (xls) files with Python

13 Answers13

UPDATE:

Linked

Related