57

There are many questions on this, but there has been no simple answer on how to read an xlsb file into pandas. Is there an easy way to do this?

Gayatri
  • 2,197
  • 4
  • 23
  • 35
  • No, I don't believe so. Look at this: https://github.com/pandas-dev/pandas/issues/8540. It's an open issue. You should look at converting it first, somehow. – cs95 Jul 10 '17 at 19:07
  • That looks like a pretty old answer there. Was wondering if anything was added into the pandas package recently – Gayatri Jul 11 '17 at 15:29
  • If the issue is still open, I'm afraid not :/ – cs95 Jul 11 '17 at 15:57
  • 1
    Yeah.The issue is still open.For now, I guess I will need to convert it manually to an xlsx file and then read. – Gayatri Jul 11 '17 at 21:07

5 Answers5

83

With the 1.0.0 release of pandas - January 29, 2020, support for binary Excel files was added.

import pandas as pd
df = pd.read_excel('path_to_file.xlsb', engine='pyxlsb')

Notes:

  • You will need to upgrade pandas - pip install pandas --upgrade
  • You will need to install pyxlsb - pip install pyxlsb
Glen Thompson
  • 9,071
  • 4
  • 54
  • 50
  • 1
    getting ValueError: Unknown engine: pyxlsb. is this engine now built into pandas or do I have to install and import pyxlsb separately? – BossRoyce Apr 16 '20 at 20:52
  • Need to install it `pip3 install pyxlsb` its not built in just supported ^^ Look at the notes in the answer. – Glen Thompson Apr 16 '20 at 21:16
  • installed and imported pyxlsb. still getting ValueError: Unknown engine: pyxlsb. is there a trick to importing it? – BossRoyce Apr 16 '20 at 21:36
  • What version of pandas do you have? `pd.show_versions()` You don't need to import it. My guess is that there is a mis match in what you installed and what you are running. e.g. did you install it in python2 and run python3 or vice versa? If you run `print(pd.show_versions())` it should tell you what you are executing version wise. – Glen Thompson Apr 16 '20 at 22:28
  • why my anaconda python 3's pandas can update to 0.25.1 only? – Erik Johnsson Jun 10 '20 at 13:10
35

Hi actually there is a way. Just use pyxlsb library.

import pandas as pd
from pyxlsb import open_workbook as open_xlsb

df = []

with open_xlsb('some.xlsb') as wb:
    with wb.get_sheet(1) as sheet:
        for row in sheet.rows():
            df.append([item.v for item in row])

df = pd.DataFrame(df[1:], columns=df[0])

UPDATE: as of pandas version 1.0 read_excel() now can read binary Excel (.xlsb) files by passing engine='pyxlsb'

Source: https://pandas.pydata.org/pandas-docs/version/1.0.0/whatsnew/v1.0.0.html

Finrod Felagund
  • 1,231
  • 2
  • 14
  • 18
  • I was looking for some function builtin within pandas which could do this. – Gayatri Mar 21 '18 at 16:02
  • 4
    Such does not exist for now. – Finrod Felagund Mar 22 '18 at 08:51
  • I tried this but instead of dates that look like a normal dates in excel file ("Feb-20"), I am getting some float numbers in Python like 32874.0. Any ideas on how to fix this? – user8436761 Feb 11 '19 at 14:04
  • Yes, Excel remembers dates as floats. Use pandas build in method ".to_datetime()". – Finrod Felagund Feb 13 '19 at 07:50
  • 1
    thank you ! About dates conversion - it seems Excel numerates dates by integers from 1900-01-01 minus 2 days. So standard to_datetime seems not work. – Alexander Chervov Mar 06 '19 at 09:17
  • Tried the above sample and got a warning about resource locks. `sys:1: ResourceWarning: unclosed file <_io.BufferedRandom name=4>`. openpyxl has a few similar issues posted about failing to properly release files on `__exit__`. Is this a known issue for pyxlsb? – rgk Mar 06 '19 at 14:55
  • As the error is due to "unclosed file", I would assume you'll need to kill excel from your task manager. Not the best practice, but I do not have a better idea. – Finrod Felagund Mar 06 '19 at 18:18
  • Thanks ! the benefit I am getting with your solution is that - I don't need to upgrade pandas in my legacy python system and I can still read the xlsb file. – abhijat_saxena Apr 08 '21 at 14:01
7

Pyxlsb indeed is an option to read xlsb file, however, is rather limited.

I suggest using the xlwings package which makes it possible to read and write xlsb files without losing sheet formating, formulas, etc. in the xlsb file. There is extensive documentation available.

import pandas as pd
import xlwings as xw

app = xw.App()
book = xw.Book('file.xlsb')
sheet = book.sheets('sheet_name')
df = sheet.range('A1').options(pd.DataFrame, expand='table').value
book.close()
app.kill()

'A1' in this case is the starting position of the excel table. To write to xlsb file, simply write:

sheet.range('A1').value = df
gmar
  • 71
  • 1
  • 3
  • 8
    This adds a major requirement: you have to have a running instance of Excel. This won't work on Linux machines. – Zev Jun 04 '19 at 20:31
0

If you want to read a big binary file or any excel file with some ranges you can directly put at this code

range = (your_index_number)
first_dataframe = []
second_dataframe = []
with open_xlsb('Test.xlsb') as wb:
    with wb.get_sheet('Sheet1') as sheet:
        i=0
        for row in sheet.rows():
            if(i!=range):
                first_dataframe.append([item.v for item in row])
                i=i+1
            else:
                second_dataframe.append([item.v for item in row])


first_dataframe = pd.DataFrame(first_dataframe[1:], columns=first[0])
second_dataframe = pd.DataFrame(second_dataframe[:], columns=first.columns)
0

To be able to read xlsb files, it is necessary to have openpyxl installed.

As per https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html#pandas.read_excel

engine: str, default None

If io is not a buffer or path, this must be set to identify io. Supported engines: “xlrd”, “openpyxl”, “odf”, “pyxlsb”. Engine compatibility :

“xlrd” supports old-style Excel files (.xls).

“openpyxl” supports newer Excel file formats.

“odf” supports OpenDocument file formats (.odf, .ods, .odt).

“pyxlsb” supports Binary Excel files.

Changed in version 1.2.0: The engine xlrd now only supports old-style .xls files. When engine=None, the following logic will be used to determine the engine:

If path_or_buffer is an OpenDocument format (.odf, .ods, .odt), then odf will be used.

Otherwise if path_or_buffer is an xls format, xlrd will be used.

Otherwise if openpyxl is installed, then openpyxl will be used.

Otherwise if xlrd >= 2.0 is installed, a ValueError will be raised.

Otherwise xlrd will be used and a FutureWarning will be raised. This case will raise a ValueError in a future version of pandas.

xlsb reading without index_col:

import pandas as pd

dfcluster = pd.read_excel('c:/xml/baseline/distribucion.xlsb', sheet_name='Cluster', index_col=0, engine='pyxlsb')
GERMAN RODRIGUEZ
  • 397
  • 1
  • 4
  • 9