Failed to download full rows using Pandas read_excel() for xlsx file

Question

The file is supposed to have thousands number of rows. But using below it only returns the first couple of rows in dataframe

File https://www.hkex.com.hk/eng/services/trading/securities/securitieslists/ListOfSecurities.xlsx

Failed example

import pandas as pd

url = 'https://www.hkex.com.hk/eng/services/trading/securities/securitieslists/ListOfSecurities.xlsx'
df = pd.read_excel(url, engine='openpyxl', header=2, usecols='A:D', verbose=True)
print(df.shape)

# output - only 5 rows
Reading sheet 0
(5, 4)

Working example

Same file. Downloaded it first, opened up in Excel, modifed a text and saved (didn't change format and keep xlsx) and then use read_excel() to open from file

url = 'https://www.hkex.com.hk/eng/services/trading/securities/securitieslists/ListOfSecurities.xlsx'
path = os.path.join(os.path.dirname(__file__), 'download')
wget.download(url, out=path)
file = os.path.join(path, 'ListOfSecurities.xlsx')

# open to edit and then save in Excel

df = pd.read_excel(file, engine='openpyxl', header=2, usecols='A:D', verbose=True)
print(df.shape)

# output
Reading sheet 0
(17490, 4)

The problem might be caused by the data being used. You should include some sample data so helpers replicate the issue. — Gi1ber7, Dec 24 '20 at 01:46
downloaded and noticed the same issue. shape (5, 4). data format issue. "Spread Table" and "Board Lot" columns issue — Aaj Kaal, Dec 24 '20 at 02:25
But this is still valid excel format. So looking for a solution here — memento, Dec 24 '20 at 02:39

Aasim Sani · Accepted Answer · 2020-12-24T02:46:32.980

0

UPDATED: Changes to code based on context that xlrd isn't viable to use

import pandas as pd
import os 
import wget

url = 'https://www.hkex.com.hk/eng/services/trading/securities/securitieslists/ListOfSecurities.xlsx'
path = os.path.join(os.path.dirname(__file__), 'download')
wget.download(url, out=path)
filename = os.path.join(path, 'ListOfSecurities.xlsx')

from openpyxl import load_workbook

excel_file = load_workbook(filename)
sheet = excel_file["ListOfSecurities"]
sheet.delete_cols(5,21) # Use only Cols A:D

data = sheet.values
cols = next(data) # Skip row 0
cols = next(data) # Skip row 1
cols = next(data)[0:4] # Cols A:D


df = pd.DataFrame(data, columns=cols)

print(df.shape)

I changed the excel engine to use the default one (xlrd) in pandas and the following code worked.

import pandas as pd
import os 
import wget

url = 'https://www.hkex.com.hk/eng/services/trading/securities/securitieslists/ListOfSecurities.xlsx'
path = os.path.join(os.path.dirname(__file__), 'download')
wget.download(url, out=path)
filename = os.path.join(path, 'ListOfSecurities.xlsx')

df = pd.read_excel(filename, header=2, usecols='A:D', verbose=True)
print(df.shape)

One inconsistency in the output is that it shows 4 fewer rows:

Reading sheet 0
(17486, 4)

edited Dec 24 '20 at 02:46

answered Dec 24 '20 at 01:53

Aasim Sani

339
2
6

Thanks Aasim. However the version of xlrd I am using does not support xlsx file. xlrd.biffh.XLRDError: Excel xlsx file; not supported. xlrd == 2.0.1 pandas == 1.1.5 https://stackoverflow.com/questions/65254535/xlrd-biffh-xlrderror-excel-xlsx-file-not-supported – memento Dec 24 '20 at 02:00
What version of Pandas are you using? – AMC Dec 24 '20 at 02:36
pandas == 1.1.5 – memento Dec 24 '20 at 02:39
@memento Made changes to the answer to use ```openpyxl``` itself to open the file and then load it into ```pandas```. Let me know if that works. – Aasim Sani Dec 24 '20 at 02:45
works perfectly, thanks – memento Dec 24 '20 at 03:33

Failed to download full rows using Pandas read_excel() for xlsx file

1 Answers1