1

The file is supposed to have thousands number of rows. But using below it only returns the first couple of rows in dataframe

File https://www.hkex.com.hk/eng/services/trading/securities/securitieslists/ListOfSecurities.xlsx

Failed example

import pandas as pd

url = 'https://www.hkex.com.hk/eng/services/trading/securities/securitieslists/ListOfSecurities.xlsx'
df = pd.read_excel(url, engine='openpyxl', header=2, usecols='A:D', verbose=True)
print(df.shape)
# output - only 5 rows
Reading sheet 0
(5, 4)

Working example

Same file. Downloaded it first, opened up in Excel, modifed a text and saved (didn't change format and keep xlsx) and then use read_excel() to open from file

url = 'https://www.hkex.com.hk/eng/services/trading/securities/securitieslists/ListOfSecurities.xlsx'
path = os.path.join(os.path.dirname(__file__), 'download')
wget.download(url, out=path)
file = os.path.join(path, 'ListOfSecurities.xlsx')

# open to edit and then save in Excel

df = pd.read_excel(file, engine='openpyxl', header=2, usecols='A:D', verbose=True)
print(df.shape)
# output
Reading sheet 0
(17490, 4)
memento
  • 13
  • 3
  • The problem might be caused by the data being used. You should include some sample data so helpers replicate the issue. – Gi1ber7 Dec 24 '20 at 01:46
  • downloaded and noticed the same issue. shape (5, 4). data format issue. "Spread Table" and "Board Lot" columns issue – Aaj Kaal Dec 24 '20 at 02:25
  • But this is still valid excel format. So looking for a solution here – memento Dec 24 '20 at 02:39

1 Answers1

0

UPDATED: Changes to code based on context that xlrd isn't viable to use

import pandas as pd
import os 
import wget

url = 'https://www.hkex.com.hk/eng/services/trading/securities/securitieslists/ListOfSecurities.xlsx'
path = os.path.join(os.path.dirname(__file__), 'download')
wget.download(url, out=path)
filename = os.path.join(path, 'ListOfSecurities.xlsx')

from openpyxl import load_workbook

excel_file = load_workbook(filename)
sheet = excel_file["ListOfSecurities"]
sheet.delete_cols(5,21) # Use only Cols A:D

data = sheet.values
cols = next(data) # Skip row 0
cols = next(data) # Skip row 1
cols = next(data)[0:4] # Cols A:D


df = pd.DataFrame(data, columns=cols)

print(df.shape)

I changed the excel engine to use the default one (xlrd) in pandas and the following code worked.

import pandas as pd
import os 
import wget

url = 'https://www.hkex.com.hk/eng/services/trading/securities/securitieslists/ListOfSecurities.xlsx'
path = os.path.join(os.path.dirname(__file__), 'download')
wget.download(url, out=path)
filename = os.path.join(path, 'ListOfSecurities.xlsx')

df = pd.read_excel(filename, header=2, usecols='A:D', verbose=True)
print(df.shape)

One inconsistency in the output is that it shows 4 fewer rows:

Reading sheet 0
(17486, 4)
Aasim Sani
  • 339
  • 2
  • 6
  • Thanks Aasim. However the version of xlrd I am using does not support xlsx file. xlrd.biffh.XLRDError: Excel xlsx file; not supported. xlrd == 2.0.1 pandas == 1.1.5 https://stackoverflow.com/questions/65254535/xlrd-biffh-xlrderror-excel-xlsx-file-not-supported – memento Dec 24 '20 at 02:00
  • What version of Pandas are you using? – AMC Dec 24 '20 at 02:36
  • pandas == 1.1.5 – memento Dec 24 '20 at 02:39
  • @memento Made changes to the answer to use ```openpyxl``` itself to open the file and then load it into ```pandas```. Let me know if that works. – Aasim Sani Dec 24 '20 at 02:45
  • works perfectly, thanks – memento Dec 24 '20 at 03:33