I am trying to read a CSV file from a given URL using Python 3.

import pandas as pd
url = 'https://www.hkex.com.hk/eng/dwrc/search/dwFullList.csv'
url_2 = 'https://www.cboe.com/us/options/symboldir/equity_index_options/?download=csv'
df = pd.read_csv(url)    # raises UnicodeDecodeError
df = pd.read_csv(url_2)  # works: returns a DataFrame

When I run df = pd.read_csv(url), I get:

File "pandas\_libs\parsers.pyx", line 537, in pandas._libs.parsers.TextReader.__cinit__
File "pandas\_libs\parsers.pyx", line 740, in pandas._libs.parsers.TextReader._get_header
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

However, df = pd.read_csv(url_2) returns the DataFrame as expected. How can I solve this problem? I am using Python 3.7.

John Conde

2 Answers

This is caused by an unexpected header. If you look at the file, the first line is an 'updated' line that is not part of the CSV data. You should therefore pass skiprows to read_csv.

df = pd.read_csv(url, skiprows=[1])

See also this other question.

My initial answer assumed no header was present, but it turns out it was shifted 1 line down.
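As a side note on how skiprows is interpreted: an integer skips that many lines from the top, while a list skips the given 0-based line indices. A minimal sketch, using made-up in-memory data (the 'Updated' line and the column names col_a and col_b are assumptions, not the real file contents):

```python
import io

import pandas as pd

# Hypothetical sample: one non-CSV line, then the real header, then data.
sample = "Updated: 05/05/2021\ncol_a,col_b\n1,2\n3,4\n"

# skiprows=1 skips the first line, so 'col_a,col_b' becomes the header.
df_int = pd.read_csv(io.StringIO(sample), skiprows=1)

# skiprows=[0] is the list form of the same thing (0-based line index).
df_list0 = pd.read_csv(io.StringIO(sample), skiprows=[0])

# skiprows=[1] skips the second line instead, so the real header is dropped
# and the 'Updated' line is used as the header.
df_list1 = pd.read_csv(io.StringIO(sample), skiprows=[1])

print(list(df_int.columns))
print(list(df_list1.columns))
```

So if the extra line is the very first line of the file, skiprows=1 or skiprows=[0] is the form that removes it.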

fravolt
  • I have tried skiprows = [1] and header = None. Neither works; both raise [pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3] – python_learner May 05 '21 at 13:13
  • It looks like it may actually be a .tsv file (tab-separated values), despite being marked csv. Does it work if you use `sep='\t'` as an argument? – fravolt May 05 '21 at 13:31

First, let's understand the error. The error you are facing is:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

  • Note that the error type is UnicodeDecodeError, raised on the byte 0xff.

Why did this error occur, and how can it be resolved?

By default, pd.read_csv() decodes the data with encoding = 'utf-8', and decoding fails on the byte 0xff. 0xff is a number written in hexadecimal (base 16): two f digits, each equal to 1111 in binary, so the byte is 11111111. No valid UTF-8 sequence can start with that byte. A file that begins with 0xff is typically UTF-16 encoded, since the UTF-16 little-endian byte order mark is the byte pair 0xff 0xfe.

  • Solution: use encoding = 'utf-16' when fetching the data.
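To see why a leading 0xff points to UTF-16, here is a small offline sketch. The sample bytes are an assumption built for illustration, not the contents of the real file; they start with the UTF-16 little-endian byte order mark, which would produce the same error message:

```python
import codecs

# Hypothetical sample: the text "a\tb" saved as UTF-16 LE with a byte order mark.
raw = codecs.BOM_UTF16_LE + "a\tb".encode("utf-16-le")

# The first two bytes are the BOM, 0xff 0xfe.
print(raw[:2])

# Decoding as UTF-8 fails on the very first byte.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc
    print(exc)

# Decoding as UTF-16 consumes the BOM and returns the original text.
decoded = raw.decode("utf-16")
print(repr(decoded))
```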

After that, you may hit Error tokenizing data. C error: Expected 1 fields in line 3, saw 3. This happens because the file is tab-separated and has extra lines before the header and after the data. The solution is given below:

# Import pandas
import pandas as pd

# Fetch the CSV data from the URL and store it in df
url = 'https://www.hkex.com.hk/eng/dwrc/search/dwFullList.csv'
# on_bad_lines='skip' needs pandas >= 1.3; on older versions use error_bad_lines=False
df = pd.read_csv(url, encoding='utf-16', sep='\t', on_bad_lines='skip',
                 skiprows=1, skipfooter=3, engine='python')

# Print the first few records of df
df.head()

The call above returns the desired DataFrame. For more detail, see the pandas documentation for pd.read_csv() and the list of standard encodings in the Python codecs documentation.

Hope this solution helps.
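If you want to check these read_csv options without hitting the network, the sketch below builds a small stand-in file in memory. Its layout (one preamble line, a tab-separated table, three footer lines) and all column names and values are assumptions made for illustration; the real file may differ:

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded file.
text = (
    "Updated: 05/05/2021\n"       # preamble line, removed by skiprows=1
    "code\tissuer\tprice\n"       # real header
    "10001\tAA\t0.25\n"
    "10002\tBB\t0.31\n"
    "Footer note 1\n"             # three footer lines, removed by skipfooter=3
    "Footer note 2\n"
    "Footer note 3\n"
)

# str.encode("utf-16") prepends a byte order mark, like the file in the question.
raw = text.encode("utf-16")

df_demo = pd.read_csv(
    io.BytesIO(raw),
    encoding="utf-16",   # decode UTF-16 instead of the default UTF-8
    sep="\t",            # fields are tab-separated
    skiprows=1,          # drop the preamble line
    skipfooter=3,        # drop the trailing footer lines
    engine="python",     # skipfooter is only supported by the python engine
)

print(df_demo)
```

If the real file has a different number of preamble or footer lines, skiprows and skipfooter need to be adjusted to match.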

Jay Patel