I'm having problems reading in the following three seasons of data (all the seasons after these ones load without a problem).
import pandas as pd
import itertools
alphabets = ['a','b', 'c', 'd']
keywords = [''.join(i) for i in itertools.product(alphabets, repeat = 3)]
col_names = keywords[:57]
seasons = [2002, 2003, 2004]
for season in seasons:
    df = pd.read_csv("https://www.football-data.co.uk/mmz4281/{}{}/E0.csv".format(str(season)[-2:], str(season+1)[-2:]), names=col_names).dropna(how='all')
This gives the following error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 57 fields in line 337, saw 62
I have looked on Stack Overflow for questions with a similar error message (see below), but none seem to offer a solution that fits my problem.
Python Pandas Error tokenizing data
I'm pretty sure the error is caused by missing data in the last column; however, I don't know how to fix it. Can someone please explain how to do this?
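For what it's worth, here is a rough sketch of how I imagine the raw rows could be inspected, to confirm which lines have a different field count (the 2002-03 URL is hard-coded as an example; csv and urllib are standard library):

# Rough sketch, not a fix: count the fields in each raw row to see
# where the column count differs from the 57 names I supply.
import csv
import urllib.request

url = "https://www.football-data.co.uk/mmz4281/0203/E0.csv"
with urllib.request.urlopen(url) as resp:
    # decode leniently, purely for inspection
    text = resp.read().decode("utf-8", errors="replace")

for line_no, row in enumerate(csv.reader(text.splitlines()), start=1):
    if len(row) != 57:
        print("line {}: {} fields".format(line_no, len(row)))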
Thanks
Baz
UPDATE:
The amended code now works for seasons 2002 and 2003. However, season 2004 now produces a new error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 0: invalid start byte
Following option 2 from Serge Ballesta's answer to the question below:
UnicodeDecodeError when reading CSV file in Pandas with Python
df = pd.read_csv("https://www.football-data.co.uk/mmz4281/{}{}/E0.csv".format(str(season)[-2:], str(season+1)[-2:]), names=col_names, encoding="latin1").dropna(how='all')
With the above amendment the code also works for season=2004.
I still have two questions though:
Q1.) How can I find which character/s were causing the problem in season 2004? (I've included a rough sketch of an attempt after these questions.)
Q2.) Is it safe to use the 'latin1' encoding for every season, even though they were originally encoded as 'utf-8'?
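For Q1, I imagine something along these lines could locate the offending byte, though I'm not sure it's the right approach (rough sketch using only the standard library, with the 2004-05 URL hard-coded):

# Fetch the raw bytes and let UnicodeDecodeError report exactly
# which byte breaks UTF-8 decoding.
import urllib.request

url = "https://www.football-data.co.uk/mmz4281/0405/E0.csv"
raw = urllib.request.urlopen(url).read()
try:
    raw.decode("utf-8")
    print("decodes cleanly as utf-8")
except UnicodeDecodeError as exc:
    # exc.start/exc.end give the byte offsets of the offending sequence
    print("bad byte(s) {!r} at offset {}".format(raw[exc.start:exc.end], exc.start))
    print("context:", raw[max(0, exc.start - 20):exc.start + 20])

And for Q2, one idea I'm considering (read_season is a hypothetical helper name, not something from my current code) is to try utf-8 first and only fall back to latin1 when decoding fails, so seasons that really are utf-8 keep their original decoding:

# latin1 maps all 256 byte values, so the fallback itself cannot raise.
import io
import urllib.request
import pandas as pd

def read_season(url, names):
    raw = urllib.request.urlopen(url).read()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode("latin1")
    return pd.read_csv(io.StringIO(text), names=names).dropna(how='all')

Would either of these be a sensible way to go about it?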