I'm having problems reading in the following three seasons of data (all the seasons after these ones load without a problem).
import pandas as pd
import itertools
alphabets = ['a','b', 'c', 'd']
keywords = [''.join(i) for i in itertools.product(alphabets, repeat = 3)]
col_names = keywords[:57]
seasons = [2002, 2003, 2004]
for season in seasons:
    df = pd.read_csv("https://www.football-data.co.uk/mmz4281/{}{}/E0.csv".format(str(season)[-2:], str(season+1)[-2:]), names=col_names).dropna(how='all')
This gives the following error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 57 fields in line 337, saw 62
I have looked on Stack Overflow for questions with a similar error message (see below), but none seem to offer a solution that fits my problem.
Python Pandas Error tokenizing data
I'm pretty sure the error is caused by missing data in the last column; however, I don't know how to fix it. Can someone please explain how to do this?
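For what it's worth, here is a rough sketch of how I imagine the raw rows could be inspected, to confirm which lines have a different field count (the 2002-03 URL is hard-coded as an example; csv and urllib are standard library):

# Rough sketch, not a fix: count the fields in each raw row to see
# where the column count differs from the 57 names I supply.
import csv
import urllib.request

url = "https://www.football-data.co.uk/mmz4281/0203/E0.csv"
with urllib.request.urlopen(url) as resp:
    # decode leniently, purely for inspection
    text = resp.read().decode("utf-8", errors="replace")

for line_no, row in enumerate(csv.reader(text.splitlines()), start=1):
    if len(row) != 57:
        print("line {}: {} fields".format(line_no, len(row)))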
Thanks
Baz
UPDATE:
The amended code now works for seasons 2002 and 2003. However, season 2004 now produces a new error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 0: invalid start byte
Following option 2 from Serge Ballesta's answer to the question below:
UnicodeDecodeError when reading CSV file in Pandas with Python
df = pd.read_csv("https://www.football-data.co.uk/mmz4281/{}{}/E0.csv".format(str(season)[-2:], str(season+1)[-2:]), names=col_names, encoding="latin1").dropna(how='all')
With the above amendment the code also works for season=2004.
I still have two questions though:
Q1.) How can I find which character/s were causing the problem in season 2004? (I've included a rough sketch of an attempt after these questions.)
Q2.) Is it safe to use the 'latin1' encoding for every season, even though they were originally encoded as 'utf-8'?
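For Q1, I imagine something along these lines could locate the offending byte, though I'm not sure it's the right approach (rough sketch using only the standard library, with the 2004-05 URL hard-coded):

# Fetch the raw bytes and let UnicodeDecodeError report exactly
# which byte breaks UTF-8 decoding.
import urllib.request

url = "https://www.football-data.co.uk/mmz4281/0405/E0.csv"
raw = urllib.request.urlopen(url).read()
try:
    raw.decode("utf-8")
    print("decodes cleanly as utf-8")
except UnicodeDecodeError as exc:
    # exc.start/exc.end give the byte offsets of the offending sequence
    print("bad byte(s) {!r} at offset {}".format(raw[exc.start:exc.end], exc.start))
    print("context:", raw[max(0, exc.start - 20):exc.start + 20])

And for Q2, one idea I'm considering (read_season is a hypothetical helper name, not something from my current code) is to try utf-8 first and only fall back to latin1 when decoding fails, so seasons that really are utf-8 keep their original decoding:

# latin1 maps all 256 byte values, so the fallback itself cannot raise.
import io
import urllib.request
import pandas as pd

def read_season(url, names):
    raw = urllib.request.urlopen(url).read()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode("latin1")
    return pd.read_csv(io.StringIO(text), names=names).dropna(how='all')

Would either of these be a sensible way to go about it?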