I'm having problems reading in the following three seasons of data (all the seasons after these ones load without problem).

import pandas as pd
import itertools

# Build placeholder column names ('aaa', 'aab', ...) and keep the first 57
alphabets = ['a', 'b', 'c', 'd']
keywords = [''.join(i) for i in itertools.product(alphabets, repeat=3)]
col_names = keywords[:57]

seasons = [2002, 2003, 2004]

for season in seasons:
    # Season 2002 maps to the folder "0203", 2003 to "0304", and so on
    url = "https://www.football-data.co.uk/mmz4281/{}{}/E0.csv".format(str(season)[-2:], str(season+1)[-2:])
    df = pd.read_csv(url, names=col_names).dropna(how='all')

This gives the following error:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 57 fields in line 337, saw 62

I have looked on Stack Overflow for questions with a similar error (see below), but none seem to offer a solution that fits my problem.

Python Pandas Error tokenizing data

I'm pretty sure the error is caused by missing data in the last column, but I don't know how to fix it. Can someone please explain how to do this?

Thanks

Baz

UPDATE:

The amended code now works for seasons 2002 and 2003. However 2004 is now producing a new error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 0: invalid start byte

Following the answer below from Serge Ballesta option 2:

UnicodeDecodeError when reading CSV file in Pandas with Python

df = pd.read_csv("https://www.football-data.co.uk/mmz4281/{}{}/E0.csv".format(str(season)[-2:], str(season+1)[-2:]), names=col_names, encoding = "latin1").dropna(how='all')

With the above amendment the code also works for season=2004.

I still have two questions though:

Q1.) How can I find which character(s) were causing the problem in season 2004?

Q2.) Is it safe to use the 'latin1' encoding for every season, even though they were originally encoded as 'utf-8'?
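
For Q1, here is a minimal sketch of one possible approach: decode the raw bytes as UTF-8, catch the UnicodeDecodeError, and record where decoding failed. The sample bytes are made up for illustration (they are not the real E0.csv contents), and find_non_utf8 is a hypothetical helper name, not a pandas function. On Q2, latin1 assigns a character to every possible byte value, so decoding with it never raises; the trade-off is that a file that really is UTF-8 and contains non-ASCII characters would be decoded silently into the wrong characters.

```python
def find_non_utf8(raw: bytes, context: int = 10):
    """Return (offset, byte value, nearby bytes) for each position where UTF-8 decoding fails."""
    problems = []
    pos = 0
    while pos < len(raw):
        try:
            raw[pos:].decode("utf-8")
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as err:
            bad = pos + err.start  # err.start is relative to the slice
            nearby = raw[max(0, bad - context):bad + context]
            problems.append((bad, raw[bad], nearby))
            pos = bad + 1  # skip the bad byte and keep scanning
    return problems


# Made-up sample: a stray 0xa0 byte at the start of the second line
sample = b"Div,Date,HomeTeam\n\xa0E0,14/08/04,Arsenal\n"

problems = find_non_utf8(sample)
for offset, value, nearby in problems:
    print(offset, hex(value), nearby)

# latin1 accepts the same bytes; 0xa0 becomes a non-breaking space (U+00A0)
decoded = sample.decode("latin1")
```

To run this against a real season, the raw bytes would have to be downloaded first (for example with urllib.request.urlopen(url).read()) and passed in place of sample.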

Bazman
  • This error happens when the first row of the dataset has, say, 5 columns and a later row has 7, so the parser cannot tell how many columns there actually are. What you could do is name all the columns first and then import the data as a dataframe. – aayush_malik Nov 20 '19 at 07:24
  • Hi there, please see the amended code and update section of the question above. – Bazman Nov 20 '19 at 13:44
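
The behaviour described in the first comment can be reproduced with a small in-memory CSV. This is only a sketch with made-up data: the first row has 3 fields and a later row has 5, which raises the same kind of ParserError. Passing names helps only if the list is at least as long as the widest row, which would also explain why 57 names still failed on a row with 62 fields. The column names c0 to c4 are placeholders.

```python
import io

import pandas as pd

# Made-up data: first row has 3 fields, last row has 5
ragged = "a,b,c\n1,2,3\n4,5,6,7,8\n"

# Without names, the column count is inferred from the first row
try:
    pd.read_csv(io.StringIO(ragged))
    error = None
except pd.errors.ParserError as err:
    error = str(err)
print(error)

# With enough names for the widest row, shorter rows are padded with NaN.
# The first row is read as data here because names replaces the header.
names = ["c{}".format(i) for i in range(5)]
df = pd.read_csv(io.StringIO(ragged), names=names, header=None)
print(df.shape)
```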

0 Answers