0

I can download files:

seasons = [2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017]

epl_tables = {}
epl_seasons = {}
for year in seasons:
    start_year = str(year)[-2:]
    end_year = str(year+1)[-2:]
    season = start_year + end_year
    epl_seasons[season] = pd.read_csv("https://www.football-data.co.uk/mmz4281/{}{}/E0.csv".format(start_year, end_year)).dropna(how='all')
    epl_tables[season] = league(epl_seasons[season]) 

This works fine.

However when I try to add the 2004-05 season by adding 2004 to seasons, there is a problem and the code fails.

seasons = [2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017]

The problem is caused by a white space before the referee's name in rows 337 to 345 of the csv file.

I can work around by manually deleting the white space and then loading from disk this works but obviously it's not ideal.

I've tried various ways to get it to work such as shown below but nothing seems to work,

epl_seasons[season] = pd.read_csv("https://www.football-data.co.uk/mmz4281/{}{}/E0.csv".format(start_year, end_year), delimiter=',', encoding="utf-8", skipinitialspace=True).dropna(how='all')

A potential complication is that when I open the file in excel the space appears as a white space but then I open it in libreCalc (in Ubuntu which is what I'm working in), it appears as an unknown character a question mark in a black box tipped at 45 degrees. See the answer from PeterMau in the link below to see what this unknown character looks like.

https://ask.libreoffice.org/en/question/113125/characters-turned-into-question-marks/

Can someone please tell me who I can automatically remove these white spaces/ unknown characters?

Bazman
  • 2,058
  • 9
  • 45
  • 65
  • 1
    Since you're using Pandas, can you just isolate the row that's problematic (referee names column I'm guessing) and simply delete all the white space before the character in the CSV file before you try to import everything as a dictionary first? – Hofbr Dec 30 '19 at 16:03
  • That works as a manual work around but I'd like to automate the process. – Bazman Dec 30 '19 at 16:07
  • I offered my solution below. I don't think it's any more of a manual process. You'd just need a line after read in the CSV file with pandas. – Hofbr Dec 30 '19 at 16:16

2 Answers2

1

You can remove white space in a string with .str.strip():

epl_seasons[season]['COLUMN NAME'] = epl_seasons[season]['COLUMN NAME'].str.strip()

This shouldn't be a manual process. Just add a line so that when you import a CSV file you also cleanup the problematic column.

Obviously this only works for a specific column. Here's an answer on a different thread that addresses removing white space from every df cell:

Pythonic/efficient way to strip whitespace from every Pandas Data frame cell that has a stringlike object in it

Hofbr
  • 868
  • 9
  • 31
  • Tnx, so I need to download the file and then amend the referee column as shown above (a process which can obviously be automated.) I then import this cleaned up file into pandas. This should work fine it's just a pity there was not a 100% pandas solution to deal directly with unknown/white space characters in the download data. – Bazman Dec 30 '19 at 16:32
  • Yeah I think it should work. Though, I believe your epl_seasons[season] is a dataframe. the str.strip() is a Pandas Method so it's 100% pandas but it's just specific to whatever Column name (guessing a referees column is causing the problem). I'll amend my answer in case you'd rather do it to every column and not a single column – Hofbr Dec 30 '19 at 16:39
0

Just run the str.strip() function on my data columns and also passed the character causing the issue which is ? in my case, as follows:

df[newcol]= df[oldCol].str.strip('?')
pppery
  • 3,731
  • 22
  • 33
  • 46