I am trying to cleanup a dataframe (and the code) in Pandas.
I used to use the modin
library before but I have since switched from Win 10 to Win 11 and none of ray
and dask
libraries work on Windows 11 and I have raised an issue with respective repositories.
In the meanwhile, I am intending to use multiprocessing
as an alternative to the cause however, I think there can be a better manner in how the code can be written.
I intend to use the
import pandas as pd
import multiprocessing as mp
def cleanup(x):
clean df
return x
df = df.apply(cleanup)
way.
example:
df.head(20):
Unnamed: 0 game score home_odds draw_odds away_odds country league datetime
0 0 Sport Recife - Imperatriz 2:2 1.36 4.31 7.66 Brazil Copa do Nordeste 2020 2020-02-07 00:00:00
1 1 ABC - America RN 2:1 2.62 3.30 2.48 Brazil Copa do Nordeste 2020 2020-02-02 22:00:00
2 2 Frei Paulistano - Nautico 0:2 5.19 3.58 1.62 Brazil Copa do Nordeste 2020 2020-02-02 00:00:00
3 3 Botafogo PB - Confianca 1:1 2.06 3.16 3.5 Brazil Copa do Nordeste 2020 2020-02-02 22:00:00
4 4 Fortaleza - Ceara 1:1 2.19 2.98 3.38 Brazil Copa do Nordeste 2020 2020-02-02 22:00:00
5 5 River-PI - CSA 3:1 3.41 3.07 2.1 Brazil Copa do Nordeste 2020 2020-02-02 20:00:00
6 6 Sport Recife - Vitoria 1:1 1.92 3.20 3.93 Brazil Copa do Nordeste 2020 2020-02-02 20:00:00
7 7 CRB - Santa Cruz 1:0 2.22 3.06 3.18 Brazil Copa do Nordeste 2020 2020-01-29 23:30:00
8 8 Bahia - Imperatriz 2:0 1.31 4.75 8.55 Brazil Copa do Nordeste 2020 2020-01-29 00:00:00
9 9 Ceara - Frei Paulistano 2:2 1.2 5.43 12.94 Brazil Copa do Nordeste 2020 2020-01-26 22:00:00
10 10 Confianca - ABC 1:0 2.09 3.01 3.58 Brazil Copa do Nordeste 2020 2020-01-26 00:00:00
11 11 America RN - Botafogo PB 0:0 2.14 3.30 3.11 Brazil Copa do Nordeste 2020 2020-01-26 22:00:00
12 12 CSA - Sport Recife 0:1 2.63 3.04 2.62 Brazil Copa do Nordeste 2020 2020-01-26 22:00:00
13 13 Santa Cruz - Bahia 0:0 3.72 3.22 1.96 Brazil Copa do Nordeste 2020 2020-01-26 20:00:00
14 14 Vitoria - Fortaleza 0:0 3.13 3.02 2.27 Brazil Copa do Nordeste 2020 2020-01-26 20:00:00
15 15 Nautico - River-PI 1:1 1.44 4.06 6.19 Brazil Copa do Nordeste 2020 2020-01-23 23:30:00
16 16 Imperatriz - CRB 2:1 2.34 2.94 3.08 Brazil Copa do Nordeste 2020 2020-01-21 23:00:00
17 17 Nautico - Campinense 2:0 1.4 4.18 7.26 Brazil Copa do Nordeste 2020 2019-05-16 00:30:00
18 18 CRB - Juazeirense 2:1 1.4 4.19 7.32 Brazil Copa do Nordeste 2020 2019-05-14 23:00:00
19 19 ABC - AE Altos 2:0 1.44 3.83 7.02 Brazil Copa do Nordeste 2020 2019-05-08 22:30:00
current code:
import pandas as pd
df = pd.read_csv()
df['game'] = df['game'].astype(str).str.replace('(\(\w+\))', '', regex=True)
df['league'] = df['league'].astype(str).str.replace('(\s\d+\S\d+)$', '', regex=True)
df['league'] = df['league'].astype(str).str.replace('é', 'e', regex=True)
df['game'] = df['game'].astype(str).str.replace('(\s\d+\S\d+)$', '', regex=True)
df[['home_team', 'away_team']] = df['game'].str.split(' - ', expand=True, n=1)
df[['home_score', 'away_score']] = df['score'].str.split(':', expand=True)
df['away_score'] = df['away_score'].astype(str).str.replace('[a-zA-Z\s\D]', '', regex=True)
df['home_score'] = df['home_score'].astype(str).str.replace('[a-zA-Z\s\D]', '', regex=True)
df = df[df.home_score != "."]
df = df[df.home_score != ".."]
df = df[df.home_score != "."]
df = df[df.home_odds != "-"]
df = df[df.draw_odds != "-"]
df = df[df.away_odds != "-"]
m = df[['home_odds', 'draw_odds', 'away_odds']].astype(str).agg(lambda x: x.str.count('/'), 1).ne(0).all(1)
n = df[['home_score']].agg(lambda x: x.str.count('-'), 1).ne(0).all(1)
o = df[['away_score']].agg(lambda x: x.str.count('-'), 1).ne(0).all(1)
df = df[~m]
df = df[~n]
df = df[~o]
df = df[df.home_score != '']
df = df[df.away_score != '']
df = df.dropna()
df = df.loc[:, df.columns.intersection(
['datetime', 'country', 'league', 'home_team', 'away_team', 'home_odds', 'draw_odds', 'away_odds', 'home_score',
'away_score'])]
colt = {
'country': str,
'league': str,
'home_team': str,
'away_team': str,
'home_odds': float,
'draw_odds': float,
'away_odds': float,
'home_score': int,
'away_score': int
}
df = df.astype(colt)
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
# Cleaning data where odds are greater than 100 and less than -1 and dropping duplicates
df = df[df['home_odds'] <= 100]
df = df[df['draw_odds'] <= 100]
df = df[df['away_odds'] <= 100]
df = df.drop_duplicates(['datetime', 'home_score', 'away_score', 'country', 'league', 'home_team', 'away_team'],
keep='last')
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
df.to_csv()
current output:
df:
home_odds draw_odds away_odds country league datetime home_team away_team home_score away_score
0 1.36 4.31 7.66 Brazil Copa do Nordeste 2020-02-07 00:00:00 Sport Recife Imperatriz 2 2
1 2.62 3.30 2.48 Brazil Copa do Nordeste 2020-02-02 22:00:00 ABC America RN 2 1
2 5.19 3.58 1.62 Brazil Copa do Nordeste 2020-02-02 00:00:00 Frei Paulistano Nautico 0 2
3 2.06 3.16 3.50 Brazil Copa do Nordeste 2020-02-02 22:00:00 Botafogo PB Confianca 1 1
4 2.19 2.98 3.38 Brazil Copa do Nordeste 2020-02-02 22:00:00 Fortaleza Ceara 1 1
5 3.41 3.07 2.10 Brazil Copa do Nordeste 2020-02-02 20:00:00 River-PI CSA 3 1
6 1.92 3.20 3.93 Brazil Copa do Nordeste 2020-02-02 20:00:00 Sport Recife Vitoria 1 1
7 2.22 3.06 3.18 Brazil Copa do Nordeste 2020-01-29 23:30:00 CRB Santa Cruz 1 0
8 1.31 4.75 8.55 Brazil Copa do Nordeste 2020-01-29 00:00:00 Bahia Imperatriz 2 0
9 1.20 5.43 12.94 Brazil Copa do Nordeste 2020-01-26 22:00:00 Ceara Frei Paulistano 2 2
10 2.09 3.01 3.58 Brazil Copa do Nordeste 2020-01-26 00:00:00 Confianca ABC 1 0
11 2.14 3.30 3.11 Brazil Copa do Nordeste 2020-01-26 22:00:00 America RN Botafogo PB 0 0
12 2.63 3.04 2.62 Brazil Copa do Nordeste 2020-01-26 22:00:00 CSA Sport Recife 0 1
13 3.72 3.22 1.96 Brazil Copa do Nordeste 2020-01-26 20:00:00 Santa Cruz Bahia 0 0
14 3.13 3.02 2.27 Brazil Copa do Nordeste 2020-01-26 20:00:00 Vitoria Fortaleza 0 0
15 1.44 4.06 6.19 Brazil Copa do Nordeste 2020-01-23 23:30:00 Nautico River-PI 1 1
16 2.34 2.94 3.08 Brazil Copa do Nordeste 2020-01-21 23:00:00 Imperatriz CRB 2 1
17 1.40 4.18 7.26 Brazil Copa do Nordeste 2019-05-16 00:30:00 Nautico Campinense 2 0
18 1.40 4.19 7.32 Brazil Copa do Nordeste 2019-05-14 23:00:00 CRB Juazeirense 2 1
19 1.44 3.83 7.02 Brazil Copa do Nordeste 2019-05-08 22:30:00 ABC AE Altos 2 0