0

I am trying to cleanup a dataframe (and the code) in Pandas.

I used to use the modin library before but I have since switched from Win 10 to Win 11 and none of ray and dask libraries work on Windows 11 and I have raised an issue with respective repositories.

In the meanwhile, I am intending to use multiprocessing as an alternative to the cause however, I think there can be a better manner in how the code can be written.

I intend to use the

import pandas as pd
import multiprocessing as mp
def cleanup(x):
    clean df
    return x
df = df.apply(cleanup)

way.

example:

df.head(20):
    Unnamed: 0                       game score home_odds draw_odds away_odds country                 league             datetime
0            0  Sport Recife - Imperatriz   2:2      1.36      4.31      7.66  Brazil  Copa do Nordeste 2020  2020-02-07 00:00:00
1            1           ABC - America RN   2:1      2.62      3.30      2.48  Brazil  Copa do Nordeste 2020  2020-02-02 22:00:00
2            2  Frei Paulistano - Nautico   0:2      5.19      3.58      1.62  Brazil  Copa do Nordeste 2020  2020-02-02 00:00:00
3            3    Botafogo PB - Confianca   1:1      2.06      3.16       3.5  Brazil  Copa do Nordeste 2020  2020-02-02 22:00:00
4            4          Fortaleza - Ceara   1:1      2.19      2.98      3.38  Brazil  Copa do Nordeste 2020  2020-02-02 22:00:00
5            5             River-PI - CSA   3:1      3.41      3.07       2.1  Brazil  Copa do Nordeste 2020  2020-02-02 20:00:00
6            6     Sport Recife - Vitoria   1:1      1.92      3.20      3.93  Brazil  Copa do Nordeste 2020  2020-02-02 20:00:00
7            7           CRB - Santa Cruz   1:0      2.22      3.06      3.18  Brazil  Copa do Nordeste 2020  2020-01-29 23:30:00
8            8         Bahia - Imperatriz   2:0      1.31      4.75      8.55  Brazil  Copa do Nordeste 2020  2020-01-29 00:00:00
9            9    Ceara - Frei Paulistano   2:2       1.2      5.43     12.94  Brazil  Copa do Nordeste 2020  2020-01-26 22:00:00
10          10            Confianca - ABC   1:0      2.09      3.01      3.58  Brazil  Copa do Nordeste 2020  2020-01-26 00:00:00
11          11   America RN - Botafogo PB   0:0      2.14      3.30      3.11  Brazil  Copa do Nordeste 2020  2020-01-26 22:00:00
12          12         CSA - Sport Recife   0:1      2.63      3.04      2.62  Brazil  Copa do Nordeste 2020  2020-01-26 22:00:00
13          13         Santa Cruz - Bahia   0:0      3.72      3.22      1.96  Brazil  Copa do Nordeste 2020  2020-01-26 20:00:00
14          14        Vitoria - Fortaleza   0:0      3.13      3.02      2.27  Brazil  Copa do Nordeste 2020  2020-01-26 20:00:00
15          15         Nautico - River-PI   1:1      1.44      4.06      6.19  Brazil  Copa do Nordeste 2020  2020-01-23 23:30:00
16          16           Imperatriz - CRB   2:1      2.34      2.94      3.08  Brazil  Copa do Nordeste 2020  2020-01-21 23:00:00
17          17       Nautico - Campinense   2:0       1.4      4.18      7.26  Brazil  Copa do Nordeste 2020  2019-05-16 00:30:00
18          18          CRB - Juazeirense   2:1       1.4      4.19      7.32  Brazil  Copa do Nordeste 2020  2019-05-14 23:00:00
19          19             ABC - AE Altos   2:0      1.44      3.83      7.02  Brazil  Copa do Nordeste 2020  2019-05-08 22:30:00

current code:

import pandas as pd

df = pd.read_csv()
df['game'] = df['game'].astype(str).str.replace('(\(\w+\))', '', regex=True)
df['league'] = df['league'].astype(str).str.replace('(\s\d+\S\d+)$', '', regex=True)
df['league'] = df['league'].astype(str).str.replace('é', 'e', regex=True)
df['game'] = df['game'].astype(str).str.replace('(\s\d+\S\d+)$', '', regex=True)
df[['home_team', 'away_team']] = df['game'].str.split(' - ', expand=True, n=1)
df[['home_score', 'away_score']] = df['score'].str.split(':', expand=True)
df['away_score'] = df['away_score'].astype(str).str.replace('[a-zA-Z\s\D]', '', regex=True)
df['home_score'] = df['home_score'].astype(str).str.replace('[a-zA-Z\s\D]', '', regex=True)
df = df[df.home_score != "."]
df = df[df.home_score != ".."]
df = df[df.home_score != "."]
df = df[df.home_odds != "-"]
df = df[df.draw_odds != "-"]
df = df[df.away_odds != "-"]
m = df[['home_odds', 'draw_odds', 'away_odds']].astype(str).agg(lambda x: x.str.count('/'), 1).ne(0).all(1)
n = df[['home_score']].agg(lambda x: x.str.count('-'), 1).ne(0).all(1)
o = df[['away_score']].agg(lambda x: x.str.count('-'), 1).ne(0).all(1)
df = df[~m]
df = df[~n]
df = df[~o]
df = df[df.home_score != '']
df = df[df.away_score != '']
df = df.dropna()
df = df.loc[:, df.columns.intersection(
    ['datetime', 'country', 'league', 'home_team', 'away_team', 'home_odds', 'draw_odds', 'away_odds', 'home_score',
     'away_score'])]

colt = {
    'country': str,
    'league': str,
    'home_team': str,
    'away_team': str,
    'home_odds': float,
    'draw_odds': float,
    'away_odds': float,
    'home_score': int,
    'away_score': int
}
df = df.astype(colt)
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)

# Cleaning data where odds are greater than 100 and less than -1 and dropping duplicates
df = df[df['home_odds'] <= 100]
df = df[df['draw_odds'] <= 100]
df = df[df['away_odds'] <= 100]
df = df.drop_duplicates(['datetime', 'home_score', 'away_score', 'country', 'league', 'home_team', 'away_team'],
                        keep='last')
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)

df.to_csv()

current output:

df:
home_odds  draw_odds  away_odds country            league             datetime        home_team        away_team  home_score  away_score
0        1.36       4.31       7.66  Brazil  Copa do Nordeste  2020-02-07 00:00:00     Sport Recife       Imperatriz           2           2
1        2.62       3.30       2.48  Brazil  Copa do Nordeste  2020-02-02 22:00:00              ABC       America RN           2           1
2        5.19       3.58       1.62  Brazil  Copa do Nordeste  2020-02-02 00:00:00  Frei Paulistano          Nautico           0           2
3        2.06       3.16       3.50  Brazil  Copa do Nordeste  2020-02-02 22:00:00      Botafogo PB        Confianca           1           1
4        2.19       2.98       3.38  Brazil  Copa do Nordeste  2020-02-02 22:00:00        Fortaleza            Ceara           1           1
5        3.41       3.07       2.10  Brazil  Copa do Nordeste  2020-02-02 20:00:00         River-PI              CSA           3           1
6        1.92       3.20       3.93  Brazil  Copa do Nordeste  2020-02-02 20:00:00     Sport Recife          Vitoria           1           1
7        2.22       3.06       3.18  Brazil  Copa do Nordeste  2020-01-29 23:30:00              CRB       Santa Cruz           1           0
8        1.31       4.75       8.55  Brazil  Copa do Nordeste  2020-01-29 00:00:00            Bahia       Imperatriz           2           0
9        1.20       5.43      12.94  Brazil  Copa do Nordeste  2020-01-26 22:00:00            Ceara  Frei Paulistano           2           2
10       2.09       3.01       3.58  Brazil  Copa do Nordeste  2020-01-26 00:00:00        Confianca              ABC           1           0
11       2.14       3.30       3.11  Brazil  Copa do Nordeste  2020-01-26 22:00:00       America RN      Botafogo PB           0           0
12       2.63       3.04       2.62  Brazil  Copa do Nordeste  2020-01-26 22:00:00              CSA     Sport Recife           0           1
13       3.72       3.22       1.96  Brazil  Copa do Nordeste  2020-01-26 20:00:00       Santa Cruz            Bahia           0           0
14       3.13       3.02       2.27  Brazil  Copa do Nordeste  2020-01-26 20:00:00          Vitoria        Fortaleza           0           0
15       1.44       4.06       6.19  Brazil  Copa do Nordeste  2020-01-23 23:30:00          Nautico         River-PI           1           1
16       2.34       2.94       3.08  Brazil  Copa do Nordeste  2020-01-21 23:00:00       Imperatriz              CRB           2           1
17       1.40       4.18       7.26  Brazil  Copa do Nordeste  2019-05-16 00:30:00          Nautico       Campinense           2           0
18       1.40       4.19       7.32  Brazil  Copa do Nordeste  2019-05-14 23:00:00              CRB      Juazeirense           2           1
19       1.44       3.83       7.02  Brazil  Copa do Nordeste  2019-05-08 22:30:00              ABC         AE Altos           2           0
martineau
  • 119,623
  • 25
  • 170
  • 301
leonardo
  • 140
  • 10
  • What is your motivation behind wanting to use multiprocessing? MP is almost always a bad idea if your data has to be merged at the end simply because you have to have some synchronization, IPC, which is all clunky unless done right and almost unnecessary. You perhaps might want to benchmark sections of your code before you decide to make a jump to any form of concurrency. – sarthakt May 02 '22 at 10:30
  • 1
    This thread can be of help https://stackoverflow.com/questions/40357434/pandas-df-iterrows-parallelization – alec_djinn May 02 '22 at 10:38
  • @sarthakt The the dataframe is a pretty big one (500MB) and `df = df.astype(colt)` takes the most amount of time. `modin` worked well as it processed the said dataframe in 2 minutes while the current code with pandas takes around 35 mins to execute. I have a 16 core processor system which can be efficiently utilised with multiprocessing – leonardo May 02 '22 at 10:42
  • 1
    I just saw what `modin`, `dask` and `ray` do and a lot of that is using threading. However, I also suspect that they are very low on data copies internally as a lot of their code is written in C++ (pointers and references doing their job). Most splices in python are memory heavy due to the copies they create. Most performant python libs avoid this by being written in C/C++. The thread posted by @alec_djinn might be of use, but I have my reservations about the significance of gains you might get. You are free to try it though, let us know of the improvements that you see. – sarthakt May 02 '22 at 10:55
  • I tried to run the same code in VSCode with `modin` and `ray` and it worked. (Earlier I was trying in Pycharm) I guess the issue must be with path. I miss the familiarity with Pycharm IDE but, VSCode does not seem to have a steep learning curve. – leonardo May 05 '22 at 10:57

0 Answers0