0

I'm trying to add a "conditional" column to my dataframe. I can do it with a for loop but I understand this is not efficient. Can my code be simplified and made more efficient? (I've tried masks but I can't get my head around the syntax as I'm a relative newbie to python).

import pandas as pd

path = (r"C:\Users\chris\Documents\UKHR\PythonSand\PY_Scripts\CleanModules\Racecards")
hist_file = r"\x3RC_trnhist.xlsx"

racecard_path = path + hist_file
df = pd.read_excel(racecard_path)

df["Mask"] = df["HxFPos"].copy
df["Total"] = df["HxFPos"].copy
cnt = -1
for trn in df["HxRun"]:

cnt = cnt + 1
if df.loc[cnt,"HxFPos"] > 6 or df.loc[cnt,"HxTotalBtn"] > 30:
    df.loc[cnt,"Mask"] = 0
elif df.loc[cnt,"HxFPos"] < 2 and df.loc[cnt,"HxRun"] < 4 and df.loc[cnt,"HxTotalBtn"] < 10:
    df.loc[cnt,"Mask"] = 1
elif df.loc[cnt,"HxFPos"] < 4 and df.loc[cnt,"HxRun"] < 9 and df.loc[cnt,"HxTotalBtn"] < 10:
    df.loc[cnt,"Mask"] = 1
elif df.loc[cnt,"HxFPos"] < 5 and df.loc[cnt,"HxRun"] < 20 and df.loc[cnt,"HxTotalBtn"] < 20:
    df.loc[cnt,"Mask"] = 1
else: 
    df.loc[cnt,"Mask"] = 0
df.loc[cnt,"Total"] = df.loc[cnt,"Mask"] * df.loc[cnt,"HxFPos"]

df.to_excel(r'C:\Users\chris\Documents\UKHR\PythonSand\PY_Scripts\CleanModules\Racecards\cond_col.xlsx', index = False)

Sample data/output:

HxRun   HxFPos  HxTotalBtn  Mask    Total
7   5   8   0   0
13  3   2.75    1   3
12  5   3.75    0   0
11  5   5.75    0   0
11  7   9.25    0   0
11  9   14.5    0   0
10  10  26.75   0   0
8   4   19.5    1   4
8   8   67  0   0
smci
  • 32,567
  • 20
  • 113
  • 146
Chris M
  • 27
  • 9
  • Don't say 'Python' when you mean 'pandas' :). pandas has a much richer vectorized set of operators and methods than base Python. Also ***vectorized*** is the term for *"without a for-loop"*. – smci Feb 16 '22 at 20:22
  • We can't reproduce this (no [mcve]) since we don't have your input xlsx file or input data. (In particular we don't know what `df["Mask"]` was before your code modifies it). You need to replace that xlsx file-reading code with some code that defines a `pd.DataFrame(...)` – smci Feb 16 '22 at 20:28
  • Also the indentation on your for-loop contents is missing, please fix it, otherwise this won't execute. – smci Feb 16 '22 at 20:34

2 Answers2

0

Edit - this is where I found an answer: Pandas conditional creation of a series/dataframe column

by @Hossein-Kalbasi

I've just found an answer - please comment if this is not the most efficient.

df.loc[(((df['HxFPos']<3)&(df['HxRun']<5)|(df['HxRun']>4)&(df['HxFPos']<5)&(df['HxRun']<9)|(df['HxRun']>8)&(df['HxFPos']<6)&(df['HxRun']<30))&(df['HxTotalBtn']<30)), 'Mask'] = 1
Chris M
  • 27
  • 9
  • This is *really* hard to follow, please add spaces between the main subexpressions; also I don't get where `(df['HxRun']<5)` came from, it wasn't in the original. If you transformed the clause ranges (off-by-one?) or something, explain in words. (And you could still have slight bugs, just because this passes on your sample data doesn't mean it is entirely correct, `df['HxRun']<5` and `df['HxRun']>4` will both give True on float input like 4.5) – smci Feb 16 '22 at 21:03
  • Thanks again @smci and my apologies for my coding being messy. The '(df['HxRun']<5)' was an amendment to the original conditions. The column values used in the conditions are integers so float inputs will not occur. Cheers – Chris M Feb 16 '22 at 22:16
  • 1
    Perhaps, but the logical inversion of `HxRun<5` is `HxRun>=5`, not `HxRun>=6`. Don't get too sloppy with assuming integer domain for everything, someday it will bite you... – smci Feb 16 '22 at 22:21
  • Re the quesrion you cited: yes you could use `df.where...` but like I keep stressing you have a complex condition involving three columns. – smci Feb 16 '22 at 22:23
0

Use df.assign() for a complex vectorized expression

Use vectorized pandas operators and methods, where possible; avoid iterating. You can do a complex vectorized expression/assignment like this with:

  • .loc[]
  • df.assign()
  • or alternatively df.query (if you like SQL syntax)

or if you insist on doing it by iteration (you shouldn't), you never need to use an explicit for-loop with .loc[] as you did, you can use:

  • df.apply(your_function_or_lambda, axis=1)
  • or df.iterrows() as a fallback

df.assign() (or df.query) are going to be less grief when you have long column names (as you do) which get used repreatedly in a complex expression.

Solution with df.assign()

Rewrite your fomula for clarity

When we remove all the unneeded .loc[] calls your formula boils down to:

HxFPos > 6 or HxTotalBtn > 30: 
    Mask = 0
HxFPos < 2 and HxRun < 4 and HxTotalBtn < 10: 
    Mask = 1
HxFPos < 4 and HxRun < 9 and HxTotalBtn < 10: 
    Mask = 1
HxFPos < 5 and HxFPos < 20 and HxTotalBtn < 20: 
    Mask = 1
else: 
    Mask = 0 

pandas doesn't have a native case-statement/method. Renaming your variables HxFPos->f, HxFPos->r, HxTotalBtn->btn for clarity:

(f > 6) or (btn > 30):
    Mask = 0
(f < 2) and (r < 4) and (btn < 10):
    Mask = 1
(f < 4) and (r < 9) and (btn < 10):
    Mask = 1
(f < 5) and (r < 20) and (btn < 20): 
    Mask = 1
else:
    Mask = 0

So really the whole boolean expression for Mask is gated by (f <= 6) or (btn <= 30). (Actually your clauses imply you can only have Mask=1 for (f < 5) and (r < 20) and (btn < 20), if you want to optimize further.)

Mask = ((f<= 6) & (btn <= 30)) & ... you_do_the_rest 

Vectorize your expressions

So, here's a vectorized rewrite of your first line. Note that comparisons > and < are vectorized, that the vectorized boolean operators are | and & (instead of 'and', 'or'), and you need to parenthesize your comparisons to get the operator precedence right:

>>> (df['HxFPos']>6) | (df['HxTotalBtn']>30)

0    False
1    False
2    False
3    False
4     True
5     True
6     True
7    False
8     True
dtype: bool

Now that output is a logical expression (vector of 8 bools); you can use that directly in df.loc[logical_expression_for_row, 'Mask'].

Similarly:

((df['HxFPos']<2) & (df['HxRun']<4)) & (df['HxTotalBtn']<10)
smci
  • 32,567
  • 20
  • 113
  • 146
  • Thanks @smci - your comments and answer are much appreciated. My previous coding has been limited: BASIC and excel VBA so I'm still trying to get to grips with python etc. – Chris M Feb 16 '22 at 22:10