-1

I'm trying to do the following: "# drop all rows where tag == train_loop and start is NaN".

Here's my current attempt (thanks Copilot):

# drop all rows where tag == train_loop and start is NaN
# apply filter function to each row
# return True if row should be dropped
def filter_fn(row):
    return row["tag"] == "train_loop" and pd.isna(row["start"]):

old_len = len(df)
df = df[~df.apply(filter_fn, axis=1)]

It works well, but I'm wondering if there is a less verbose way.

Foobar
  • 7,458
  • 16
  • 81
  • 161

2 Answers2

2

using apply is a really bad way to do this actually, since it loops over every row, calling the function you defined in python. Instead, use vectorized functions which you can call on the entire dataframe, which call optimized/vectorized versions written in C under the hood.

df = df[~((df["tag"] == "train_loop") & df["start"].isnull())]

If your data is large (>~100k rows), then even faster would be to use pandas query methods, where you can write both conditions in one:

df = df.query(
    '~((tag == "train_loop") and (start != start))'
)

This makes use of the fact that NaNs never equal anything, including themselves, so we can use simple logical operators to find NaNS (.isnull() isn't available in the compiled query mini-language). For the query method to be faster, you need to have numexpr installed, which will compile your queries on the fly before they're called on the data.

See the docs on enhancing performance for more info and examples.

Michael Delgado
  • 13,789
  • 3
  • 29
  • 54
  • Do you know why pandas uses `~` and `&` instead of `and` and `not`, is it to support operator overloading? – Foobar Nov 14 '22 at 01:39
  • `~` and `&` are bitwise operators which act on all elements of an array element-wise. `and` and `not` are logical operators which act on single boolean values. you should do `True and True`, `False or True`; but you should do `~[True, True, False] --> [False, False, True]` and `[True, True] & [False, True] --> [False, True]`. (these examples work with pandas/numpy arrays, not actual python lists). These operators are overloaded from actual bitwise ops, e.g. `~1100011` --> `0011100` – Michael Delgado Nov 14 '22 at 02:07
  • this is a super important distinction in numpy/pandas - see this: https://stackoverflow.com/questions/21415661/logical-operators-for-boolean-indexing-in-pandas – Michael Delgado Nov 14 '22 at 02:12
  • Thanks! Now, let's say I wanted to check if `df["start"]` was true? How might I go about doing that? – Foobar Nov 14 '22 at 02:48
1

You can do

df = df.loc[~(df['tag'].eq('train_loop') & df['start'].isna())]
BENY
  • 317,841
  • 20
  • 164
  • 234
  • Small question for clarity/learning: This answer is interesting because the two conditions are encapsulated by one bracket pair, where the usual multi-condition syntax seems to be ((condition 1) & (condition 2)). The code works as intended, but are both variants possible under normal select conditions, or is this only on account of the tilde negation? – PeptideWitch Nov 14 '22 at 01:14
  • 2
    @PeptideWitch that is due to we use . chain with all conditions , which dose not require Parentheses for each condition. – BENY Nov 14 '22 at 01:19
  • 1
    both are the same - there are no parens needed around `df["tag"].eq("train_loop")` because it's a single chained method call, so there's no ambiguity about precedence. you'd want parentheses around `(A == B) & (C == D)` to clarify that you mean to use the bitwise operator on the results of the equality operations, rather than meaning, e.g. `A == (B & C) == D`. The bitwise operator `&` has higher precedence than `==` in python, so this isn't actually just good style - it's required to indicate the former. Try `1 == 1 & 2 == 2` (evaluates False) to see what I mean – Michael Delgado Nov 14 '22 at 01:19