3

Consider a Pandas Dataframe like:

df = pd.DataFrame([[0,2],[1,0],[7,99]],index=[3,4,8], columns = ["R1","R2"])

Giving:

   R1  R2
3   0   2
4   1   0
8   7  99

When I want to remove a row via a condition on a value I either use

df = df.drop(df[df["R1"] == 1].index)

Or

df = df.drop(df.index[np.where(df["R1"] == 1)[0]])

Or

df = df.drop(df.loc[df['R1'] == 1].index)

Both is super cumbersome. Do you know a easier syntax to achieve this?

For example, if there was something like a idrop function, the second option would be more readably:

df = df.idrop(np.where(df["R1"] == 1)[0])

EDIT:

I had assumed that df = df[df['R1'] != 1] is less performant then just dropping a row. (Huge Database...)

jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
newandlost
  • 935
  • 2
  • 10
  • 21
  • 1
    Why do you even need to drop? just select. `df = df.drop(df[df["R1"] == 1].index)` can be `df = df[df['R1'] != 1]` – DeepSpace Nov 13 '18 at 09:28
  • It is not the same question. In the oder question it is asked to to access rows based on a condition. I am asking for the easiest syntax to drop rows. – newandlost Nov 13 '18 at 09:31
  • Is this the same? I had assumed that `df = df[df['R1'] != 1] ` is less performant then just dropping a row. (Huge Database...) – newandlost Nov 13 '18 at 09:33
  • @jezrael it is not a duplicate – newandlost Nov 13 '18 at 09:35
  • Why would it be? Something still needs to check if a given row has that value. If anything, getting only the correct rows from the DB to begin with will (should) be the fastest. Check `pd.read_sql` – DeepSpace Nov 13 '18 at 09:37
  • Thanks a lot for this comparison. It makes sens what you are saying. It anyway must go through the database. Also does not change if using inplace = True. – newandlost Nov 13 '18 at 09:42
  • @newandlost - if use db the fastest is filter in database, like DeepSpace pointed. Btw, in pandas should be obviously simple rule - more functions is slowier - `boolena indexing` vs `boolena indexing + drop` here. – jezrael Nov 13 '18 at 09:44

1 Answers1

8

No, boolean indexing is faster like drop and, if need improve performance, use numexpr:

enter image description here

import perfplot, numexpr

def bi1(df):
    return df[df['R1'].values != 1]

def bi2(df):
    return df[df['R1'] != 1]

def drop1(df):
    return df.drop(df[df["R1"] == 1].index)

def drop2(df):
    return df.drop(df.index[df["R1"] == 1])

def drop3(df):
    return df.drop(df.loc[df['R1'] == 1].index)

def drop4(df):
    return df.drop(np.where(df["R1"] == 1)[0])


def ne(x):
    x = x['R1'].values
    return x[numexpr.evaluate('(x != 1)')]

def q(x):
    return x.query('R1 != 1')

def ev(x):
    return x[x.eval('R1 != 1')]


def make_df(n):
    df = pd.DataFrame({'R1':np.random.randint(100, size=n)})
    return df

perfplot.show(
    setup=make_df,
    kernels=[bi1, bi2,drop1,drop2,drop3,drop4,ne,q,ev],
    n_range=[2**k for k in range(2, 25)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')
Giacomo Catenazzi
  • 8,519
  • 2
  • 24
  • 32
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252