I have a DataFrame (a demo is generated by generate_data()) and want to compute a value from it using these rules:

  1. If the first value in the data column is False, return 0.
  2. If the first value in the data column is True, return the order value at the last position of the leading run of consecutive True values.
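
The two rules can be written as a small pure-Python reference, which is handy for checking faster variants. This is only a sketch: the name `leading_true_order` is hypothetical, and it assumes `data` is a sequence of booleans and `order` is an indexable sequence of the same length.

```python
from itertools import takewhile


def leading_true_order(data, order):
    """Return the order value at the end of the leading run of True values, or 0."""
    # Count the True values at the start of `data`; counting stops at the first False.
    n_true = sum(1 for _ in takewhile(bool, data))
    # Rule 1: no leading True values -> 0.
    # Rule 2: otherwise, the order value at the last leading True.
    return order[n_true - 1] if n_true else 0


data = [True, True, False, False, True, False]
order = list(range(1, 7))
print(leading_true_order(data, order))  # 2 for the demo data
```

With the DataFrame from generate_data(), the inputs would be `df['data'].to_list()` and `df['order'].to_list()`.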

I wrote two functions, sort_order() and sort_order2():

%timeit sort_order(df.copy())
1.12 ms ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit sort_order2(df.copy())
715 µs ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Is there a faster way?

My code is as follows:

import pandas as pd
import numpy as np


def generate_data():
    order = range(1,7)
    data = [True, True, False, False, True, False]
    c = {'order': order,
         'data': data}
    df = pd.DataFrame(c)
    return df


def sort_order(df):
    order_first_false = df.loc[~df.data, 'order']
    if len(order_first_false) == 0:
        order_last_true = df.order.values[-1]
    else:
        order_first_false = order_first_false.values[0]
        df = df[df.order < order_first_false]
        if len(df):
            order_last_true = df.order.values[-1]
        else:
            order_last_true = 0
    return order_last_true


def sort_order2(df):
    groups = df[f'data'].ne(True).cumsum()
    len_true = len(groups[groups == 0])
    if len_true:
        order_last_true = df.at[df.index[len_true - 1], 'order'].max()
    else:
        order_last_true = 0
    return order_last_true


def main():
    df = generate_data()
    print(df)

    order_last_true = sort_order(df.copy())
    print(order_last_true)

    order_last_true = sort_order2(df.copy())
    print(order_last_true)


if __name__ == '__main__':
    main()

The result I expect is:

   order   data
0      1   True
1      2   True
2      3  False
3      4  False
4      5   True
5      6  False

2

2

jaried

2 Answers


Use numba to scan the values only up to the end of the first block of True values, inspired by this solution:

from numba import njit

@njit
def sort_order3(a, b):
    if not a[0]:
        return 0
    else:
        for i in range(1, len(a)):
            if not a[i]:
                return b[i - 1]
        return b[-1]


  
df = generate_data()
print(sort_order3(df['data'].to_numpy(), df['order'].to_numpy()))
jezrael
  • Thanks for your answer; it is very valuable to me. %timeit sort_order3(df.copy()['data'].to_numpy(), df.copy()['order'].to_numpy()): 300 µs ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) – jaried Dec 11 '21 at 05:13

Maybe I am missing something, but why don't you just get the index of the first False in df.data and then use that index to get the value in the df.order column?

For example:

def sort_order3(df):
    try:
        idx = df.data.to_list().index(False)
    except ValueError: # meaning there is no False in df.data
        idx = df.data.size - 1
    return df.order[idx]

Or, for really large data, numpy might be faster:

def sort_order4(df):
    try:
        idx = np.argwhere(~df.data.values)[0, 0]
    except IndexError: # meaning there is no False in df.data
        idx = df.data.size - 1
    return df.order[idx]

The timing on my device:

%timeit sort_order(df.copy())
565 µs ± 6.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit sort_order2(df.copy())
443 µs ± 10.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit sort_order3(df.copy())
96.5 µs ± 2.16 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit sort_order4(df.copy())
112 µs ± 5.06 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Andre
  • Thanks. %timeit sort_order3(df.copy()): 226 µs ± 3.08 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each). %timeit sort_order4(df.copy()): 250 µs ± 2.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) – jaried Dec 11 '21 at 04:59
  • There is a problem when the first value is false. – jaried Dec 13 '21 at 16:01
  • What is the issue? – Andre Dec 13 '21 at 16:10
  • If you change the data to data = [False, True, False, False, True, False], sort_order3 and sort_order4 return 1, but the expected value is 0. – jaried Dec 13 '21 at 16:15
  • The function returns the value from the column `df.order` at the index of the first `False` in the `df.data` column. So if the index is 0, the function returns `df.order[0]`, which is 1 (as `df.order == range(1,7)`). – Andre Dec 14 '21 at 08:10
  • As I wrote in the question: if the first value in the data column is false, return 0. – jaried Dec 17 '21 at 11:18
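
One way to reconcile the index-of-first-False idea with the rule from the question is to treat that index as the count of leading True values, then step back one position, returning 0 when the count is zero. The following is a minimal sketch under that assumption, not code from either answer. The name `leading_true_order_np` is hypothetical, and it takes arrays rather than the DataFrame itself.

```python
import numpy as np


def leading_true_order_np(data, order):
    """Order value at the last leading True in `data`, or 0 if data starts with False."""
    data = np.asarray(data, dtype=bool)
    order = np.asarray(order)
    # Positions of all False values; the first one equals the number of leading Trues.
    false_pos = np.flatnonzero(~data)
    n_true = false_pos[0] if false_pos.size else data.size
    # No leading True (or empty input) -> 0; otherwise step back to the last leading True.
    return order[n_true - 1] if n_true else 0


order = np.arange(1, 7)
print(leading_true_order_np([True, True, False, False, True, False], order))   # 2
print(leading_true_order_np([False, True, False, False, True, False], order))  # 0
```

With the DataFrame from the question, the call would be `leading_true_order_np(df['data'].to_numpy(), df['order'].to_numpy())`. Timings are not given here, since they depend on the machine and the data size.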