Step left rows values until not null in python

Question

I have to handle a huge amount of data. Every row starts with 1 or 0. I need a dataframe where every rows start with 1, so I have to step left all rows values till the first value is 1.

For example:

0 1 0 0 1 0 0
1 0 0 0 0 1 1
0 0 0 1 0 0 1
0 0 0 0 0 1 1

The result has to be this:

1 0 0 1 0 0 0
1 0 0 0 0 1 1
1 0 0 1 0 0 0
1 1 0 0 0 0 0

I don't want to use for, while, etc., because I need some faster methods with pandas or numpy.

Do you have idea for this problem?

@pault want to flag as dupe? Just noticed my solution is basically the accepted answer there. — user3483203, Oct 16 '18 at 17:10
@user3483203 I don't know- it depends on if the desired shift is circular, no? — pault, Oct 16 '18 at 17:22
@pault the shift being circular should never matter, because you will always stop when a 1 is in index 0, which means that a 1 will never wrap around. Either way, the other answer doesn't talk about how to determine the offset, so I think it's fine to leave this open — user3483203, Oct 16 '18 at 18:03

score 4 · Answer 1 · answered Oct 16 '18 at 16:35

You may using with cummax to mask all position need to shift as NaN and sorted

df[df.cummax(1).ne(0)].apply(lambda x : sorted(x,key=pd.isnull),1).fillna(0).astype(int)
Out[310]: 
   1  2  3  4  5  6  7
0  1  0  0  1  0  0  0
1  1  0  0  0  0  1  1
2  1  0  0  1  0  0  0
3  1  1  0  0  0  0  0

Or we using the function justify write by Divakar(much faster than the apply sorted)

pd.DataFrame(justify(df[df.cummax(1).ne(0)].values, invalid_val=np.nan, axis=1, side='left')).fillna(0).astype(int)
Out[314]: 
   0  1  2  3  4  5  6
0  1  0  0  1  0  0  0
1  1  0  0  0  0  1  1
2  1  0  0  1  0  0  0
3  1  1  0  0  0  0  0

Man, Divakar is a numpy god, damn – rafaelc Oct 16 '18 at 16:40 — rafaelc, Oct 16 '18 at 16:40

user3483203 · Answer 2 · 2018-10-16T17:00:47.113

You can make use of numpy.ogrid here:

a = df.values
s = a.argmax(1) * - 1
m, n = a.shape
r, c = np.ogrid[:m, :n]
s[s < 0] += n
c = c - s[:, None]
a[r, c]

array([[1, 0, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0, 1, 1],
       [1, 0, 0, 1, 0, 0, 0],
       [1, 1, 0, 0, 0, 0, 0]], dtype=int64)

Timings

In [35]: df = pd.DataFrame(np.random.randint(0, 2, (1000, 1000)))

In [36]: %timeit pd.DataFrame(justify(df[df.cummax(1).ne(0)].values, invalid_val=np.nan, axis=1, side='left')).fillna(0).a
    ...: stype(int)
116 ms ± 640 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [37]: %%timeit
    ...: a = df.values
    ...: s = a.argmax(1) * - 1
    ...: m, n = a.shape
    ...: r, c = np.ogrid[:m, :n]
    ...: s[s < 0] += n
    ...: c = c - s[:, None]
    ...: pd.DataFrame(a[r, c])
    ...:
    ...:
11.3 ms ± 18.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

score 2 · Answer 3 · answered Oct 16 '18 at 17:13

For performance, you can use numba. An elementary loop, but effective given JIT-compilation and use of more basic objects at C-level:

from numba import njit

@njit
def shifter(A):
    res = np.zeros(A.shape)
    for i in range(res.shape[0]):
        start, end = 0, 0
        for j in range(res.shape[1]):
            if A[i, j] != 0:
                start = j
                break
        res[i, :res.shape[1]-start] = A[i, start:]
    return res

Performance benchmarking

def jpp(df):
    return pd.DataFrame(shifter(df.values).astype(int))

def user348(df):
    a = df.values
    s = a.argmax(1) * - 1
    m, n = a.shape
    r, c = np.ogrid[:m, :n]
    s[s < 0] += n
    c = c - s[:, None]
    return pd.DataFrame(a[r, c])    

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (1000, 1000)))

assert np.array_equal(jpp(df).values, user348(df).values)

%timeit jpp(df)      # 9.2 ms per loop
%timeit user348(df)  # 18.5 ms per loop

Paul Panzer · Answer 4 · 2018-10-16T18:41:43.843

Here is a stride_tricks solution, which is fast because it enables slice-wise copying.

def pp(x):
    n, m = x.shape
    am = x.argmax(-1)
    mam = am.max()
    xx = np.empty((n, m + mam), x.dtype)
    xx[:, :m] = x
    xx[:, m:] = 0
    xx = np.lib.stride_tricks.as_strided(xx, (n, mam+1, m), (*xx.strides, xx.strides[-1]))
    return xx[np.arange(x.shape[0]), am]

It pads the input with the required number of zeros and then creates a sliding window view using as_strided. This is addressed using fancy indexing, but necause the last dimension is not indexed copying of lines is optimized and fast.

How fast? For large enough inputs on par with numba:

x = np.random.randint(0, 2, (10000, 10))

from timeit import timeit

shifter(x) # that should compile it, right?

print(timeit(lambda:shifter(x).astype(x.dtype), number=1000))
print(timeit(lambda:pp(x), number=1000))

Sample output:

0.8630472810036736
0.7336142909916816

Yep, I see this small +/- versus `numba` for different shapes. I need to learn `np.lib.stride_tricks`, though am slightly put off by the disclaimer in [the docs](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.lib.stride_tricks.as_strided.html). — jpp, Oct 16 '18 at 18:53
@jpp the sliding window view also exists as a ready-made function in `skimage.util`, which reduces the danger of shooting ones foot. — Paul Panzer, Oct 16 '18 at 19:01

Step left rows values until not null in python

4 Answers4