3

I have to handle a huge amount of data. Every row starts with 1 or 0. I need a dataframe where every rows start with 1, so I have to step left all rows values till the first value is 1.

For example:

0 1 0 0 1 0 0
1 0 0 0 0 1 1
0 0 0 1 0 0 1
0 0 0 0 0 1 1

The result has to be this:

1 0 0 1 0 0 0
1 0 0 0 0 1 1
1 0 0 1 0 0 0
1 1 0 0 0 0 0

I don't want to use for, while, etc., because I need some faster methods with pandas or numpy.

Do you have idea for this problem?

jpp
  • 159,742
  • 34
  • 281
  • 339
KorG
  • 96
  • 7
  • are you doing circular shift or filling with `0`? – pault Oct 16 '18 at 16:31
  • @pault want to flag as dupe? Just noticed my solution is basically the accepted answer there. – user3483203 Oct 16 '18 at 17:10
  • @user3483203 I don't know- it depends on if the desired shift is circular, no? – pault Oct 16 '18 at 17:22
  • 1
    @pault the shift being circular should never matter, because you will always stop when a 1 is in index 0, which means that a 1 will never wrap around. Either way, the other answer doesn't talk about how to determine the offset, so I think it's fine to leave this open – user3483203 Oct 16 '18 at 18:03

4 Answers4

4

You may using with cummax to mask all position need to shift as NaN and sorted

df[df.cummax(1).ne(0)].apply(lambda x : sorted(x,key=pd.isnull),1).fillna(0).astype(int)
Out[310]: 
   1  2  3  4  5  6  7
0  1  0  0  1  0  0  0
1  1  0  0  0  0  1  1
2  1  0  0  1  0  0  0
3  1  1  0  0  0  0  0

Or we using the function justify write by Divakar(much faster than the apply sorted)

pd.DataFrame(justify(df[df.cummax(1).ne(0)].values, invalid_val=np.nan, axis=1, side='left')).fillna(0).astype(int)
Out[314]: 
   0  1  2  3  4  5  6
0  1  0  0  1  0  0  0
1  1  0  0  0  0  1  1
2  1  0  0  1  0  0  0
3  1  1  0  0  0  0  0
BENY
  • 317,841
  • 20
  • 164
  • 234
3

You can make use of numpy.ogrid here:

a = df.values
s = a.argmax(1) * - 1
m, n = a.shape
r, c = np.ogrid[:m, :n]
s[s < 0] += n
c = c - s[:, None]
a[r, c]

array([[1, 0, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0, 1, 1],
       [1, 0, 0, 1, 0, 0, 0],
       [1, 1, 0, 0, 0, 0, 0]], dtype=int64)

Timings

In [35]: df = pd.DataFrame(np.random.randint(0, 2, (1000, 1000)))

In [36]: %timeit pd.DataFrame(justify(df[df.cummax(1).ne(0)].values, invalid_val=np.nan, axis=1, side='left')).fillna(0).a
    ...: stype(int)
116 ms ± 640 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [37]: %%timeit
    ...: a = df.values
    ...: s = a.argmax(1) * - 1
    ...: m, n = a.shape
    ...: r, c = np.ogrid[:m, :n]
    ...: s[s < 0] += n
    ...: c = c - s[:, None]
    ...: pd.DataFrame(a[r, c])
    ...:
    ...:
11.3 ms ± 18.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
user3483203
  • 50,081
  • 9
  • 65
  • 94
2

For performance, you can use numba. An elementary loop, but effective given JIT-compilation and use of more basic objects at C-level:

from numba import njit

@njit
def shifter(A):
    res = np.zeros(A.shape)
    for i in range(res.shape[0]):
        start, end = 0, 0
        for j in range(res.shape[1]):
            if A[i, j] != 0:
                start = j
                break
        res[i, :res.shape[1]-start] = A[i, start:]
    return res

Performance benchmarking

def jpp(df):
    return pd.DataFrame(shifter(df.values).astype(int))

def user348(df):
    a = df.values
    s = a.argmax(1) * - 1
    m, n = a.shape
    r, c = np.ogrid[:m, :n]
    s[s < 0] += n
    c = c - s[:, None]
    return pd.DataFrame(a[r, c])    

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (1000, 1000)))

assert np.array_equal(jpp(df).values, user348(df).values)

%timeit jpp(df)      # 9.2 ms per loop
%timeit user348(df)  # 18.5 ms per loop
jpp
  • 159,742
  • 34
  • 281
  • 339
2

Here is a stride_tricks solution, which is fast because it enables slice-wise copying.

def pp(x):
    n, m = x.shape
    am = x.argmax(-1)
    mam = am.max()
    xx = np.empty((n, m + mam), x.dtype)
    xx[:, :m] = x
    xx[:, m:] = 0
    xx = np.lib.stride_tricks.as_strided(xx, (n, mam+1, m), (*xx.strides, xx.strides[-1]))
    return xx[np.arange(x.shape[0]), am]

It pads the input with the required number of zeros and then creates a sliding window view using as_strided. This is addressed using fancy indexing, but necause the last dimension is not indexed copying of lines is optimized and fast.

How fast? For large enough inputs on par with numba:

x = np.random.randint(0, 2, (10000, 10))

from timeit import timeit

shifter(x) # that should compile it, right?

print(timeit(lambda:shifter(x).astype(x.dtype), number=1000))
print(timeit(lambda:pp(x), number=1000))

Sample output:

0.8630472810036736
0.7336142909916816
Paul Panzer
  • 51,835
  • 3
  • 54
  • 99
  • Yep, I see this small +/- versus `numba` for different shapes. I need to learn `np.lib.stride_tricks`, though am slightly put off by the disclaimer in [the docs](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.lib.stride_tricks.as_strided.html). – jpp Oct 16 '18 at 18:53
  • 1
    @jpp the sliding window view also exists as a ready-made function in `skimage.util`, which reduces the danger of shooting ones foot. – Paul Panzer Oct 16 '18 at 19:01