2

Say I have a Pandas dataframe with 4 rows and 5 columns. For simplicity I will convert it into a Numpy array, which looks like this:

import numpy as np

A = np.array([[23, 43, 23, 110, 5],
              [83, 32, 12, 123, 4],
              [58, 41, 59, 189, 1],
              [93, 77, 22, 170, 3]])

For each row, I want to insert some similar rows right after the row, with only the column 4 decreasing by 1 each time until 0. The expected output should look like this:

np.array([[23, 43, 23, 110, 5],
          [23, 43, 23, 110, 4],
          [23, 43, 23, 110, 3],
          [23, 43, 23, 110, 2],
          [23, 43, 23, 110, 1],
          [23, 43, 23, 110, 0],

          [83, 32, 12, 123, 4],
          [83, 32, 12, 123, 3],
          [83, 32, 12, 123, 2],
          [83, 32, 12, 123, 1],
          [83, 32, 12, 123, 0],

          [58, 41, 59, 189, 1],
          [58, 41, 59, 189, 0],

          [93, 77, 22, 170, 3],
          [93, 77, 22, 170, 2],
          [93, 77, 22, 170, 1],
          [93, 77, 22, 170, 0]])

Below is the code that I have come up with:

new_rows = []
for i, row in enumerate(A):
    new = A[i, 4] - 1
    while new >= 0:
        new_row = row.copy()
        new_row[4] = new
        new_rows.append(new_row)
        new -= 1
new_A = np.vstack([A, np.array(new_rows)])
print(new_A)

Output

[[ 23  43  23 110   5]
 [ 83  32  12 123   4]
 [ 58  41  59 189   1]
 [ 93  77  22 170   3]
 [ 23  43  23 110   4]
 [ 23  43  23 110   3]
 [ 23  43  23 110   2]
 [ 23  43  23 110   1]
 [ 23  43  23 110   0]
 [ 83  32  12 123   3]
 [ 83  32  12 123   2]
 [ 83  32  12 123   1]
 [ 83  32  12 123   0]
 [ 58  41  59 189   0]
 [ 93  77  22 170   2]
 [ 93  77  22 170   1]
 [ 93  77  22 170   0]]

Obviously, the code is not efficient since it doesn't use any Numpy vectorization. In reality, I have more than 4,000 original rows so a speed-up is definitely needed. Moreover, I cannot insert new rows right after each row. Is there any efficient way to do this in Numpy or Pandas?

Shaun Han
  • 2,676
  • 2
  • 9
  • 29

3 Answers3

4

If you are interested in a solution not using any pure-Python loops but Numba instead, then here is one:

import numba as nb

# 2 overloads: one for np.int32 types and one for np.int64 regarding the type of A
@nb.njit(['int32[:,::1](int32[:,::1])', 'int64[:,::1](int64[:,::1])'])
def compute(A):
    n, m = A.shape
    rows = A[:,-1].sum() + n
    res = np.empty((rows, m), dtype=A.dtype)
    row = 0
    for i in range(n):
        count = A[i, -1]
        for j in range(count+1):
            res[row+j, 0:m-1] = A[i, 0:m-1]
            res[row+j, m-1] = count-j
        row += count+1
    return res

result = compute(A)

This solution is 12 times faster than the solution of @sammywemmy on my machine although A is very small. It should be even faster on bigger inputs.

Jérôme Richard
  • 41,678
  • 6
  • 29
  • 59
  • I have tried your code, but it raised `TypeError: No matching definition for argument type(s) array(int64, 2d, C)` – wong.lok.yin Nov 23 '21 at 03:05
  • @jasonwong It should be because the input array is of type int64 for integer by default on your system (possibly a Linux/Mac) while it is int32 generally on Windows. The type int_ may work but I fixed the code to implement the two. Thank you for reporting this. – Jérôme Richard Nov 23 '21 at 08:57
  • 1
    @Jérôme Richard I am using colab, your updated code works now. – wong.lok.yin Nov 23 '21 at 09:30
2
arr = A[:, -1] + 1

temp = np.repeat(A, arr, axis = 0)

# depending on your array size
# you can build the range here with a much faster implementation
from this link : # https://stackoverflow.com/a/47126435/7175713
arr = np.concatenate([np.arange(ent) for ent in arr])

temp[:, -1] = temp[:, -1] - arr

temp

array([[ 23,  43,  23, 110,   5],
       [ 23,  43,  23, 110,   4],
       [ 23,  43,  23, 110,   3],
       [ 23,  43,  23, 110,   2],
       [ 23,  43,  23, 110,   1],
       [ 23,  43,  23, 110,   0],
       [ 83,  32,  12, 123,   4],
       [ 83,  32,  12, 123,   3],
       [ 83,  32,  12, 123,   2],
       [ 83,  32,  12, 123,   1],
       [ 83,  32,  12, 123,   0],
       [ 58,  41,  59, 189,   1],
       [ 58,  41,  59, 189,   0],
       [ 93,  77,  22, 170,   3],
       [ 93,  77,  22, 170,   2],
       [ 93,  77,  22, 170,   1],
       [ 93,  77,  22, 170,   0]])


sammywemmy
  • 27,093
  • 4
  • 17
  • 31
  • Just curious: were you inspired to use `np.repeat` based on my comment on the question (wouldn't mind if you did)? Or did you probably already know to use it... –  Nov 23 '21 at 01:25
2

I've finally figured out a solution:

rep = np.repeat(A, A[:, -1] + 1, axis=0)
rep[:, -1] = np.concatenate([np.arange(0, n+1)[::-1] for n in A[:, -1]])

Output:

>>> rep
array([[ 23,  43,  23, 110,   5],
       [ 23,  43,  23, 110,   4],
       [ 23,  43,  23, 110,   3],
       [ 23,  43,  23, 110,   2],
       [ 23,  43,  23, 110,   1],
       [ 23,  43,  23, 110,   0],
       [ 83,  32,  12, 123,   4],
       [ 83,  32,  12, 123,   3],
       [ 83,  32,  12, 123,   2],
       [ 83,  32,  12, 123,   1],
       [ 83,  32,  12, 123,   0],
       [ 58,  41,  59, 189,   1],
       [ 58,  41,  59, 189,   0],
       [ 93,  77,  22, 170,   3],
       [ 93,  77,  22, 170,   2],
       [ 93,  77,  22, 170,   1],
       [ 93,  77,  22, 170,   0]])