3
train_subset.iloc[:10, :10]
1  0  7045    17    16.0    0.0    0.0   0.0   0.0  18.0
1  0  8907   137    48.0   42.0   53.0  32.0  35.0  30.0
1  0   917   167   193.0   88.0   48.0   0.0  24.0   0.0
1  0   203  5303  2073.0  153.0  108.0  97.0  73.0  89.0
1  0   198   817   198.0    0.0    0.0  88.0  93.0  70.0

I have a dataset called train. I've printed out a small subset of it here.

Is there an efficient way to just REMOVE all zeroes from the dataset, updating the columns by shifting to the left? For example, my desired output would be...

Please note, I want this to only trigger starting on the 4th column. The first, second, and third column are meta-data and should not be changed.

train_subset.iloc[:10, :10]
1  0  7045    17    16.0   18.0
1  0  8907   137    48.0   42.0   53.0  32.0  35.0  30.0
1  0   917   167   193.0   88.0   48.0  24.0 
1  0   203  5303  2073.0  153.0  108.0  97.0  73.0  89.0
1  0   198   817   198.0   88.0   93.0  70.0

Essentially, you can think of each row as a time-series, and I want to disregard any zeroes. So I only have readings at timepoints where events occur, and I've removed all zeroes between.

I can figure out how to remove zeroes and replace with NA's or whatever, but I cannot figure out how to forcibly shift each row to the left.

Jibril
  • 967
  • 2
  • 11
  • 29
  • Try this -> https://stackoverflow.com/a/31206033/9455902 – Piotr Żak Mar 21 '21 at 13:49
  • "Replace the zeros with nan and then drop the rows with all entries as nan. After that replace nan with zeros." I don't want to drop rows with NAN entries though. I just want to delete the 0 (NAN) and then shift everything column-wise left. – Jibril Mar 21 '21 at 13:52

1 Answers1

6

np.argsort + boolean masking

s = df.iloc[:, 3:]
i = np.argsort(s.eq(0), axis=1)
df.iloc[:, 3:] = np.take_along_axis(s[s.ne(0)].values, i.values, axis=1)

Explanations

Slice the portion of given dataframe starting from the fourth column onward then compare it with 0 to create a boolean mask. The idea behind creating the boolean mask is that when we sort the mask in the next step, then the False values will come first while also maintaining the relative order which essentially provide a shift like behaviour.

>>> s.eq(0)

       3      4      5      6      7      8      9
0  False  False   True   True   True   True  False
1  False  False  False  False  False  False  False
2  False  False  False  False   True  False   True
3  False  False  False  False  False  False  False
4  False  False   True   True  False  False  False

Now use np.argsort on the above mask to get the indices that would sort the values in the slice of dataframe s.

>>> np.argsort(s.eq(0), axis=1)

   3  4  5  6  7  8  9
0  0  1  6  2  3  4  5
1  0  1  2  3  4  5  6
2  0  1  2  3  5  4  6
3  0  1  2  3  4  5  6
4  0  1  4  5  6  2  3

Then use np.take_along_axis with the above indices to sort the values in s, and assign these sorted values in the given dataframe df.

>>> np.take_along_axis(s[s.ne(0)].values, i.values, axis=1)

array([[  17.,   16.,   18.,   nan,   nan,   nan,   nan],
       [ 137.,   48.,   42.,   53.,   32.,   35.,   30.],
       [ 167.,  193.,   88.,   48.,   24.,   nan,   nan],
       [5303., 2073.,  153.,  108.,   97.,   73.,   89.],
       [ 817.,  198.,   88.,   93.,   70.,   nan,   nan]])

Result

>>> df

   0  1     2       3       4      5      6     7     8     9
0  1  0  7045    17.0    16.0   18.0    NaN   NaN   NaN   NaN
1  1  0  8907   137.0    48.0   42.0   53.0  32.0  35.0  30.0
2  1  0   917   167.0   193.0   88.0   48.0  24.0   NaN   NaN
3  1  0   203  5303.0  2073.0  153.0  108.0  97.0  73.0  89.0
4  1  0   198   817.0   198.0   88.0   93.0  70.0   NaN   NaN
Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53