
I have two numpy arrays: X with shape of (n, 16928), and y with shape of (n,1).

I would like to merge the two arrays, drop the duplicated rows if any, and "unmerge" the result back in separate arrays X and y. I was able to merge two arrays, but how do I split them back?

Edit

Here is an example:

X = np.array([
    [1,2,3,4,5,6,7],
    [2,3,4,5,6,7,8],
    [3,4,5,2,1,4,5],
    [1,2,3,4,5,6,7],
    [1,2,3,4,5,6,7],
    [1,2,3,4,5,6,7],
])

y = np.array([
    [2.],
    [3.],
    [4.],
    [2.],
    [3.],
    [4.],
])

Expected result

>>> X, y
(array([[1, 2, 3, 4, 5, 6, 7],
        [2, 3, 4, 5, 6, 7, 8],
        [3, 4, 5, 2, 1, 4, 5],
        [1, 2, 3, 4, 5, 6, 7],
        [1, 2, 3, 4, 5, 6, 7]]),
 array([[2.],
        [3.],
        [4.],
        [3.],
        [4.]]))

As you can see, while there might be duplicated rows in the resulting X, the augmented rows [X | y] are unique. I tried np.unique(np.hstack((X,y))), but this does not return the desired result.

Pierre D
AAM

1 Answer

IIUC, we need to do a little roundtrip via pandas, whose .drop_duplicates() drops duplicated rows while keeping the first occurrence of each row in its original position. While there is a function np.unique(), by default it flattens its input and works element-wise, and it sorts the result (costly, often unwanted). (Edit: as per this SO answer, there is a numpy way to get the unique rows, but it still sorts the array, so I still prefer the pandas approach for this.)

Thus:

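import numpy as np
import pandas as pd

# Join X and y column-wise, drop duplicated rows (first occurrence kept,
# original order preserved), then split the last column back off as y.
Xnew, ynew = np.split(
    pd.DataFrame(np.c_[X, y]).drop_duplicates().to_numpy(),
    [-1], axis=-1)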

>>> Xnew
array([[1., 2., 3., 4., 5., 6., 7.],
       [2., 3., 4., 5., 6., 7., 8.],
       [3., 4., 5., 2., 1., 4., 5.],
       [1., 2., 3., 4., 5., 6., 7.],
       [1., 2., 3., 4., 5., 6., 7.]])

>>> ynew
array([[2.],
       [3.],
       [4.],
       [3.],
       [4.]])
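
Note that Xnew comes back as float, because joining X with the float y upcasts the whole array. If pandas is not an option, here is a numpy-only sketch that might do the same job. It assumes np.unique(..., axis=0, return_index=True) is available (numpy 1.13 or later), and the names Xy, first_idx, keep, X_dedup and y_dedup are only illustrative. The idea is to ask np.unique for the index of the first occurrence of each unique row, sort those indices to undo the sorting, and index the original arrays with them:

```python
import numpy as np

X = np.array([
    [1, 2, 3, 4, 5, 6, 7],
    [2, 3, 4, 5, 6, 7, 8],
    [3, 4, 5, 2, 1, 4, 5],
    [1, 2, 3, 4, 5, 6, 7],
    [1, 2, 3, 4, 5, 6, 7],
    [1, 2, 3, 4, 5, 6, 7],
])
y = np.array([[2.], [3.], [4.], [2.], [3.], [4.]])

# Stack X and y side by side; the result is upcast to float64.
Xy = np.hstack((X, y))

# axis=0 makes np.unique compare whole rows; return_index gives the
# position of the first occurrence of each unique row.
_, first_idx = np.unique(Xy, axis=0, return_index=True)

# np.unique returns the rows in sorted order, so sort the indices to
# restore the original row order.
keep = np.sort(first_idx)

# Index the original arrays, so each one keeps its own dtype.
X_dedup, y_dedup = X[keep], y[keep]
```

Since the original arrays are indexed directly, X_dedup stays integer here. The sort inside np.unique is still paid for, so on a large X this is not necessarily faster than the pandas roundtrip.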
Pierre D