Merge two numpy arrays, drop duplicates, and unmerge it

Question

I have two numpy arrays: X with shape of (n, 16928), and y with shape of (n,1).

I would like to merge the two arrays, drop the duplicated rows if any, and "unmerge" the result back in separate arrays X and y. I was able to merge two arrays, but how do I split them back?

Edit

Here is an example:

X = np.array([
    [1,2,3,4,5,6,7],
    [2,3,4,5,6,7,8],
    [3,4,5,2,1,4,5],
    [1,2,3,4,5,6,7],
    [1,2,3,4,5,6,7],
    [1,2,3,4,5,6,7],
])

y = np.array([
    [2.],
    [3.],
    [4.],
    [2.],
    [3.],
    [4.],
])

Expected result

>>> X, y
(array([[1, 2, 3, 4, 5, 6, 7],
        [2, 3, 4, 5, 6, 7, 8],
        [3, 4, 5, 2, 1, 4, 5],
        [1, 2, 3, 4, 5, 6, 7],
        [1, 2, 3, 4, 5, 6, 7]]),
 array([[2.],
        [3.],
        [4.],
        [3.],
        [4.]]))

As you can see, while there might be duplicated rows in the resulting X, the augmented rows [X | y] are unique. I tried np.unique(np.hstack((X,y))), but this does not return the desired result.

Can you clarify what you mean by "drop duplicates"? Ideally include a smaller [mre] in your question — Pranav Hosangadi, Feb 16 '23 at 22:27
Please [edit](https://stackoverflow.com/posts/75478510/edit) your question to include a [minimum reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) **with test data** demonstrating your problem. — Woodford, Feb 16 '23 at 22:29
Not really, no. your "expected output" still has duplicate rows in them. — Mike 'Pomax' Kamermans, Feb 17 '23 at 02:29
I edited your question in an attempt to clarify -- of course, please revert if I misunderstood your intent. — Pierre D, Feb 17 '23 at 02:44
"but how do I split them back" Well, when you look at the merged data, **what is the rule that tells you** which part came from X and which part came from y? — Karl Knechtel, Feb 17 '23 at 03:12
Thank you @PierreD, yes this is the problem I was facing. Thank you for rephrasing it. — AAM, Feb 17 '23 at 11:09
Hey! Yes it worked!! It literally halved memory consumption lol — AAM, Feb 17 '23 at 21:48

Pierre D · Accepted Answer · 2023-02-17T03:08:45.307

IIUC, we can do a short round trip through pandas, whose .drop_duplicates() drops duplicated rows, keeping the first occurrence of each and preserving the original order. There is np.unique(), but by default it flattens the array and returns the sorted unique elements (costly, often unwanted). (Edit: as per this SO answer, there is a NumPy way to get the unique rows, e.g. np.unique with axis=0, but it still sorts the array, so I still prefer the pandas approach for this.)

Thus:

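import numpy as np
import pandas as pd

# Stack X and y column-wise, drop duplicated rows (keeping the first occurrence),
# then split the last column back off as y.
Xnew, ynew = np.split(
    pd.DataFrame(np.c_[X, y]).drop_duplicates().to_numpy(),
    [-1], axis=-1)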

>>> Xnew
array([[1., 2., 3., 4., 5., 6., 7.],
       [2., 3., 4., 5., 6., 7., 8.],
       [3., 4., 5., 2., 1., 4., 5.],
       [1., 2., 3., 4., 5., 6., 7.],
       [1., 2., 3., 4., 5., 6., 7.]])

>>> ynew
array([[2.],
       [3.],
       [4.],
       [3.],
       [4.]])
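
For reference, here is a rough sketch of the NumPy-only route mentioned in the edit above. It assumes np.unique with axis=0 and return_index=True, and then sorts the returned indices to undo the reordering. The names keep, X_dedup and y_dedup are only illustrative and are not part of the original answer.

```python
import numpy as np

X = np.array([
    [1, 2, 3, 4, 5, 6, 7],
    [2, 3, 4, 5, 6, 7, 8],
    [3, 4, 5, 2, 1, 4, 5],
    [1, 2, 3, 4, 5, 6, 7],
    [1, 2, 3, 4, 5, 6, 7],
    [1, 2, 3, 4, 5, 6, 7],
])
y = np.array([[2.], [3.], [4.], [2.], [3.], [4.]])

# Index of the first occurrence of each unique augmented row [X | y].
_, first_idx = np.unique(np.hstack((X, y)), axis=0, return_index=True)

# np.unique returns the rows in sorted order; sorting the indices
# restores the order in which the rows appeared in the input.
keep = np.sort(first_idx)

# Index the original arrays instead of splitting the merged one,
# so X and y each keep their own dtype.
X_dedup, y_dedup = X[keep], y[keep]
```

Because the original arrays are indexed rather than split from the merged array, X_dedup stays an integer array here, whereas the pandas round trip returns floats for both parts. The merged array is still built temporarily by np.hstack, so this does not avoid that copy.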
