Finding unique values in each row

Question

I have an array of strings, each of length 2, and want to get the unique strings in each row.

import numpy as np
np.__version__
# '1.19.2'
arr = np.array([['Z7', 'Q4', 'Q4'], # 2 unique strings
                ['Q4', 'Z7', 'Q4'], # 2 unq strings
                ['Q4', 'Z7', 'Z7'], # 2 unq strings
                ['Z7', 'Z7', 'Q4'], # 2 unq strings
                ['D8', 'D8', 'L1'], # 2 unq strings
                ['L1', 'L1', 'D8']], dtype='<U2') # 2 unq strings

It is guaranteed that every row contains the same number of unique strings; in my case that number is 2.

Expected output:

array([['Q4', 'Z7'],
       ['Q4', 'Z7'],
       ['Q4', 'Z7'],
       ['Q4', 'Z7'],
       ['D8', 'L1'],
       ['D8', 'L1']], dtype='<U2')

Here, each row is sorted, but it doesn't have to be. Either way is fine.

My code:

np.apply_along_axis(np.unique, 1, arr)

# array([['Q4', 'Z7'],
#        ['Q4', 'Z7'],
#        ['Q4', 'Z7'],
#        ['Q4', 'Z7'],
#        ['D8', 'L1'],
#        ['D8', 'L1']], dtype='<U2')

I thought np.unique over axis 1 would give the expected result, but:

np.unique(arr, axis=1)
# array([['Q4', 'Q4', 'Z7'],
#        ['Q4', 'Z7', 'Q4'],
#        ['Z7', 'Z7', 'Q4'],
#        ['Q4', 'Z7', 'Z7'],
#        ['L1', 'D8', 'D8'],
#        ['D8', 'L1', 'L1']], dtype='<U2')

I couldn't understand what exactly happened and why it returned this exact output.

Stefan · Answer 1 · 2020-11-27T11:21:41.123

2

That is because, when an axis is given, numpy.unique treats each row (axis=0) or each column (axis=1) as a single element and returns the unique rows or unique columns, rather than the unique values themselves. Take a look at this example:

a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
np.unique(a, axis=0)

The output is:

array([[1, 0, 0], [2, 3, 4]])

and

b = np.array([[1, 1, 0], [1, 1, 0], [2, 2, 4]])
np.unique(b, axis=1)

The output is:

array([[0, 1],
       [0, 1],
       [4, 2]])

In your case you want the unique values within each row, so you should use np.apply_along_axis as you already do. Passing axis=1 does not remove anything here because your columns are all distinct; it only returns them in sorted order.
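
The difference between the two calls can be checked with a small sketch. It reuses the array b from above; the names unique_columns and unique_per_row are only illustrative, and the per-row call yields a rectangular array only because every row of b happens to have exactly two distinct values.

```python
import numpy as np

b = np.array([[1, 1, 0], [1, 1, 0], [2, 2, 4]])

# axis=1: whole columns are compared, so the duplicate column (1, 1, 2) is dropped
unique_columns = np.unique(b, axis=1)

# per-row uniques: np.unique is applied to each row separately
unique_per_row = np.apply_along_axis(np.unique, 1, b)

print(unique_columns)
print(unique_per_row)
```

The two results agree in the first two rows and differ in the last one, where the column-wise call keeps the column order (4, 2) while the per-row call sorts the row's own values (2, 4).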

edited Nov 27 '20 at 11:21

answered Nov 27 '20 at 11:15

Stefan


Thank you. Now I can make sense of the output given by `np.unique(arr, axis=1)`. Is there any better way than `np.apply_along_axis`? – Ch3steR Nov 27 '20 at 12:14
I don't really know why you'd want a different solution as this works really well? – Stefan Nov 27 '20 at 14:08
`np.apply_along_axis` is slow; it's just a `for` loop under the hood, not vectorized. [`More details here`](https://stackoverflow.com/questions/23849097/numpy-np-apply-along-axis-function-speed-up) – Ch3steR Nov 27 '20 at 14:20

score 1 · Accepted Answer · answered Nov 27 '20 at 12:04

The documentation of np.unique, in the description of the axis parameter, contains the following statement:

... subarrays indexed by the given axis will be flattened and treated as the elements of a 1-D array

So if you call np.unique, passing axis=1, then:

Each column is flattened (each column already holds "atomic" values, so nothing changes).
Unique elements are then found in the resulting list of columns. If two columns were identical, only one of them would be retained.
The result may come back in a different column order, because np.unique returns its unique elements sorted.
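
The steps above can be emulated by hand. This is only an illustrative sketch, not how NumPy implements np.unique internally; the names columns, unique_columns and emulated are made up for the example, and arr is the array from the question.

```python
import numpy as np

arr = np.array([['Z7', 'Q4', 'Q4'],
                ['Q4', 'Z7', 'Q4'],
                ['Q4', 'Z7', 'Z7'],
                ['Z7', 'Z7', 'Q4'],
                ['D8', 'D8', 'L1'],
                ['L1', 'L1', 'D8']], dtype='<U2')

# Treat every column as one element: iterating over arr.T yields the columns of arr.
columns = [tuple(col) for col in arr.T]

# Drop duplicate columns and sort the rest; tuples compare element by element.
unique_columns = sorted(set(columns))

# Put the surviving columns back side by side.
emulated = np.array(unique_columns, dtype=arr.dtype).T

# Compare the hand-built result with what np.unique returns for axis=1.
print(np.array_equal(emulated, np.unique(arr, axis=1)))
```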

A note on why each column (and not each row) is processed: axis 1 indexes the columns.

To confirm that each column is the processed object in this case, define the source array as:

arr_2 = np.array([['Z7', 'Q4', 'Q4', 'Q4'],
                  ['Q4', 'Z7', 'Q4', 'Q4'],
                  ['Q4', 'Z7', 'Z7', 'Z7'],
                  ['Z7', 'Z7', 'Q4', 'Q4'],
                  ['D8', 'D8', 'L1', 'L1'],
                  ['L1', 'L1', 'D8', 'D8']])

where the last two columns are identical.

When you execute np.unique(arr_2, axis=1), the result is exactly the same as for arr: the last two columns are identical, so one of them is eliminated.
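
A short check of this claim, as a sketch: arr is the array from the question, arr_2 is the array defined above, and the names res and res_2 are only illustrative.

```python
import numpy as np

arr = np.array([['Z7', 'Q4', 'Q4'],
                ['Q4', 'Z7', 'Q4'],
                ['Q4', 'Z7', 'Z7'],
                ['Z7', 'Z7', 'Q4'],
                ['D8', 'D8', 'L1'],
                ['L1', 'L1', 'D8']], dtype='<U2')

arr_2 = np.array([['Z7', 'Q4', 'Q4', 'Q4'],
                  ['Q4', 'Z7', 'Q4', 'Q4'],
                  ['Q4', 'Z7', 'Z7', 'Z7'],
                  ['Z7', 'Z7', 'Q4', 'Q4'],
                  ['D8', 'D8', 'L1', 'L1'],
                  ['L1', 'L1', 'D8', 'D8']])

res = np.unique(arr, axis=1)      # three distinct columns, nothing dropped
res_2 = np.unique(arr_2, axis=1)  # four columns in, the duplicate one is dropped

print(res_2.shape)
print(np.array_equal(res, res_2))
```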

Your explanation hit home for me. Thank you. Is there any better way than `np.apply_along_axis`? — Ch3steR, Nov 27 '20 at 12:14
I think, there is no better approach than *apply_along_axis*. — Valdi_Bo, Nov 27 '20 at 12:16
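
For readers with the same question, one possible vectorized alternative is sketched below. It does not come from either answer, it relies on the question's guarantee that every row has the same number of unique strings, and whether it is actually faster than np.apply_along_axis would have to be measured on real data. The names sorted_rows, keep and unique_per_row are illustrative.

```python
import numpy as np

arr = np.array([['Z7', 'Q4', 'Q4'],
                ['Q4', 'Z7', 'Q4'],
                ['Q4', 'Z7', 'Z7'],
                ['Z7', 'Z7', 'Q4'],
                ['D8', 'D8', 'L1'],
                ['L1', 'L1', 'D8']], dtype='<U2')

# Sort each row so that equal strings end up next to each other.
sorted_rows = np.sort(arr, axis=1)

# Keep the first element of every row, plus each element that differs
# from its left neighbour in the sorted row.
keep = np.ones(sorted_rows.shape, dtype=bool)
keep[:, 1:] = sorted_rows[:, 1:] != sorted_rows[:, :-1]

# Boolean indexing flattens the result; the reshape is only valid because
# every row is assumed to have the same number of unique values.
unique_per_row = sorted_rows[keep].reshape(len(arr), -1)

print(unique_per_row)
```

If the rows did not all have the same number of unique values, the reshape would either fail or silently mix values from different rows, so this sketch should not be used without that guarantee.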
