Delete repeated columns of array keeping the order

Question

Is there a relatively simple way of removing repeated columns of a (NumPy) array while keeping the order of the remaining columns?

As an example, consider this array:

a = np.array([[2, 1, 1, 3],
              [2, 1, 1, 3]])

where I would like the third column (a repeat of the second) to be removed, such that:

a = np.array([[2, 1, 3],
              [2, 1, 3]])

@Thomas Hope the edits look okay. – Divakar, Sep 28 '16 at 09:33

score 1 · Accepted Answer · edited May 23 '17 at 11:53

Approach #1 Here's an approach using broadcasting -

a[:,~np.triu((a[:,None,:] == a[...,None]).all(0),1).any(0)]

Sample run -

In [115]: a
Out[115]: 
array([[2, 1, 3, 5, 1, 3, 7],
       [6, 5, 4, 6, 5, 4, 8]])

In [116]: a[:,~np.triu((a[:,None,:] == a[...,None]).all(0),1).any(0)]
Out[116]: 
array([[2, 1, 3, 5, 7],
       [6, 5, 4, 6, 8]])

Explanation

1) Input array -

In [156]: a
Out[156]: 
array([[2, 1, 3, 5, 1, 3, 7],
       [6, 5, 4, 6, 5, 4, 8]])

2) Use broadcasting to perform an elementwise equality comparison of every column against every other column, keeping the first axis (the rows of the original 2D array) aligned -

In [157]: a[:,None,:] == a[...,None]
Out[157]: 
array([[[ True, False, False, False, False, False, False],
        [False,  True, False, False,  True, False, False],
        [False, False,  True, False, False,  True, False],
        [False, False, False,  True, False, False, False],
        [False,  True, False, False,  True, False, False],
        [False, False,  True, False, False,  True, False],
        [False, False, False, False, False, False,  True]],

       [[ True, False, False,  True, False, False, False],
        [False,  True, False, False,  True, False, False],
        [False, False,  True, False, False,  True, False],
        [ True, False, False,  True, False, False, False],
        [False,  True, False, False,  True, False, False],
        [False, False,  True, False, False,  True, False],
        [False, False, False, False, False, False,  True]]], dtype=bool)
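To see why this comparison yields one 7x7 matrix per row, it can help to look at the shapes involved. The following is only a minimal sketch on the same sample array; the names `left`, `right`, and `eq` are made up here for illustration.

```python
import numpy as np

a = np.array([[2, 1, 3, 5, 1, 3, 7],
              [6, 5, 4, 6, 5, 4, 8]])

left = a[:, None, :]    # shape (2, 1, 7): column index sits on the last axis
right = a[..., None]    # shape (2, 7, 1): column index sits on the middle axis
eq = left == right      # broadcasts to shape (2, 7, 7)

print(left.shape, right.shape, eq.shape)
# eq[r, i, j] is True when a[r, i] == a[r, j]
print(eq[0, 1, 4], eq[1, 0, 3])
```

Each `eq[r]` is therefore a column-versus-column comparison table for row `r` alone.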

3) Since we are looking for duplicate cols, let's look for ALL matches along the first axis -

In [158]: (a[:,None,:] == a[...,None]).all(0)
Out[158]: 
array([[ True, False, False, False, False, False, False],
       [False,  True, False, False,  True, False, False],
       [False, False,  True, False, False,  True, False],
       [False, False, False,  True, False, False, False],
       [False,  True, False, False,  True, False, False],
       [False, False,  True, False, False,  True, False],
       [False, False, False, False, False, False,  True]], dtype=bool)

4) We are looking to keep the first occurrence only, so we can use an upper triangular matrix to set all diagonal and lower triangular elements to False -

In [163]: np.triu((a[:,None,:] == a[...,None]).all(0),1)
Out[163]: 
array([[False, False, False, False, False, False, False],
       [False, False, False, False,  True, False, False],
       [False, False, False, False, False,  True, False],
       [False, False, False, False, False, False, False],
       [False, False, False, False, False, False, False],
       [False, False, False, False, False, False, False],
       [False, False, False, False, False, False, False]], dtype=bool)

5) Next up, we look for ANY matches along the first axis indicating the duplicates -

In [164]: (np.triu((a[:,None,:] == a[...,None]).all(0),1)).any(0)
Out[164]: array([False, False, False, False,  True,  True, False], dtype=bool)

6) We are looking to remove those duplicates, so invert the mask -

In [165]: ~(np.triu((a[:,None,:] == a[...,None]).all(0),1)).any(0)
Out[165]: array([ True,  True,  True,  True, False, False,  True], dtype=bool)

7) Finally, we index into the columns of input array with the mask for final output -

In [166]: a[:,~(np.triu((a[:,None,:] == a[...,None]).all(0),1)).any(0)]
Out[166]: 
array([[2, 1, 3, 5, 7],
       [6, 5, 4, 6, 8]])
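The steps above can be collected into a small helper. This is just a sketch of the same one-liner split into named stages; the function name `drop_repeated_columns` and the intermediate names are hypothetical, not part of NumPy.

```python
import numpy as np

def drop_repeated_columns(a):
    """Return `a` without columns that repeat an earlier column."""
    # same[i, j] is True when columns i and j are equal in every row
    same = (a[:, None, :] == a[..., None]).all(0)
    # Keep only pairs with i < j, then flag column j if an earlier column matches it
    is_repeat = np.triu(same, 1).any(0)
    return a[:, ~is_repeat]

a = np.array([[2, 1, 3, 5, 1, 3, 7],
              [6, 5, 4, 6, 5, 4, 8]])
print(drop_repeated_columns(a))
```

Note that the intermediate boolean array has shape (rows, cols, cols), so memory use grows with the square of the number of columns.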

Approach #2 With a focus on memory efficiency, and possibly faster as well, here's an approach that treats each column as an indexing tuple -

lidx = np.ravel_multi_index(a,a.max(1)+1)
out = a[:,np.sort(np.unique(lidx,return_index=1)[1])]

Explanation

1) Input array -

In [203]: a
Out[203]: 
array([[2, 1, 3, 5, 1, 3, 7],
       [6, 5, 4, 6, 5, 4, 8]])

2) Calculate linear index equivalents for each column -

In [207]: lidx = np.ravel_multi_index(a,a.max(1)+1)

In [208]: lidx
Out[208]: array([24, 14, 31, 51, 14, 31, 71])
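For a two-row array, the linear index is plain arithmetic: the first row times the size of the second dimension, plus the second row. A small sketch to check this against the values above (`dims` and `manual` are names introduced here only for illustration):

```python
import numpy as np

a = np.array([[2, 1, 3, 5, 1, 3, 7],
              [6, 5, 4, 6, 5, 4, 8]])

dims = a.max(1) + 1                    # one more than each row's maximum
lidx = np.ravel_multi_index(a, dims)

# With two rows, each column (r0, r1) maps to r0 * dims[1] + r1
manual = a[0] * dims[1] + a[1]

print(dims)
print(lidx)
print(np.array_equal(lidx, manual))
```

Two columns get the same linear index exactly when they are equal in every row, which is what makes the index usable for duplicate detection.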

3) Get the first occurrence of each unique linear index -

In [209]: np.unique(lidx,return_index=1)[1]
Out[209]: array([1, 0, 2, 3, 6])

4) Sort those and index into the columns of the input array for the final output -

In [210]: np.sort(np.unique(lidx,return_index=1)[1])
Out[210]: array([0, 1, 2, 3, 6])

In [211]: a[:,np.sort(np.unique(lidx,return_index=1)[1])]
Out[211]: 
array([[2, 1, 3, 5, 7],
       [6, 5, 4, 6, 8]])

For detailed info on the considerations related to converting to indexing tuples, please refer to this post.
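As a rough sketch, Approach #2 could be wrapped in a function with an explicit input check, since `np.ravel_multi_index` works on non-negative integer indices. The function name `drop_repeated_columns_lidx` and the error message are assumptions made for this example, and the sketch expects a non-empty 2D array.

```python
import numpy as np

def drop_repeated_columns_lidx(a):
    """Drop repeated columns by mapping each column to one linear index."""
    a = np.asarray(a)
    # Assumption of this sketch: only non-negative integers are accepted
    if not np.issubdtype(a.dtype, np.integer) or a.min() < 0:
        raise ValueError("expected an array of non-negative integers")
    lidx = np.ravel_multi_index(a, a.max(1) + 1)
    # First occurrence of each distinct column, put back into original order
    first = np.sort(np.unique(lidx, return_index=True)[1])
    return a[:, first]

a = np.array([[2, 1, 3, 5, 1, 3, 7],
              [6, 5, 4, 6, 5, 4, 8]])
print(drop_repeated_columns_lidx(a))
```

With many rows or large values, the range of possible linear indices can become very large, so this route suits arrays with few rows and modest values best.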

This seems to work rather well on my examples. Do you mind explaining the line? (I am still a beginner in terms of Python.) — Thomas, Sep 28 '16 at 09:45
