3

from given numpy array [1,2,3,4] and window wz=2 (two elements before and two elements after every element) I have to get pairs (central el, el from window). Pairs with unexisting elements could be skipped or substituted by zero. So on this example I have to get this:

[[1., 0.]
 [2., 1.]
 [3., 2.]
 [4., 3.]
 [1., 2.]
 [2., 3.]
 [3., 4.]
 [4., 0.]
 [1., 0.]
 [2., 0.]
 [3., 1.]
 [4., 2.]
 [1., 3.]
 [2., 4.]
 [3., 0.]
 [4., 0.]]

My implementation is extremely unefficient and looks like:

x = np.array([1,2,3,4])
l = x.shape[0]
for i in range(1, m):
    init = np.empty((x.shape[0]*2,2))
    init[:,0] = np.append(x, x)
    init[:l,1] = np.pad(x, (i,0), mode='constant')[:l]
    init[-l:,1] = np.pad(x, (0,i), mode='constant')[-l:]
    corpus.extend(init)

Could someone help with much more efficient solution? On another simple test data and variants I've implemented I've got:

285 µs ± 19.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
379 µs ± 7.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Ivan Telnov
  • 335
  • 2
  • 10

3 Answers3

2

Here is a Numpythonic approach:

In [23]: a = np.array([1,2,3,4])
In [24]: arr = np.hstack((a-1, a+1, a - 2, a+ 2))
In [25]: mask = ~np.in1d(arr, a)
In [26]: arr[mask] = 0
In [27]: np.column_stack((np.tile(a, 4), arr))
Out[27]: 
array([ [1, 0],
        [2, 1],
        [3, 2],
        [4, 3],
        [1, 2],
        [2, 3],
        [3, 4],
        [4, 0],
        [1, 0],
        [2, 0],
        [3, 1],
        [4, 2],
        [1, 3],
        [2, 4],
        [3, 0],
        [4, 0]])
Mazdak
  • 105,000
  • 18
  • 159
  • 188
1

In case x is some data, like words or random values and we need to recombine it we could use reindexing mechanism in numpy.

Substituted by zero version

x = np.array([1,2,3,4])
wz = 2
zero = 0

Lets build indexing matrix.

ri = np.arange(-wz,wz+1)+np.arange(x.shape[0]).reshape(-1,1)
print(ri) 

Output:

  [[-2, -1,  0,  1,  2],
   [-1,  0,  1,  2,  3],
   [ 0,  1,  2,  3,  4],
   [ 1,  2,  3,  4,  5]

Now if we add zero to x as last element we can replace wrong indexes with it index.

np.place(ri,(ri<0)|(ri>x.shape[0]),x.shape[0]) #replace wrong indexes
np.vstack((
    np.hstack((x,[zero]))[ri].reshape(1,-1),#extending x with zero and reindexing 
    np.tile(x,2*wz+1)) #repeating basic `x` to each window position
    )#.T #uncomment .T to make it vertical   

Output:

 ([[0, 0, 1, 2, 3, 0, 1, 2, 3, 4, 1, 2, 3, 4, 0, 2, 3, 4, 0, 0],
   [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]])

Skipped version

The same idea, but in a slightly different order: produce a complete indexing matrix [window_index,x_index] then exclude the wrong pairs, and finally re-indexing 'x'.

x = np.array([1,2,3,4])
wz = 2
ri = np.vstack((
    (np.arange(-wz,wz+1)+np.arange(x.shape[0]).reshape(-1,1)).ravel(),#same index matrix flaten 
    np.tile(np.arange(x.shape[0]),2*wz+1) #repeating `x` indexes to each window position
    )) 
x[ri[:,(ri[0]>=0)&(ri[0]<x.shape[0])]]#.T #uncomment .T to make it vertical   

Output:

 [[1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4],
  [3, 4, 1, 3, 4, 1, 2, 3, 4, 1, 2, 4, 1, 2]]

Update 1 (error fix) exclude zero from window to avoid pair duplication.

x = np.array([1,2,3,4])
wz = 2
ri = np.vstack(((
        np.hstack(( np.arange(-wz,0), #remove zero from window
                    np.arange(1,wz+1)))+
        np.arange(x.shape[0]).reshape(-1,1)).ravel(), #same index matrix flaten 
    np.tile(np.arange(x.shape[0]),2*wz) #repeating `x` indexes to each window position
    )) 
x[ri[:,(ri[0]>=0)&(ri[0]<x.shape[0])]]#.T #uncomment .T to make it vertical   

Output:

  [[2, 3, 1, 3, 4, 1, 2, 4, 2, 3],
   [3, 4, 2, 3, 4, 1, 2, 3, 1, 2]]

Check documentation on used functions np.arange, np.reshape, np.place, np.hstack, broadcasting rules and indexing.

ilia timofeev
  • 1,109
  • 7
  • 15
  • 1
    Ilia, thanks, your solution is exactly the one I've looked for, which accept random input parameters for initial array and window size. However the output wasn't the same then my. I found the issue in indexing matrix - the initial array used to construct it should not contain zeros [-2 -1 2 1 ] instead of [-2 -1 0 1 2 ] – Ivan Telnov Feb 25 '18 at 13:43
  • @IvanTelnov Did you run performance tests? What is new benchmark? – ilia timofeev Feb 25 '18 at 14:56
  • so my previous (different to one i've gave in question) solution vs one according your approach:304 µs ± 8.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) VS 84.2 µs ± 748 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) – Ivan Telnov Feb 25 '18 at 23:04
  • 1
    Why do optimize so small timing? Maybe we should optimaze bigger scope to have valuable improvement? – ilia timofeev Feb 25 '18 at 23:29
  • Every input array represents the indexes of words in it. So the size of loop could not be larger then just amount of words in very big sentence. – Ivan Telnov Feb 26 '18 at 00:31
  • And unfortunately I cant concatenate all sentences in one big array and then use slide-window function in it, cos window shouldn't take words from different sentences, so in both cases I iterate through given arrays (sentences) and them use sliding-window function. That's why testing in on really big array doesn't make a sense for me. However, the difference of performance really grows with amount of sentences, from 2 times better (as I've shown) to 6 times, but it's rather cumulative performance. Now I think how to change my algorithm to use prospective performance of current solution better – Ivan Telnov Feb 26 '18 at 00:32
0

The numpy approach is favorable, but here is a functional approach for those interested:

Given

import functools as ft


# Helper function
def curry(f):
    @ft.wraps(f)
    def wrapped(arg):
        try:
            return f(arg)
        except TypeError:
            return curry(ft.wraps(f)(ft.partial(f, arg)))
    return wrapped

Code

lst = [1, 2, 3, 4]
c = curry(lambda x, y: x + y)
funcs = [c(-1), c(1), c(-2), c(2)]
set_ = set(lst)


[[x, 0] if fn(x) not in set_ else [x, fn(x)] for fn in funcs for x in lst]

Output

[[1, 0],
 [2, 1],
 [3, 2],
 [4, 3],
 [1, 2],
 [2, 3],
 [3, 4],
 [4, 0],
 [1, 0],
 [2, 0],
 [3, 1],
 [4, 2],
 [1, 3],
 [2, 4],
 [3, 0],
 [4, 0]]

Details

In the double for loops of the list comprehension, a list of curried functions is iterated, and each function is applied to every element of the primary list (lst). Currying allows you to compute new values by passing in some argument (e.g. 1, -1, -2, 2) and later passing in an element from the primary list.

Tuples are created, e.g (primary element, computed element). The conditional part of the list comprehension, substitutes 0 for computed elements not found in primary list.

See also this implementation of the curry function.

pylang
  • 40,867
  • 14
  • 129
  • 121