1

I have a question about searching through arrays. I need to find some string sequence which I have stored in array and forexample it might look like this array1:

['818181' '747473' '747474' '636363' '767676' '737373' '727373' '373838'
 '697070' '686869' '115115115' '737474' '757575' '777777' '818181' '747473'
 '747474' '636363' '767676' '737373' '727373' '757575' '696969']

It's a numpy array with strings. Dtype shows it is S9. Then I have another main array with same structure, but much more bigger. I am looking for most effective way to find position where array1 begins in main array so it is like I am looking for specified pattern in numpy array. These value repeat and I need to find exact same one. I have been looking for best solutions for this but I couldn't find anything what would help. Main array is very big and I need to get array1 position in under 1s. I found some sample scripts on finding sequence in array unfortunately none of this helped me. Mostly they were finding some integer values in small arrays. I need some advice.

I tried going through whole array with for i, e in enumerate() So one element look like this - '818181'. Then I was counting if 23 elements in a row(in this example) are the same. But when 5th element was wrong then I would have to go where I found 1st to get 100% successful(because patters can go on each other) and it was very slow.

Main array is like array1 but 1000x and has more values

2 Answers2

0

You can loop through each value in array1 and use np.where() to get the indices of the values in your main array. Add the indices to a list then sort the list. Then find the consecutive length of indices that matches the length of array 1.

For example:

def consecutive(data, stepsize=1):
    return np.split(data, np.where(np.diff(data) != stepsize)[0]+1)

index_list = []
for val in array1:
    index_list.extend(list(np.where(main_array == val)))
index_list.sort()

for sequence in consecutive(index_list):
    if len(sequence) == len(array1):
        print(sequence)

Credits to @unutbu from how to find the groups of consecutive elements from an array in numpy? for the consecutive function.

-1

Not at all; look again. When you reach the 5th element, you already know that elements 2, 3, 4 are not the first element, so you simply shift to start over at the non-matching element.

This is a well-known problem in grammars that can be handled with a finite-state machine.

First don't worry about the content of your strings; all that matters is that you have a sequence of symbols to find. Each "number" string is a different symbol. FOr convenience, let's map as follows:

'818181' => a
'747473' => b
'747474' => c
etc.

So the array can be reduced to something like this:

 '818181' '747473' '747474' '636363' '767676' '737373' '727373' '373838'
  a        b        c        d        e        f        g        h
 '697070' '686869' '115115115' '737474' '757575' '777777' '818181' '747473'
  i        j        k           l        m        n        a        b
 '747474' '636363' '767676' '737373' '727373' '757575' '696969']
  c        d        e        f        g        m        o

Or, as a one-line sequence:

  abcdefghijklmnabcdefgmo

In the case you noted, a mismatch at b, we do not have to back up to the b postion of the input and start over; we've already identified that bcd match, and they're not a, so we don't back up: we simply begin again, comparing a to the item that didn't match.

As it happens, we never have to back up. At worst, we will resume checking in the location where the match failed, but not at the start of our target string. There is one tricky case we have to handle: the mid-string match.

Consider what happens when we have a mismatch at the second m, near the end of the target sequence. In this case, we know that we've just matched abcdefg, but the current symbol is not m ... but if might be h. To avoid backing up, we take advantage of the partial match, and restart our checking with h.

To handle this algorithm, you need to do some pre-processing on your target string. Make a second array, holding the restart index for each position in the target string. You do this with a simple check of where it deviates from the front of itself. For your example, it's simple: that o is the only place where the main string and the shifted string match for several characters, but differ at this location.

  abcdefghijklmnabcdefgmo
  11111111111111111111181

Does that get you moving?

Prune
  • 76,765
  • 14
  • 60
  • 81