Closest character in a string using python

Question

I am relatively new to Python; I wrote the following code to find the closest character in a string with respect to the index in the queries, and I want to find a way to optimize the code:

Example:

Input string: s = 'adarshravi'

And queries = [2, 4] (these are indexes of the characters whose duplicates are to be found, and the output should have the index of the closest duplicate, and if there are no duplicate characters then the output for that will be -1)

The output for the above queries will be: [0, -1]

Explanation for the output:

For index 2 the character in the string is a there are two other a's in the string, one at 0 index and the other one at index 7, so the closest between the two is the one at the 0'th position, and the character at the 4th index is s which is not repeated in the string so -1

def closest(s, queries):

    s = s.lower()
    listIdx = []

    for i in queries:
        foundidx = []
        srchChr = s[i]

        for j in range(0, len(s)):
            if s[j] == srchChr:
                foundidx.append(j)

        if len(foundidx) < 2:
            listIdx.append(-1)
        else:
            lastIdx = -1
            dist = 0
            foundidx.remove(i)
            for fnditem in foundidx:
                if dist == 0:
                    lastIdx = fnditem
                    dist = abs(fnditem - i)
                else:
                    if abs(fnditem - i) < dist:
                        lastIdx = fnditem
                        dist = abs(fnditem - i)
            listIdx.append(lastIdx)
    return listIdx

Closest to what? The original characters index? or closest to 0? Also, what's your question here? Does the code not do what you expect it to do? What errors do you get? — Capn Jack, Oct 13 '18 at 17:52
@CapnJack Original characters index. The code works fine and not getting any errors, I just want to optimize the code. Like, take less time in getting the output! — Adarsh Ravi, Oct 13 '18 at 17:55
What if the two closests have the same distance, like `'aaa'` and `1`? — Willem Van Onsem, Oct 13 '18 at 18:07

score 3 · Accepted Answer · answered Oct 13 '18 at 18:19

We can construct a list of indexes like:

from itertools import zip_longest

def ranges(k, n):
    for t in zip_longest(range(k-1, -1, -1), range(k+1, n)):
        yield from filter(lambda x: x is not None, t)

this thus generates indices like:

>>> list(ranges(3, 10))
[2, 4, 1, 5, 0, 6, 7, 8, 9]

We can use the above to check the closest character:

def close(text, idx):
    ci = text[idx]
    return next(filter(lambda i: ci == text[i], ranges(idx, len(text))), -1)

This then yields:

>>> close('adarshravi', 0)
2
>>> close('adarshravi', 1)
-1
>>> close('adarshravi', 2)
0
>>> close('adarshravi', 3)
6
>>> close('adarshravi', 4)
-1

closest is then simply the "mapping" of the close function over a list:

from functools import partial

def closest(text, indices):
    return map(partial(close, text), indices)

for example:

>>> list(closest('adarshravi', range(5)))
[2, -1, 0, 6, -1]

Gonzales Gokhan · Answer 2 · 2018-10-13T19:53:17.793

def closest_duplicates(s, queries):
    result = []
    for index in queries:
        result.append(closest_duplicate(s, s[index], index))
    return result

this guy searches for individual items

following code starts 2 indexes: one from start to the left and the other to the right. we dont need to run this loop more than length of string - 1. when they reach the end or first time the character is found, we return the index. if not found ,we return -1

def closest_duplicate(s, letter, index):
    min_distance = -1
    for i in range(1, len(s)):
        left_i = index - i
        right_i = index + i
        if left_i == -1 and right_i == len(s):
            break

        if left_i > -1 and s[left_i] == letter :
            min_distance = left_i
            break
        if right_i < len(s) and s[right_i] == letter:
            min_distance = right_i
            break
    return min_distance

tests are at below

if __name__ == '__main__':
    s = 'adarshravi'
    indexes = [2, 4]
    result = closest_duplicates(s, indexes)
    print(result)
    batman = 'ilovebatmanandbatman'
    indx = [1,2,5,6]
    result = closest_duplicates(batman, indx)
    print(result)
    batman = 'iloveabatmanbatmanandbatman'
    indx = [7]
    result = closest_duplicates(batman, indx)
    print(result)

PMende · Answer 3 · 2018-10-13T19:25:49.607

This gets the indices of all characters-of-interest before we start looking for the closest matches. We can then avoid redundant calculations, and also do simple look-ups in the case where a character only occurs once or twice:

from collections import defaultdict
my_str = 'shroijsfrondhslmbs'
query = [4, 2, 11]

def closest_matches(in_str, query):
    closest = []
    character_positions = defaultdict(list)
    valid_chars = {in_str[idx] for idx in query}
    for i, character in enumerate(in_str):
        if character not in valid_chars:
            continue
        character_positions[character].append(i)
    for idx in query:
        char = in_str[idx]
        if len(character_positions[char]) is 1:
            closest.append(-1)
            continue
        elif len(character_positions[char]) is 2:
            closest.append(next(idx_i for idx_i in character_positions[char] if idx_i is not idx))
            continue
        shortest_dist = min(abs(idx_i - idx) for idx_i in character_positions[char] if idx_i is not idx)
        closest_match = next(idx_i for idx_i in character_positions[char] if abs(idx_i - idx) == shortest_dist)
        closest.append(closest_match)
    return closest

closest_matches(my_str, query)

Output: [-1, 8, -1]

s = 'adarshravi'
queries = [2, 4]
closest_matches(s, queries)

Output: [0, -1]

Some timings:

%timeit closest_matches(my_str, query)

Results: 8.98 µs ± 30.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Comparing to Willem's answer:

%timeit list(closest(my_str, query))

Results: 55.8 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Comparing to your original answer:

%timeit closest(my_str, query)

Results: 11.4 µs ± 352 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

So you're already doing pretty well!

score 1 · Answer 4 · answered Oct 13 '18 at 19:11

This works by creating tuples with indexes, then comparing the abs value of the difference of the two indexes if the char in the tuple are the same. When creating s_lst the tuples from queries are left out to avoid matching with itself

s = 'adarshravi'
queries = [2, 4]
queries = [(i, s[i]) for i in queries]

s_lst = [(i, v) for i, v in enumerate(s) if any(v in x for x in queries)]
s_lst = [i for i in s_lst if not any(i[0] in x for x in queries)]

res = []
for i in queries:
    if not any(i[1] in x for x in s_lst):
        res.append(-1)
    else:
        close = None
        for j in s_lst:
            if j[1] == i[1] and close == None:
                close = j
            elif abs(j[0] - i[0]) < abs(close[0] - i[0]):
                close = j
        res.append(close[0])

print(res)
# [0, -1]

Edgar Hernandez · Answer 5 · 2018-10-13T23:08:46.257

It is extremely probable the there is a more optimal solution to this problem that the one I reach below, but I wanted to show how I would approach the optimization of this code if I was assigned to this task. Also, I haven't executed any part of the code so you may find some syntactic errors on it.

============================================================================

Lets say that len(s) == n, and len(queries) == m.

Your current code is doing the following:

For each query, q:
  1. find the character of the query, c
  2. find the indices of other characters in the string that match c
  3. find the closest index to the original index with the same character as the original index

Steps 1-3 are executed m times, because there are m queries. And steps 2 and 3 have to iterate over the whole string s (in worst case your string s consist of the same character), so it executes n steps.

So you are roughly executing 2n + 1 steps for each query, so in total you are executing roughly (2n + 1) * m steps. This is (almost) what is called the runtime complexity of your algorithm. In big-O notation, the complexity would be: O(n*m).

Lets extract steps 2 and 3 into its own functions:

def findIdxListByPos(s, i):
  foundidx = []
  srchChr = s[i]

  for j in range(0, len(s)):
      if s[j] == srchChr:
        foundidx.append(j)

  return foundIdx

def findClosestIndex(foundidx, i):
  # this is not needed because if the character appeared only once,
  # foundidx will be empty and the "for fnditem in foundidx" will not
  # do anything, so you can remove it
  if len(foundidx) < 2:
      return -1

  lastIdx = -1
  dist = 0
  foundidx.remove(i)

  for fnditem in foundidx:
    if dist == 0:
      lastIdx = fnditem
      dist = abs(fnditem - i)
    else:
      if abs(fnditem - i) < dist:
        lastIdx = fnditem
        dist = abs(fnditem - i)

  return lastIdx

def closest(s, queries):
  s = s.lower()
  listIdx = []

  for i in queries:
    foundidx = findIdxListByPos(s, i)
    lastIdx = findClosestIndex(foundidx, i)

    listIdx.append(lastIdx)

  return listIdx

You can see that in findIdxListByPos, you are always looking at every position in the string.

Now, lets say that you have a case where queries = [1, 1], then your are calculating two times the same foundidx and the same lastIdx. We can save that calculations and reuse them. That is, you save your foundidx and lastIdx inside another variables that are not lost after each query. You can do that in a dictionary with the character of the query as the key. If you have already calculated that key, you don't calculate again, just reuse it.

Your code will look like this:

def findIdxListByPos(s, i):
  foundidx = []
  srchChr = s[i]

  for j in range(0, len(s)):
      if s[j] == srchChr:
        foundidx.append(j)

  return foundIdx

def findClosestIndex(foundidx, i):
  lastIdx = -1
  dist = 0
  foundidx.remove(i)

  for fnditem in foundidx:
    if dist == 0:
      lastIdx = fnditem
      dist = abs(fnditem - i)
    else:
      if abs(fnditem - i) < dist:
        lastIdx = fnditem
        dist = abs(fnditem - i)

  return lastIdx

def calculateQueryResult(s, i, allFoundIdx):
  srchChr = s[i]
  if srchChr not in allFoundIdx:
    allFoundIdx[srchChr] = findIdxListByPos(s, i)

  foundidx = allFoundIdx[srchChr]

  return findClosestIndex(foundidx, i)

def closest(s, queries):
  s = s.lower()
  listIdx = []
  allFoundIdx = {}
  queriesResults = {}

  for i in queries:
    if i not in queriesResults:
      queriesResults[i] = calculateQueryResult(s, i, allFoundIdx)

    listIdx.append(queriesResults[i])

return listIdx

This change increases the memory used by your algorithm and changes a little its runtime complexity.

Now, in the worst case you don't have any duplicates in your queries. What happens when you don't have duplicate queries? You have a query for each element in s and all elements in s are distinct!

queries = [0,1,2,...,n] So len(queries) == n, so n == m then your algorithm has now a complexity of O(n*n) = O(n^2)

Now, you can see that in this worst case scenario, your allFoundIdx dictionary will contain all the characters with all the positions in the string. So, memory wise is equivalent to calculating this dictionary upfront for all the values in the string. Calculating all upfront does not improve the runtime complexity but it does not make it worse either.

def findClosestIndex(foundidx, i):
  lastIdx = -1
  dist = 0
  foundidx.remove(i)

  for fnditem in foundidx:
    if dist == 0:
      lastIdx = fnditem
      dist = abs(fnditem - i)
    else:
      if abs(fnditem - i) < dist:
        lastIdx = fnditem
        dist = abs(fnditem - i)

  return lastIdx

def calculateAllFoundIdx(s):
  allFoundIdx = {}
  for i in range(0, len(s)):
    srchChr = s[i]

    # you should read about the get method of dictionaries. This will 
    # return an empty list if there is no value for the key srchChr in the
    # dictionary 
    allFoundIdx[srchChr] = allFoundIdx.get(srchChr, []).append(i)

  return allFoundIdx

def closest(s, queries):
  s = s.lower()
  listIdx = []
  queriesResults = {}

  # this has complexity O(n)
  allFoundIdx = calculateAllFoundIdx(s)

  # this still has complexity O(n^2) because findClosestIndex still has O(n)
  # the for loop executes it n times
  for i in queries:
    if i not in queriesResults:
      srchChr = s[i]
      foundidx = allFoundIdx[srchChr]
      queriesResults[i] = findClosestIndex(foundidx, i)

    listIdx.append(queriesResults[i])

return listIdx

This algorithm is still O(n^2) but now you just need to optimize the findClosestIndex function, as there is no way to not iterate over all the queries.

So, in findClosestIndex you are getting as parameters a list of numbers (the positions of some character in the original string) that is ordered incrementally (because of the way it was constructed), and another number of which you want to find the closest (this number is guaranteed to be included in the list).

The closest number (because the list is ordered) has to be the previous or the next one in the list. Any other number is "farther away" that these two.

So basically, you want to find the index of this number in the list and then the previous and next elements in the list, and compare their distances and return the smallest.

To find a number in an ordered list, you use binary search and you just have to be careful with the indices to obtain your final result:

def binSearch(foundidx, idx):
  hi = len(foundidx) - 1
  lo = 0

  while lo <= hi:
    m = (hi + lo) / 2
    if foundidx[m] < idx:
      lo = m + 1
    elif found[m] > idx:
      hi = m - 1
    else:
      return m

 # should never get here as we are sure the idx is in foundidx
 return -1 

def findClosestIndex(foundidx, idx):
  if len(foundidx) == 1:
    return -1

  pos = binSearch(foundidx, idx)

  if pos == 0:
    return foundidx[pos + 1]

  if pos == len(foundidx) - 1:
    return foundidx[pos - 1]

  prevDist = abs(foundidx[pos - 1] - idx)
  postDist = abs(foundidx[pos + 1] - idx)

  if prevDist <= postDist:
    return pos - 1

  return pos + 1

def calculateAllFoundIdx(s):
  allFoundIdx = {}
  for i in range(0, len(s)):
    srchChr = s[i]

    # you should read about the get method of dictionaries. This will 
    # return an empty array if there is no value for the key srchChr in the
    # dictionary 
    allFoundIdx[srchChr] = allFoundIdx.get(srchChr, []).append(i)

  return allFoundIdx

def closest(s, queries):
  s = s.lower()
  listIdx = []
  queriesResults = {}

  # this has complexity O(n)
  allFoundIdx = calculateAllFoundIdx(s)

  # this has now complexity O(n*log(n)) because findClosestIndex now has O(log(n))
  for i in queries:
    if i not in queriesResults:
      srchChr = s[i]
      foundidx = allFoundIdx[srchChr]
      queriesResults[i] = findClosestIndex(foundidx, i)

    listIdx.append(queriesResults[i])

  return listIdx

Now findClosestIndex has a complexity of O(log(n)), so closest now has a complexity of O(n*log(n)).

The worst case now is when all the elements in s are the same, and queries = [0, 1, ..., len(s) - 1]

user3713719 · Answer 6 · 2018-10-13T18:13:00.220

-1

s = 'adarshravi'
result = list()
indexes = [2, 4]
for index in indexes:
    c = s[index]
    back = index - 1
    forward = index + 1
    r = -1
    while (back >= 0 or forward < len(s)):
        if back >= 0 and c == s[back]:
            r = back
            break
        if forward < len(s) and c == s[forward]:
            r = forward
            break
        back -= 1
        forward += 1
    result.append(r)

print result

edited Oct 13 '18 at 18:13

answered Oct 13 '18 at 18:06

user3713719

325
3
8

Did you try for any other string? – Adarsh Ravi Oct 13 '18 at 18:18

Closest character in a string using python

6 Answers6