Python - find index position of first occurrence of a list of strings within a string

Question

I would like to search some text for the index of the first occurrence of any one of a set of strings (say "-->", "--x" or "--XX"). Once one is found, I need to know the start position of the found string and which particular string was found (more specifically, the length of the identified string).

This is what I have so far, but it's not enough. Please help.

arrowlist = {"->x","->","->>","-\","\\-","//--","->o","o\\--","<->","<->o"}
def cxn(line,arrowlist):
   if any(x in line for x in arrowlist):
      print("found an arrow {} at position {}".format(line.find(arrowlist),2))
   else:
      return 0

Maybe regex would be easier, but I'm really struggling since the arrow list could be dynamic and the length of the arrow strings could also be variable.

Thanks!

You have `"-\"` in your list of patterns. Do you want this to match with a literal -\ in your line? If yes, you will have to escape it as such: `"-\\"`. The same will have to be done for all the patterns if they're supposed to get matched literally — entropy, Mar 07 '19 at 07:00

David · Answer 1 · 2019-03-08T13:29:30.657

I like this solution, inspired by this post:

How to use re match objects in a list comprehension

import re

arrowlist = ["xxx->x", "->", "->>", "-\\", "\\-", " // --", "x->o", "-> ->"]

lines = ["xxx->x->->", "-> ->", "xxx->x", "xxxx->o"]

def filterPick(items, search):
    # One (matched text, line index, start index) tuple per line that contains a match.
    return [(m.group(), item_number, m.start()) for item_number, l in enumerate(items) for m in (search(l),) if m]


if __name__ == '__main__':

    # re.escape makes each arrow match literally, even if it contains regex metacharacters.
    searchRegex = re.compile('|'.join(re.escape(a) for a in arrowlist)).search
    x = filterPick(lines, searchRegex)
    print(x)

Output shows:

[('xxx->x', 0, 0), ('->', 1, 0), ('xxx->x', 2, 0), ('x->o', 3, 3)]

In each tuple, the first number is the index of the line in the list and the second is the start index of the match within that line.
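
One caveat with a plain alternation: the first alternative that matches at a position wins, so `->` can shadow `->>`. A minimal sketch that tries longer arrows first and escapes every pattern, assuming the longest arrow at a given position is the one wanted (the helper names `build_arrow_regex` and `first_arrow` are hypothetical, not part of the answer above):

```python
import re

def build_arrow_regex(arrows):
    # Longest patterns first, so "->>" is tried before "->" at the same position.
    ordered = sorted(set(arrows), key=len, reverse=True)
    # re.escape makes every arrow match literally, whatever characters it contains.
    return re.compile("|".join(re.escape(a) for a in ordered))

def first_arrow(line, arrow_regex):
    # Return (arrow, start, length) for the leftmost match, or None if there is none.
    m = arrow_regex.search(line)
    if m is None:
        return None
    return m.group(), m.start(), len(m.group())

arrow_regex = build_arrow_regex(
    ["->x", "->", "->>", "-\\", "\\-", "//--", "->o", "o\\--", "<->", "<->o"]
)
print(first_arrow("hello ->> world", arrow_regex))
print(first_arrow("a <->o b", arrow_regex))
print(first_arrow("no arrows here", arrow_regex))
```

Because the pattern is rebuilt from whatever list is passed in, this should cope with a dynamic arrow list of variable-length strings.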

edited Mar 08 '19 at 13:29

answered Mar 07 '19 at 07:14

Very elegant solution; however, I don't see how the regex can match the full spectrum of the arrowlist at the same time. – Hightower Mar 08 '19 at 01:57
Can you elaborate on that, please? What exactly is the problem? – David Mar 08 '19 at 03:19
Maybe it's my basic skills, but the "searchRegex = re.compile(r'->').search" statement is looking for the '->' string and not for any string in the arrowlist. – Hightower Mar 08 '19 at 04:21
I edited, please check, if it serves your purpose. Do you want every arrow found in a line or just the first out of the list? – David Mar 08 '19 at 15:11
Awesome... In my case there will be only 1 per line so just the first. – Hightower Mar 09 '19 at 21:36
Does this solution fit your purpose then? Or do you need to find another 'o->' for example if an 'x->xx' was already found in one line? – David Mar 10 '19 at 02:12

etherwar · Accepted Answer · 2019-03-19T07:16:11.750

Following your example's logic, this jumped out as the most expedient way of finding the "first" matching arrow and printing its location. However, sets are unordered, so which match comes first is not guaranteed; if you want to preserve order, I would suggest making arrowlist a list instead of a set.

    arrowlist = {"->x","->", "->>", "-\\", "\\-","//--","->o","o\\--","<->","<->o"}
    def cxn(line, arrowlist):
        try:
            result = tuple((x, line.find(x)) for x in arrowlist if x in line)[0]
            print("found an arrow {} at position {} with length {}".format(result[0], result[1], len(result[0])))

        # Remember in general it's not a great idea to use an exception as
        # broad as Exception, this is just for example purposes.
        except Exception:
            return 0

If you're looking for the match that occurs earliest in the provided string (line), you can sort the matches by position like this:

arrowlist = {"->x","->", "->>", "-\\", "\\-","//--","->o","o\\--","<->","<->o"}

def cxn(line, arrowlist):
    try:
        # key first sorts on the position in string then shortest length
        # to account for multiple arrow matches (i.e. -> and ->x)
        result = sorted([(x, line.find(x)) for x in arrowlist if x in line], key=lambda r: (r[1],len(r[0])))[0]
        # if you would like to match the "most complete" (i.e. longest-length) word first use:
        # result = sorted([(x, line.find(x)) for x in arrowlist if x in line], key=lambda r: (r[1], -len(r[0])))[0]
        print("found an arrow {} at position {} with length {}".format(result[0], result[1], len(result[0])))

    except Exception:
        return 0

Or you can use operator.itemgetter from the standard library to almost the same effect, and possibly gain some efficiency from fewer Python-level function calls:

from operator import itemgetter

arrowlist = {"->x","->", "->>", "-\\", "\\-","//--","->o","o\\--","<->","<->o"}

def cxn(line, arrowlist):
    try:
        # key first sorts on the position in string then alphanumerically
        # on the arrow match (i.e. -> and ->x matched in same position
        # will return -> because when sorted alphanumerically it is first)
        result = sorted([(x, line.find(x)) for x in arrowlist if x in line], key=(itemgetter(1,0)))[0]
        print("found an arrow {} at position {} with length {}".format(result[0], result[1], len(result[0])))

    except Exception:
        return 0
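
The sorted(...)[0] pattern in these snippets sorts every candidate only to keep the first one. A possible alternative, sketched here with a hypothetical helper named `find_first_arrow`, uses `min` with a `default` so that no exception handling is needed, and returns the values instead of printing them. It assumes the longest arrow should win when several start at the same position:

```python
def find_first_arrow(line, arrows):
    # Collect (position, arrow) for every arrow that occurs in the line.
    hits = [(line.find(a), a) for a in arrows if a in line]
    # Earliest position wins; ties at the same position go to the longest arrow.
    best = min(hits, key=lambda h: (h[0], -len(h[1])), default=None)
    if best is None:
        return None
    position, arrow = best
    return arrow, position, len(arrow)

arrows = ["->x", "->", "->>", "-\\", "\\-", "//--", "->o", "o\\--", "<->", "<->o"]
print(find_first_arrow("hello ->> there", arrows))
print(find_first_arrow("nothing to see", arrows))
```

Returning `None` for "no arrow" rather than 0 keeps the result distinguishable from a match at position 0.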

***NOTE: I am using a slightly different arrowlist than your example because the one you provided seems to break the default code formatting (likely because of quote closure issues). Remember you can prepend a string with 'r' like this: r"Text that can use special symbols like the escape \and\ be read in as a 'raw' string literal". Note that a raw string literal cannot end in a single backslash, so a trailing backslash still has to be written as "\\". See this question for more information about raw string literals.
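
As a small illustration of the note on raw strings (the variable names are chosen only for this example), the escaped and raw spellings of a backslash arrow produce the same two-character string, while an arrow that ends in a backslash needs the escaped form:

```python
escaped = "\\-"   # backslash written as an escape sequence
raw = r"\-"       # the same text written as a raw string literal

print(escaped == raw)   # both are a backslash followed by a hyphen
print(len(raw))         # two characters, not three

# r"-\" would be a syntax error: the backslash keeps the quote from
# closing the literal, so a trailing backslash must be escaped instead.
trailing = "-\\"
print(len(trailing))
```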

only problem with this solution, is that the matched string is not an exact word match, i.e. if the line is "hello ->>", the item returned is "->" since that is the first matched word. — Hightower, Mar 13 '19 at 04:59
@Hightower if you'd like to sort based on "longest word match", substitute the line above for the following: result = sorted([(x, line.find(x)) for x in arrowlist if x in line], key=lambda r: (r[1],-len(r[0])))[0] --- All that I changed was putting a "-" (without quotes) before the second lambda function argument. From the middle solution provided above. — etherwar, Mar 15 '19 at 18:12
@Hightower I have added the changed line at the appropriate location in a comment in the solution above... — etherwar, Mar 15 '19 at 18:59

score 1 · Answer 3 · answered Mar 07 '19 at 07:54

You could do something like

for item in arrowlist:
    if item in line:
        # line.find gives the start position of the arrow within the line
        print("found an arrow {} at position {}".format(item, line.find(item)))

kr8gz

score 0 · Answer 4 · answered Mar 13 '19 at 23:12

I wanted to post the answer that I came up with from the combination of feedback. As you can see, this approach, although really verbose and very inefficient, returns the correct arrow string found at the correct position index.

arrowlist = ["xxx->x", "->", "->>", "xxx->x","x->o", "xxx->"]
doc =""" @startuml
    n1 xxx->xx n2 : should not find
    n1 ->> n2 : must get the third arrow
    n2  xxx-> n3 : last item
    n3   -> n4 : second item
    n4    ->> n1 : third item"""

def checkForArrow(arrows,line):
    for a in arrows:
        words = line.split(' ')
        for word in words:
            if word == a:
                return(arrows.index(a),word,line.index(word))

for line in iter(doc.splitlines()):
    line = line.strip()
    if line != "":
        print (checkForArrow(arrowlist,line))

returns the following results: (index of item in arrowlist, the string found, index position of text in the line)

None
None
(2, '->>', 3)
(5, 'xxx->', 4)
(1, '->', 5)
(2, '->>', 6)
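
The same whole-word behaviour can also be sketched with a regular expression that requires whitespace, or the start or end of the line, on both sides of the arrow. This is an alternative to the split-based loop above rather than the poster's method, and the helper name `find_whole_arrow` is hypothetical. Unlike the loop, it returns the leftmost arrow in the line rather than the first one in list order, which is equivalent when each line holds a single arrow:

```python
import re

def find_whole_arrow(arrows, line):
    # Longest first, and escaped so every arrow is matched literally.
    ordered = sorted(set(arrows), key=len, reverse=True)
    alternatives = "|".join(re.escape(a) for a in ordered)
    # (?<!\S) and (?!\S) reject arrows glued to other non-whitespace
    # characters, so "xxx->xx" does not count as "xxx->x".
    pattern = re.compile(r"(?<!\S)(?:" + alternatives + r")(?!\S)")
    m = pattern.search(line)
    if m is None:
        return None
    return arrows.index(m.group()), m.group(), m.start()

arrowlist = ["xxx->x", "->", "->>", "xxx->x", "x->o", "xxx->"]
print(find_whole_arrow(arrowlist, "n1 xxx->xx n2 : should not find"))
print(find_whole_arrow(arrowlist, "n2  xxx-> n3 : last item"))
print(find_whole_arrow(arrowlist, "n4    ->> n1 : third item"))
```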
