394

How do I find a string between two substrings ('123STRINGabc' -> 'STRING')?

My current method is like this:

>>> start = 'asdf=5;'
>>> end = '123jasd'
>>> s = 'asdf=5;iwantthis123jasd'
>>> print((s.split(start))[1].split(end)[0])
iwantthis

However, this seems very inefficient and un-pythonic. What is a better way to do something like this?

Forgot to mention: The string might not start and end with start and end. They may have more characters before and after.

John Howard
  • 2
    Your additional information makes it almost necessary to use regexes for maximum correctness. – Jesse Dhillon Jul 30 '10 at 06:39
  • 31
    What's wrong with your own solution? I actually prefer it to the one you accepted. – reubano Nov 10 '14 at 12:06
  • I was trying to do this as well but for multiple instances it looks like using *? to do a non greedy search and then just cutting off the string with s[s.find(end)] worked for tracking multiple instances – lathomas64 Jan 09 '19 at 23:07
  • 1
    @reubano: one feature/bug of this code is that it does not raise an exception when the end text does not occur in the original text. The accepted answer fixes this. – Kasper Dokter Jan 19 '22 at 14:50
  • just a note: `s[1:-1]` will also do what you had.. though i like `.group(1)` or `(.*?)` non-greedy from below better – alchemy Oct 30 '22 at 23:04

20 Answers

500
import re

s = 'asdf=5;iwantthis123jasd'
result = re.search('asdf=5;(.*)123jasd', s)
print(result.group(1))
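The pattern above uses a greedy `(.*)`, so if the end marker occurs more than once the match runs to its last occurrence. A minimal sketch of the non-greedy variant, assuming you want the text up to the first end marker; the sample string `s2` is made up for illustration:

```python
import re

# Hypothetical sample: the end marker '123jasd' appears twice.
s2 = 'asdf=5;iwantthis123jasd and more 123jasd'

greedy = re.search('asdf=5;(.*)123jasd', s2)
lazy = re.search('asdf=5;(.*?)123jasd', s2)

print(greedy.group(1))  # runs up to the LAST end marker
print(lazy.group(1))    # stops at the FIRST end marker

# re.search returns None when nothing matches, so check before calling .group()
missing = re.search('asdf=5;(.*?)123jasd', 'no markers here')
print(missing is None)
```

Which of the two is right depends on whether the end marker can legitimately appear inside the text you want.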
andilabs
Nikolaus Gradwohl
185
s = "123123STRINGabcabc"

def find_between( s, first, last ):
    try:
        start = s.index( first ) + len( first )
        end = s.index( last, start )
        return s[start:end]
    except ValueError:
        return ""

def find_between_r( s, first, last ):
    try:
        start = s.rindex( first ) + len( first )
        end = s.rindex( last, start )
        return s[start:end]
    except ValueError:
        return ""


print(find_between( s, "123", "abc" ))
print(find_between_r( s, "123", "abc" ))

gives:

123STRING
STRINGabc

Note that, depending on the behavior you need, you can mix index and rindex calls or go with one of the versions above (the choice is loosely comparable to picking between the regex groups (.*) and (.*?)).
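As a rough sketch of mixing the two calls, the hypothetical helper below (not part of the original answer) pairs the first occurrence of `first` with the last occurrence of `last`:

```python
def find_between_outer(s, first, last):
    """Text between the FIRST `first` and the LAST `last` ('' if either is missing)."""
    try:
        start = s.index(first) + len(first)
        end = s.rindex(last, start)  # search only to the right of `first`
        return s[start:end]
    except ValueError:
        return ""


print(find_between_outer("123123STRINGabcabc", "123", "abc"))
```

For the sample string this returns the widest span, `123STRINGabc`.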

cji
  • 45
    He said that he wanted a way that was more Pythonic, and this is decidedly less so. I'm not sure why this answer was picked, even OP's own solution is better. – Jesse Dhillon Jul 30 '10 at 06:37
  • 2
    Agreed. I'd use the solution by @Tim McNamara , or the suggestion by the same of something like `start+test+end in substring` – jdd Jul 30 '10 at 12:31
  • Right, so it's less pythonic, ok. Is it less efficient than regexps too? And there's also @Prabhu answer you need to downvote, as it suggest the same solution. – cji Jul 30 '10 at 19:42
  • 1
    +1 too, for a more generic and reusable (by import) solution. – Ida Jun 24 '13 at 10:30
  • 3
    +1 since it works better than the other solutions in the case where `end` is found more than once. But I do agree that the OP's solution is more simpler. – reubano Nov 10 '14 at 12:08
  • @cji , It say "object has no attribute 'index'" what do I need to import? – Preshan Pradeepa Oct 14 '16 at 07:13
  • @cji does your solution grabs the first occurence only ? for example if we have both s = "123123STRINGabcabc" and s = "123123STRINGabc2abc" it will output STRINGabc only and ignore STRINGabc2? – user1788736 Dec 25 '16 at 13:51
139
start = 'asdf=5;'
end = '123jasd'
s = 'asdf=5;iwantthis123jasd'
print(s[s.find(start)+len(start):s.rfind(end)])

gives

iwantthis
David Arenburg
ansetou
  • 5
    I upvoted this because it works regardless of input string size. Some of the other methods assumed you'd know the length ahead of time. – Kenny Powers Jan 11 '17 at 03:16
  • 2
    yes it works by without input size however it does assume the string exists – Kevin Crum Feb 03 '21 at 02:31
  • This however extracts the string between the first and the LAST occurrence of the 2nd string, which may be incorrect, especially when parsing HTML. Unfortunately, this question appears closed so I cannot post my answer. – Lenka Pitonakova Jun 13 '23 at 18:40
63
s[len(start):-len(end)]
Tim McNamara
  • 14
    This is very nice, assuming start and end are always at the start and end of the string. Otherwise, I would probably use a regex. – jdd Jul 30 '10 at 06:01
  • 3
    I went the most Pythonic answer to the original question I could think of. Testing using the `in` operator would probably be faster than regexp. – Tim McNamara Jul 30 '10 at 06:13
41

Just converting the OP's own solution into an answer:

def find_between(s, start, end):
    return s.split(start)[1].split(end)[0]
Despe1990
reubano
39

String formatting adds some flexibility to what Nikolaus Gradwohl suggested. start and end can now be amended as desired.

import re

s = 'asdf=5;iwantthis123jasd'
start = 'asdf=5;'
end = '123jasd'

result = re.search('%s(.*)%s' % (start, end), s).group(1)
print(result)
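One caveat with interpolating `start` and `end` straight into the pattern: characters such as `|`, `*` or `(` are then treated as regex syntax. A small sketch, assuming the delimiters should be matched literally, that wraps them in `re.escape`; the helper name `between_escaped` and the second sample string are made up for illustration:

```python
import re


def between_escaped(s, start, end):
    """Return the text between literal `start` and `end`, or None if absent."""
    pattern = re.escape(start) + '(.*?)' + re.escape(end)
    match = re.search(pattern, s, re.DOTALL)  # DOTALL lets '.' span newlines
    return match.group(1) if match else None


print(between_escaped('asdf=5;iwantthis123jasd', 'asdf=5;', '123jasd'))
print(between_escaped('price (in USD): 42 [approx]', '(', ')'))
```

Returning `None` instead of raising is a design choice here; calling `.group(1)` directly on a failed search would raise `AttributeError`.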
Ooker
Tim McNamara
32

If you don't want to import anything, try the string method .index():

text = 'I want to find a string between two substrings'
left = 'find a '
right = 'between two'

# Output: 'string'
print(text[text.index(left)+len(left):text.index(right)])
Fernando Wittmann
  • 4
    I am loving it. simple, single-line, clear enough, no additional imports and works out of the box. I have no idea what is the deal with the over-engineered answers above. – PaulB Sep 12 '19 at 09:04
  • 1
    This is not checking whether the "right" text is actually at the right side of the text. If there are any occurrences of "right" before the text it won't work. – AndreFeijo Jun 20 '20 at 08:17
  • 1
    @AndreFeijo I agree with you, this was my first solution when trying to extract texts and I wanted to avoid regex weird syntax. However, in situations as you mentioned, I would use regex instead. – Fernando Wittmann Jul 10 '20 at 15:05
  • in that case (not all of cases) you could find left and then right, although it's a two line code `text = text[text.index(left)+len(left):len(role)]` `text = text[0:text.index(right)]` – ericksho Jul 27 '22 at 19:45
  • Hi Fernando, for this text "ADRIANOPICCININIC216186162022-07-27 09:36:33Z" i am looking to extract only "C21618616", how can i do that? – Arun Mohan Aug 11 '22 at 08:34
16
source='your token _here0@df and maybe _here1@df or maybe _here2@df'
start_sep='_'
end_sep='@df'
result=[]
tmp=source.split(start_sep)
for par in tmp:
  if end_sep in par:
    result.append(par.split(end_sep)[0])

print(result)

must show: here0, here1, here2

A regex is arguably cleaner, but it means importing the re module, and you may prefer to stick with plain string methods.
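For comparison, a sketch of a regex version of the same extraction. It relies only on the standard-library `re` module and a non-greedy group; the variable names mirror the snippet above:

```python
import re

source = 'your token _here0@df and maybe _here1@df or maybe _here2@df'
start_sep = '_'
end_sep = '@df'

# re.escape keeps the separators literal; (.*?) stops at the nearest end_sep
pattern = re.escape(start_sep) + '(.*?)' + re.escape(end_sep)
result = re.findall(pattern, source)
print(result)
```

`re.findall` returns the captured group for every non-overlapping match, so no explicit loop is needed.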

tstoev
15

Here is one way to do it

_,_,rest = s.partition(start)
result,_,_ = rest.partition(end)
print(result)

Another way using regexp

import re
print(re.findall(re.escape(start)+"(.*)"+re.escape(end),s)[0])

or

print(re.search(re.escape(start)+"(.*)"+re.escape(end),s).group(1))
John La Rooy
6

Here is a function I wrote that returns a list of the string(s) found between string1 and string2.

def GetListOfSubstrings(stringSubject,string1,string2):
    MyList = []
    intstart=0
    strlength=len(stringSubject)
    continueloop = 1

    while(intstart < strlength and continueloop == 1):
        intindex1=stringSubject.find(string1,intstart)
        if(intindex1 != -1): #The substring was found, lets proceed
            intindex1 = intindex1+len(string1)
            intindex2 = stringSubject.find(string2,intindex1)
            if(intindex2 != -1):
                subsequence=stringSubject[intindex1:intindex2]
                MyList.append(subsequence)
                intstart=intindex2+len(string2)
            else:
                continueloop=0
        else:
            continueloop=0
    return MyList


#Usage Example
mystring="s123y123o123pp123y6"
List = GetListOfSubstrings(mystring,"1","y68")
for x in range(0, len(List)):
               print(List[x])
output:


mystring="s123y123o123pp123y6"
List = GetListOfSubstrings(mystring,"1","3")
for x in range(0, len(List)):
              print(List[x])
output:
    2
    2
    2
    2

mystring="s123y123o123pp123y6"
List = GetListOfSubstrings(mystring,"1","y")
for x in range(0, len(List)):
               print(List[x])
output:
23
23o123pp123
Mnyikka
5

To extract STRING, try:

myString = '123STRINGabc'
startString = '123'
endString = 'abc'

mySubString=myString[myString.find(startString)+len(startString):myString.find(endString)]
4

You can simply use this code or copy the function below. All neatly in one line.

def substring(whole, sub1, sub2):
    return whole[whole.index(sub1) : whole.index(sub2)]

If you run the function as follows.

print(substring("5+(5*2)+2", "(", ")"))

You will probably be left with the output:

(5*2

rather than

5*2

If you want to have the sub-strings on the end of the output the code must look like below.

return whole[whole.index(sub1) : whole.index(sub2) + 1]

But if you don't want the substrings on the end the +1 must be on the first value.

return whole[whole.index(sub1) + 1 : whole.index(sub2)]
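The `+ 1` adjustments only hold for one-character delimiters. A sketch that generalizes them by using the delimiter lengths, and that looks for `sub2` only after `sub1`; `substring_between` and its `include` flag are hypothetical names, not from the answer above:

```python
def substring_between(whole, sub1, sub2, include=False):
    """Slice `whole` between sub1 and sub2; raises ValueError if either is missing."""
    start = whole.index(sub1)
    inner_start = start + len(sub1)
    end = whole.index(sub2, inner_start)  # first sub2 that follows sub1
    if include:
        return whole[start:end + len(sub2)]  # keep both delimiters
    return whole[inner_start:end]            # drop both delimiters


print(substring_between("5+(5*2)+2", "(", ")"))
print(substring_between("5+(5*2)+2", "(", ")", include=True))
```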
3

These solutions assume the start string and final string are different. Here is a solution I use for an entire file when the initial and final indicators are the same, assuming the entire file is read using readlines():

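def extractstring(line,flag='$'):
    string=None #returned as-is when the flag is not in the line
    if flag in line: # $ is the flag
        dex1=line.index(flag)
        subline=line[dex1+1:-1] #leave out flag (+1) to end of line (drops the last character, e.g. the newline)
        dex2=subline.index(flag) #raises ValueError if there is no second flag
        string=subline[0:dex2].strip() #does not include last flag, strip whitespace
    return(string)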

Example:

lines=['asdf 1qr3 qtqay 45q at $A NEWT?$ asdfa afeasd',
    'afafoaltat $I GOT BETTER!$ derpity derp derp']
for line in lines:
    string=extractstring(line,flag='$')
    print(string)

Gives:

A NEWT?
I GOT BETTER!
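When the opening and closing flag are the same character, another option is to split on the flag: the pieces at odd positions then sit between a pair of flags. A sketch under the assumption that flags come in pairs; `extract_flagged` is a made-up name:

```python
def extract_flagged(line, flag='$'):
    """Return every stripped substring enclosed by a pair of `flag` markers."""
    pieces = line.split(flag)
    if len(pieces) % 2 == 0:
        # An odd number of flags leaves a trailing unpaired piece; drop it.
        pieces = pieces[:-1]
    return [piece.strip() for piece in pieces[1::2]]


lines = ['asdf 1qr3 qtqay 45q at $A NEWT?$ asdfa afeasd',
         'afafoaltat $I GOT BETTER!$ derpity derp derp']
for line in lines:
    print(extract_flagged(line))
```

Unlike the function above, this returns a list, so a line with several flagged sections yields all of them and a line with none yields an empty list.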
Wesley Kitlasten
2

This is essentially cji's answer - Jul 30 '10 at 5:58. I changed the try except structure for a little more clarity on what was causing the exception.

import sys

def find_between( inputStr, firstSubstr, lastSubstr ):
    '''
    find between firstSubstr and lastSubstr in inputStr  STARTING FROM THE LEFT
        http://stackoverflow.com/questions/3368969/find-string-between-two-substrings
            above also has a func that does this FROM THE RIGHT
    '''
    start, end = (-1,-1)
    try:
        start = inputStr.index( firstSubstr ) + len( firstSubstr )
    except ValueError:
        print('    ValueError: ', end=' ')
        print("firstSubstr=%s  -  "%( firstSubstr ), end=' ')
        print(sys.exc_info()[1])

    try:
        end = inputStr.index( lastSubstr, start )
    except ValueError:
        print('    ValueError: ', end=' ')
        print("lastSubstr=%s  -  "%( lastSubstr ), end=' ')
        print(sys.exc_info()[1])

    return inputStr[start:end]
2
from timeit import timeit
from re import search, DOTALL


def partition_find(string, start, end):
    return string.partition(start)[2].rpartition(end)[0]


def re_find(string, start, end):
    # applying re.escape to start and end would be safer
    return search(start + '(.*)' + end, string, DOTALL).group(1)


def index_find(string, start, end):
    return string[string.find(start) + len(start):string.rfind(end)]


# The wikitext of the "Alan Turing law" article from English Wikipedia
# https://en.wikipedia.org/w/index.php?title=Alan_Turing_law&action=edit&oldid=763725886
string = """..."""
start = '==Proposals=='
end = '==Rival bills=='

assert index_find(string, start, end) \
       == partition_find(string, start, end) \
       == re_find(string, start, end)

print('index_find', timeit(
    'index_find(string, start, end)',
    globals=globals(),
    number=100_000,
))

print('partition_find', timeit(
    'partition_find(string, start, end)',
    globals=globals(),
    number=100_000,
))

print('re_find', timeit(
    're_find(string, start, end)',
    globals=globals(),
    number=100_000,
))

Result:

index_find 0.35047444528454114
partition_find 0.5327825636197754
re_find 7.552149639286381

re_find was more than 20 times slower than index_find in this example.

AXO
1

My method would be something like:

find index of start string in s => i
find index of end string in s => j

substring = substring(i+len(start) to j-1)
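The steps above translate fairly directly into Python. A sketch using `str.find`, which returns -1 rather than raising when a substring is missing; the function name and the choice to return `None` in that case are assumptions, not part of the answer:

```python
def between_markers(s, start, end):
    """Return the text between `start` and the next `end`, or None if either is missing."""
    i = s.find(start)
    if i == -1:
        return None
    begin = i + len(start)
    j = s.find(end, begin)  # look for `end` only after `start`
    if j == -1:
        return None
    return s[begin:j]  # the slice end is exclusive, so no "- 1" is needed


print(between_markers('asdf=5;iwantthis123jasd', 'asdf=5;', '123jasd'))
```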
josh
1

I posted this before as a code snippet on Daniweb:

# picking up piece of string between separators
# function using partition, like partition, but drops the separators
def between(left,right,s):
    before,_,a = s.partition(left)
    a,_,after = a.partition(right)
    return before,a,after

s = "bla bla blaa <a>data</a> lsdjfasdjöf (important notice) 'Daniweb forum' tcha tcha tchaa"
print(between('<a>','</a>',s))
print(between('(',')',s))
print(between("'","'",s))

""" Output:
('bla bla blaa ', 'data', " lsdjfasdjöf (important notice) 'Daniweb forum' tcha tcha tchaa")
('bla bla blaa <a>data</a> lsdjfasdjöf ', 'important notice', " 'Daniweb forum' tcha tcha tchaa")
('bla bla blaa <a>data</a> lsdjfasdjöf (important notice) ', 'Daniweb forum', ' tcha tcha tchaa')
"""
Tony Veijalainen
1

Parsing delimited text from different email platforms posed a larger version of this problem. The delimiters generally come as a START and a STOP pair. Delimiter characters that double as regex wildcards kept choking the regex. The problem with split is mentioned here and elsewhere: the delimiter itself gets consumed. It occurred to me to use replace() to give split() something else to consume. Chunk of code:

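import logging

nuke = '~~~'  # sentinel for split() to consume; must not occur in the input
start = '|*'
stop = '*|'
# textIn is the input text, defined elsewhere
julien = (textIn.replace(start,nuke + start).replace(stop,stop + nuke).split(nuke))
keep = [chunk for chunk in julien if start in chunk and stop in chunk]
logging.info('keep: %s',keep)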
Matthew Dunn
0

Building on Nikolaus Gradwohl's answer, I needed to get the version number (i.e., 0.0.2) between 'ui:' and '-' from the file content below (filename: docker-compose.yml):

version: '3.1'
services:
  ui:
    image: repo-pkg.dev.io:21/website/ui:0.0.2-QA1
    #network_mode: host
    ports:
      - 443:9999
    ulimits:
      nofile:test

and this is how it worked for me (python script):

import re

# read the whole file; the with block closes it afterwards
with open('docker-compose.yml', 'r') as f:
    lines = f.read()
result = re.search('ui:(.*)-', lines)
print(result.group(1))


Result:
0.0.2
Akshay
-3

This seems much more straightforward to me:

import re

s = 'asdf=5;iwantthis123jasd'
x= re.search('iwantthis',s)
print(s[x.start():x.end()])
  • This requires you to know the string you're looking for, it doesn't find whatever string is between the two substrings, as the OP requested. The OP wants to be able to get the middle no matter what it is, and this answer would require you to know the middle before you start. – Korzak May 09 '19 at 20:22