Find string between two substrings

Question

How do I find a string between two substrings ('123STRINGabc' -> 'STRING')?

My current method is like this:

>>> start = 'asdf=5;'
>>> end = '123jasd'
>>> s = 'asdf=5;iwantthis123jasd'
>>> print((s.split(start))[1].split(end)[0])
iwantthis

However, this seems very inefficient and un-pythonic. What is a better way to do something like this?

Forgot to mention: The string might not start and end with start and end. They may have more characters before and after.

Your additional information makes it almost necessary to use regexes for maximum correctness. — Jesse Dhillon, Jul 30 '10 at 06:39
What's wrong with your own solution? I actually prefer it to the one you accepted. — reubano, Nov 10 '14 at 12:06
I was trying to do this as well but for multiple instances it looks like using *? to do a non greedy search and then just cutting off the string with s[s.find(end)] worked for tracking multiple instances — lathomas64, Jan 09 '19 at 23:07
@reubano: one feature/bug of this code is that it does not raise an exception when the end text does not occur in the original text. The accepted answer fixes this. — Kasper Dokter, Jan 19 '22 at 14:50
just a note: `s[1:-1]` will also do what you had.. though i like `.group(1)` or `(.*?)` non-greedy from below better — alchemy, Oct 30 '22 at 23:04

score 500 · Accepted Answer · edited Apr 26 '19 at 15:51

import re

s = 'asdf=5;iwantthis123jasd'
result = re.search('asdf=5;(.*)123jasd', s)
print(result.group(1))
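
One caveat, also raised in the comments below: `(.*)` is greedy, so when the end marker occurs more than once the capture runs up to its last occurrence. A minimal sketch of the difference, using the input string from one of the comments (the variable names `s2`, `greedy` and `lazy` are illustrative):

```python
import re

# Input in which the end marker '123jasd' appears twice.
s2 = 'asdf=5;I_WANT_ONLY_THIS123jasdNOT_THIS123jasd'

# Greedy: (.*) captures up to the LAST occurrence of the end marker.
greedy = re.search('asdf=5;(.*)123jasd', s2).group(1)

# Non-greedy: (.*?) stops at the FIRST occurrence of the end marker.
lazy = re.search('asdf=5;(.*?)123jasd', s2).group(1)

print(greedy)  # I_WANT_ONLY_THIS123jasdNOT_THIS
print(lazy)    # I_WANT_ONLY_THIS
```

Which of the two is right depends on the data; for the single-occurrence string in the question both give the same result.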

edited Apr 26 '19 at 15:51

andilabs


answered Jul 30 '10 at 05:59

Nikolaus Gradwohl


@Jesse Dhillon -- what about @Tim McNamara's suggestion of something like `''.join(start,test,end) in a_string`? – jdd Jul 30 '10 at 13:13
This method is shorter and is similar to the javascript method. – leonneo Dec 07 '13 at 10:42
Will this work if there are spaces in the start string and the end string? – chishaku Feb 05 '15 at 07:30
works fine with spaces in start and end string – Nikolaus Gradwohl Feb 05 '15 at 10:20
maybe something like `pat = re.compile('asdf=5;(.*)123jasd')` would make `pat.search(s).group(1)` more reusable. – tipanverella Mar 31 '16 at 13:24
`print(result.match)` doesn't work (delete comment) – alchemy Jan 14 '19 at 04:49
What if I need to find between 2 substrings and the second one is repeated after first one? Something like this: s= 'asdf=5;I_WANT_ONLY_THIS123jasdNOT_THIS123jasd – Sep 19 '19 at 15:13
Add `?` to make it non greedy `result = re.search('asdf=5;(.*?)123jasd', s)` – do-ic Nov 21 '20 at 15:13
How can this be amended to select data between start/end if the start/end is duplicated? e.g. say i wanted to select both strings separately between <> `i would like to send to ` and return `result1='message' `and `result2 = 'name'` – Sql_Pete_Belfast Mar 05 '21 at 11:56
@Sql_Pete_Belfast I would like to know this as well. – sammosummo Dec 16 '21 at 00:25
@Sql_Pete_Belfast you could use re.findall() instead of re.search(). re.findall returns a list of matching strings – Chris Searcy Feb 02 '23 at 05:38
Can there be more than 2 results? – alper May 31 '23 at 14:31
This however extracts the string between the first and the LAST occurrence of the 2nd string, which may be incorrect, especially when parsing HTML. Unfortunately, this question appears closed so I cannot post my answer. – Lenka Pitonakova Jun 13 '23 at 18:39

cji · Answer 2 · 2010-07-30T06:27:01.023

185

s = "123123STRINGabcabc"

def find_between( s, first, last ):
    try:
        start = s.index( first ) + len( first )
        end = s.index( last, start )
        return s[start:end]
    except ValueError:
        return ""

def find_between_r( s, first, last ):
    try:
        start = s.rindex( first ) + len( first )
        end = s.rindex( last, start )
        return s[start:end]
    except ValueError:
        return ""


print(find_between( s, "123", "abc" ))
print(find_between_r( s, "123", "abc" ))

gives:

123STRING
STRINGabc

I thought it should be noted: depending on what behavior you need, you can mix index and rindex calls or go with one of the above versions (the choice is similar to picking between greedy (.*) and non-greedy (.*?) regex groups).

edited Jul 30 '10 at 06:27

answered Jul 30 '10 at 05:58

cji


He said that he wanted a way that was more Pythonic, and this is decidedly less so. I'm not sure why this answer was picked, even OP's own solution is better. – Jesse Dhillon Jul 30 '10 at 06:37
Agreed. I'd use the solution by @Tim McNamara , or the suggestion by the same of something like `start+test+end in substring` – jdd Jul 30 '10 at 12:31
Right, so it's less pythonic, ok. Is it less efficient than regexps too? And there's also @Prabhu answer you need to downvote, as it suggest the same solution. – cji Jul 30 '10 at 19:42
+1 too, for a more generic and reusable (by import) solution. – Ida Jun 24 '13 at 10:30
+1 since it works better than the other solutions in the case where `end` is found more than once. But I do agree that the OP's solution is simpler. – reubano Nov 10 '14 at 12:08
@cji , It say "object has no attribute 'index'" what do I need to import? – Preshan Pradeepa Oct 14 '16 at 07:13
@cji does your solution grabs the first occurence only ? for example if we have both s = "123123STRINGabcabc" and s = "123123STRINGabc2abc" it will output STRINGabc only and ignore STRINGabc2? – user1788736 Dec 25 '16 at 13:51

score 139 · Answer 3 · edited Jan 03 '17 at 12:29

start = 'asdf=5;'
end = '123jasd'
s = 'asdf=5;iwantthis123jasd'
print(s[s.find(start)+len(start):s.rfind(end)])

gives

iwantthis

edited Jan 03 '17 at 12:29

David Arenburg


answered Sep 13 '13 at 15:54

ansetou


I upvoted this because it works regardless of input string size. Some of the other methods assumed you'd know the length ahead of time. – Kenny Powers Jan 11 '17 at 03:16
Yes, it works regardless of input size; however, it does assume the string exists. – Kevin Crum Feb 03 '21 at 02:31
This however extracts the string between the first and the LAST occurrence of the 2nd string, which may be incorrect, especially when parsing HTML. Unfortunately, this question appears closed so I cannot post my answer. – Lenka Pitonakova Jun 13 '23 at 18:40

score 63 · Answer 4 · answered Jul 30 '10 at 05:56

s[len(start):-len(end)]

answered Jul 30 '10 at 05:56

Tim McNamara


This is very nice, assuming start and end are always at the start and end of the string. Otherwise, I would probably use a regex. – jdd Jul 30 '10 at 06:01
I went the most Pythonic answer to the original question I could think of. Testing using the `in` operator would probably be faster than regexp. – Tim McNamara Jul 30 '10 at 06:13

score 41 · Answer 5 · edited Jul 12 '23 at 14:07

Just converting the OP's own solution into an answer:

def find_between(s, start, end):
    return s.split(start)[1].split(end)[0]
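
Note that this version raises IndexError when start is missing and silently returns everything after start when end is missing. A hedged sketch of a stricter variant built on str.partition; the name `find_between_strict` and the choice to return None are illustrative assumptions, not part of the original answer:

```python
def find_between_strict(s, start, end):
    """Return the text between start and end, or None if either is missing.

    Hypothetical helper: the name and the None return value are illustrative.
    """
    # partition returns (head, sep, tail); sep is '' when the separator is absent.
    _, found_start, rest = s.partition(start)
    if not found_start:
        return None
    result, found_end, _ = rest.partition(end)
    if not found_end:
        return None
    return result


print(find_between_strict('asdf=5;iwantthis123jasd', 'asdf=5;', '123jasd'))  # iwantthis
print(find_between_strict('asdf=5;iwantthis', 'asdf=5;', '123jasd'))         # None
```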

edited Jul 12 '23 at 14:07

Despe1990


answered Nov 10 '14 at 12:10

reubano


If you are making someone else's solution as your own, you probably should make it a Community Wiki. – David Arenburg Jan 03 '17 at 12:35

score 39 · Answer 6 · edited Aug 27 '15 at 14:31

String formatting adds some flexibility to what Nikolaus Gradwohl suggested. start and end can now be amended as desired.

import re

s = 'asdf=5;iwantthis123jasd'
start = 'asdf=5;'
end = '123jasd'

result = re.search('%s(.*)%s' % (start, end), s).group(1)
print(result)
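
Two things to watch with this version: the delimiters are interpolated into the pattern unescaped, and re.search returns None when nothing matches, so `.group(1)` then raises AttributeError. A minimal sketch that addresses both; the helper name `find_between_re`, the None return value, and the switch to a non-greedy group are assumptions made for illustration:

```python
import re


def find_between_re(s, start, end):
    """Hypothetical helper: return the first text between start and end, or None."""
    # re.escape makes regex metacharacters in the delimiters match literally.
    pattern = re.escape(start) + '(.*?)' + re.escape(end)
    match = re.search(pattern, s)
    # re.search returns None when there is no match, so check before .group().
    return match.group(1) if match else None


print(find_between_re('asdf=5;iwantthis123jasd', 'asdf=5;', '123jasd'))  # iwantthis
print(find_between_re('no delimiters here', 'asdf=5;', '123jasd'))       # None
```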

edited Aug 27 '15 at 14:31

Ooker


answered Jul 30 '10 at 07:47

Tim McNamara


I'm getting this: `'NoneType' object has no attribute 'group'` – Dentrax Jan 05 '19 at 10:36
That means a match wasn't found. Check your regular expression. – Tim McNamara Jan 10 '19 at 21:54
@Dentrax is right: should return nothing not an error – cwhisperer Aug 26 '20 at 15:47
I think Tim means that the search should return None as there were no matches. Since the search returned 'None', applying of .group(1) at the end causes the error. – MTay Sep 30 '20 at 21:28

Fernando Wittmann · Answer 7 · 2020-09-17T21:56:08.323

32

If you don't want to import anything, try the string method .index():

text = 'I want to find a string between two substrings'
left = 'find a '
right = 'between two'

# Output: 'string'
print(text[text.index(left)+len(left):text.index(right)])

edited Sep 17 '20 at 21:56

answered Jul 21 '18 at 13:32

Fernando Wittmann


I am loving it. simple, single-line, clear enough, no additional imports and works out of the box. I have no idea what is the deal with the over-engineered answers above. – PaulB Sep 12 '19 at 09:04
This is not checking whether the "right" text is actually at the right side of the text. If there are any occurrences of "right" before the text it won't work. – AndreFeijo Jun 20 '20 at 08:17
@AndreFeijo I agree with you, this was my first solution when trying to extract texts and I wanted to avoid regex weird syntax. However, in situations as you mentioned, I would use regex instead. – Fernando Wittmann Jul 10 '20 at 15:05
in that case (not all cases) you could find left and then right, although it takes two lines of code: `text = text[text.index(left)+len(left):]` `text = text[:text.index(right)]` – ericksho Jul 27 '22 at 19:45
Hi Fernando, for this text "ADRIANOPICCININIC216186162022-07-27 09:36:33Z" i am looking to extract only "C21618616", how can i do that? – Arun Mohan Aug 11 '22 at 08:34

score 16 · Answer 8 · answered Sep 24 '13 at 11:23

source='your token _here0@df and maybe _here1@df or maybe _here2@df'
start_sep='_'
end_sep='@df'
result=[]
tmp=source.split(start_sep)
for par in tmp:
  if end_sep in par:
    result.append(par.split(end_sep)[0])

print(result)

prints: ['here0', 'here1', 'here2']

Regex is better, but it requires an extra import (the re module), and you may want to stick to plain string methods.
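
For comparison, a sketch of the regex route on the same input. It assumes nothing beyond the re module, which ships with Python's standard library; the non-greedy group and the use of re.escape are choices made for this example:

```python
import re

source = 'your token _here0@df and maybe _here1@df or maybe _here2@df'
start_sep = '_'
end_sep = '@df'

# Non-greedy group between the escaped separators; findall returns every match.
result = re.findall(re.escape(start_sep) + '(.*?)' + re.escape(end_sep), source)
print(result)  # ['here0', 'here1', 'here2']
```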

answered Sep 24 '13 at 11:23

tstoev


This worked for me. Thank you for extending the solution for multiple occurrences. – Sterex Jan 24 '16 at 10:48
I was exactly looking for this, It helps for multiple occurrences, This post needs more upvotes :p. – ohsoifelse Jun 18 '19 at 16:08

John La Rooy · Answer 9 · 2010-07-30T06:03:29.703

15

Here is one way to do it

# s, start and end as defined in the question
_,_,rest = s.partition(start)
result,_,_ = rest.partition(end)
print(result)

Another way using regexp

import re
print(re.findall(re.escape(start)+"(.*)"+re.escape(end),s)[0])

or

print(re.search(re.escape(start)+"(.*)"+re.escape(end),s).group(1))

edited Jul 30 '10 at 06:03

answered Jul 30 '10 at 05:58

John La Rooy


score 6 · Answer 10 · answered Jan 19 '18 at 08:37

Here is a function I wrote that returns a list of the strings found between string1 and string2.

def GetListOfSubstrings(stringSubject,string1,string2):
    MyList = []
    intstart=0
    strlength=len(stringSubject)
    continueloop = 1

    while(intstart < strlength and continueloop == 1):
        intindex1=stringSubject.find(string1,intstart)
        if(intindex1 != -1): #The substring was found, lets proceed
            intindex1 = intindex1+len(string1)
            intindex2 = stringSubject.find(string2,intindex1)
            if(intindex2 != -1):
                subsequence=stringSubject[intindex1:intindex2]
                MyList.append(subsequence)
                intstart=intindex2+len(string2)
            else:
                continueloop=0
        else:
            continueloop=0
    return MyList


#Usage Example
mystring="s123y123o123pp123y6"
List = GetListOfSubstrings(mystring,"1","y68")
for x in range(0, len(List)):
    print(List[x])

output: (nothing is printed, because "y68" does not occur in mystring)


mystring="s123y123o123pp123y6"
List = GetListOfSubstrings(mystring,"1","3")
for x in range(0, len(List)):
    print(List[x])

output:
2
2
2
2

mystring="s123y123o123pp123y6"
List = GetListOfSubstrings(mystring,"1","y")
for x in range(0, len(List)):
    print(List[x])

output:
23
23o123pp123

Really good and helpful answer. Thank you! – ibarant Jul 23 '19 at 15:16
Extraordinary answer. I'd hire a guy like you – Abhishek Singh Jan 18 '21 at 18:38

Reinstate Monica - Goodbye SE · Answer 11 · 2014-04-24T09:21:35.170

5

To extract STRING, try:

myString = '123STRINGabc'
startString = '123'
endString = 'abc'

mySubString=myString[myString.find(startString)+len(startString):myString.find(endString)]

edited Apr 24 '14 at 09:21

answered Feb 20 '13 at 11:51

Reinstate Monica - Goodbye SE


score 4 · Answer 12 · answered Jan 15 '17 at 10:28

You can simply use this code or copy the function below. All neatly in one line.

def substring(whole, sub1, sub2):
    return whole[whole.index(sub1) : whole.index(sub2)]

If you run the function as follows.

print(substring("5+(5*2)+2", "(", ")"))

You will probably be left with the output:

(5*2

rather than

5*2

If you want both delimiters included in the output, the code must look like below.

return whole[whole.index(sub1) : whole.index(sub2) + 1]

But if you want neither delimiter in the output, the +1 must be on the first value instead (for a delimiter longer than one character, add len(sub1) rather than 1).

return whole[whole.index(sub1) + 1 : whole.index(sub2)]

score 3 · Answer 13 · answered May 19 '16 at 18:51

These solutions assume the start string and final string are different. Here is a solution I use for an entire file when the initial and final indicators are the same, assuming the entire file is read using readlines():

def extractstring(line,flag='$'):
    string=None #returned as-is when the flag is not in the line
    if flag in line: # $ is the flag
        dex1=line.index(flag)
        subline=line[dex1+1:-1] #leave out flag (+1); -1 drops the last character (the newline from readlines())
        dex2=subline.index(flag)
        string=subline[0:dex2].strip() #does not include last flag, strip whitespace
    return(string)

Example:

lines=['asdf 1qr3 qtqay 45q at $A NEWT?$ asdfa afeasd',
    'afafoaltat $I GOT BETTER!$ derpity derp derp']
for line in lines:
    string=extractstring(line,flag='$')
    print(string)

Gives:

A NEWT?
I GOT BETTER!

score 2 · Answer 14 · answered Jan 10 '15 at 20:01

This is essentially cji's answer - Jul 30 '10 at 5:58. I changed the try except structure for a little more clarity on what was causing the exception.

import sys

def find_between( inputStr, firstSubstr, lastSubstr ):
    '''
    find between firstSubstr and lastSubstr in inputStr  STARTING FROM THE LEFT
        http://stackoverflow.com/questions/3368969/find-string-between-two-substrings
            above also has a func that does this FROM THE RIGHT
    '''
    start, end = (-1,-1)
    try:
        start = inputStr.index( firstSubstr ) + len( firstSubstr )
    except ValueError:
        # report which substring was missing, then carry on
        print('    ValueError: ', end=' ')
        print("firstSubstr=%s  -  "%( firstSubstr ), end=' ')
        print(sys.exc_info()[1])

    try:
        end = inputStr.index( lastSubstr, start )
    except ValueError:
        print('    ValueError: ', end=' ')
        print("lastSubstr=%s  -  "%( lastSubstr ), end=' ')
        print(sys.exc_info()[1])

    return inputStr[start:end]

score 2 · Answer 15 · answered Feb 05 '17 at 05:59

from timeit import timeit
from re import search, DOTALL


def partition_find(string, start, end):
    return string.partition(start)[2].rpartition(end)[0]


def re_find(string, start, end):
    # applying re.escape to start and end would be safer
    return search(start + '(.*)' + end, string, DOTALL).group(1)


def index_find(string, start, end):
    return string[string.find(start) + len(start):string.rfind(end)]


# The wikitext of the "Alan Turing law" article from English Wikipedia
# https://en.wikipedia.org/w/index.php?title=Alan_Turing_law&action=edit&oldid=763725886
string = """..."""
start = '==Proposals=='
end = '==Rival bills=='

assert index_find(string, start, end) \
       == partition_find(string, start, end) \
       == re_find(string, start, end)

print('index_find', timeit(
    'index_find(string, start, end)',
    globals=globals(),
    number=100_000,
))

print('partition_find', timeit(
    'partition_find(string, start, end)',
    globals=globals(),
    number=100_000,
))

print('re_find', timeit(
    're_find(string, start, end)',
    globals=globals(),
    number=100_000,
))

Result:

index_find 0.35047444528454114
partition_find 0.5327825636197754
re_find 7.552149639286381

re_find was roughly 21 times slower than index_find in this example.

josh · Answer 16 · 2010-07-30T06:20:22.700

1

My method will be to do something like,

find index of start string in s => i
find index of end string in s => j

substring = substring(i+len(start) to j-1)
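
The steps above might be written in Python roughly as follows. This is only a sketch: the function name `between` is illustrative, the end string is searched for after the start string (an assumption the pseudocode does not state), and returning None when either string is missing is likewise an assumption.

```python
def between(s, start, end):
    """Hypothetical translation of the pseudocode above."""
    # Index of the start string in s (-1 when absent).
    i = s.find(start)
    if i == -1:
        return None
    i += len(start)
    # Index of the end string, searching only after the start string.
    j = s.find(end, i)
    if j == -1:
        return None
    # A slice excludes its end index, so this covers positions i .. j-1.
    return s[i:j]


print(between('asdf=5;iwantthis123jasd', 'asdf=5;', '123jasd'))  # iwantthis
```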

edited Jul 30 '10 at 06:20

answered Jul 30 '10 at 05:56

josh


score 1 · Answer 17 · answered Jul 30 '10 at 07:16

This I posted before as code snippet in Daniweb:

# picking up piece of string between separators
# function using partition, like partition, but drops the separators
def between(left,right,s):
    before,_,a = s.partition(left)
    a,_,after = a.partition(right)
    return before,a,after

s = "bla bla blaa <a>data</a> lsdjfasdjöf (important notice) 'Daniweb forum' tcha tcha tchaa"
print(between('<a>','</a>',s))
print(between('(',')',s))
print(between("'","'",s))

""" Output:
('bla bla blaa ', 'data', " lsdjfasdjöf (important notice) 'Daniweb forum' tcha tcha tchaa")
('bla bla blaa <a>data</a> lsdjfasdjöf ', 'important notice', " 'Daniweb forum' tcha tcha tchaa")
('bla bla blaa <a>data</a> lsdjfasdjöf (important notice) ', 'Daniweb forum', ' tcha tcha tchaa')
"""

score 1 · Answer 18 · answered Oct 05 '17 at 00:39

Parsing text with delimiters from different email platforms posed a larger-sized version of this problem. They generally have a START and a STOP. Delimiter characters for wildcards kept choking regex. The problem with split is mentioned here & elsewhere - oops, delimiter character gone. It occurred to me to use replace() to give split() something else to consume. Chunk of code:

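import logging

nuke = '~~~'  # placeholder; must not occur in the text itself
start = '|*'
stop = '*|'
# textIn is the text being parsed (defined elsewhere)
julien = (textIn.replace(start,nuke + start).replace(stop,stop + nuke).split(nuke))
keep = [chunk for chunk in julien if start in chunk and stop in chunk]
logging.info('keep: %s',keep)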
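
If the regex was choking because `|` and `*` are regex metacharacters, escaping the delimiters with re.escape is one possible alternative. A sketch under that assumption, with a made-up value for `textIn` since the original input is not shown:

```python
import re

start = '|*'
stop = '*|'
# Hypothetical sample input; the original textIn is not shown in the answer.
textIn = 'Dear |*FNAME*|, your code is |*CODE*|. Bye'

# re.escape turns '|*' into a literal match instead of regex syntax.
# The outer group keeps the delimiters, like the `keep` list above.
pattern = '(' + re.escape(start) + '.*?' + re.escape(stop) + ')'
keep = re.findall(pattern, textIn)
print(keep)  # ['|*FNAME*|', '|*CODE*|']
```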

Akshay · Answer 19 · 2018-04-18T09:34:18.173

0

Building on Nikolaus Gradwohl's answer, I needed to get the version number (i.e., 0.0.2) between 'ui:' and '-' from the file content below (filename: docker-compose.yml):

version: '3.1'
services:
  ui:
    image: repo-pkg.dev.io:21/website/ui:0.0.2-QA1
    #network_mode: host
    ports:
      - 443:9999
    ulimits:
      nofile:test

and this is how it worked for me (python script):

import re

# read the whole file; the with-block closes it afterwards
with open('docker-compose.yml', 'r') as f:
    lines = f.read()
result = re.search('ui:(.*)-', lines)
print(result.group(1))


Result:
0.0.2

edited Apr 18 '18 at 09:34

answered Apr 18 '18 at 09:29

Akshay


Using Docker for simple task is bad practice. – Dmitry Bubnenkov May 16 '21 at 13:37
@DmitryBubnenkov what does the above post has to do anything with Docker usage/implementation? It's all about finding a string between two substrings in a file. – Akshay May 16 '21 at 21:37
I thought this use case was great. My use case was a css file with encoded base64 text it just shows not every text file needs to be .txt – digitaluniverse Aug 07 '22 at 05:19

score -3 · Answer 20 · answered Apr 11 '17 at 02:53

This seems much more straight forward to me:

import re

s = 'asdf=5;iwantthis123jasd'
x= re.search('iwantthis',s)
print(s[x.start():x.end()])

answered Apr 11 '17 at 02:53

Chris Martin

This requires you to know the string you're looking for, it doesn't find whatever string is between the two substrings, as the OP requested. The OP wants to be able to get the middle no matter what it is, and this answer would require you to know the middle before you start. – Korzak May 09 '19 at 20:22
