How to extract the substring between two markers?

Question

Let's say I have a string 'gfgfdAAA1234ZZZuijjk' and I want to extract just the '1234' part.

I only know the few characters directly before (AAA) and directly after (ZZZ) the part I am interested in (1234).

With sed it is possible to do something like this with a string:

echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"

And this will give me 1234 as a result.

How to do the same thing in Python?

one liner with python 3.8 `text[text.find(start:='AAA')+len(start):text.find('ZZZ')]` — cookiemonster, Jun 18 '21 at 19:19

score 861 · Accepted Answer · edited Oct 08 '13 at 15:50

Using regular expressions - documentation for further reference

import re

text = 'gfgfdAAA1234ZZZuijjk'

m = re.search('AAA(.+?)ZZZ', text)
if m:
    found = m.group(1)

# found: 1234

or:

import re

text = 'gfgfdAAA1234ZZZuijjk'

try:
    found = re.search('AAA(.+?)ZZZ', text).group(1)
except AttributeError:
    # AAA, ZZZ not found in the original string
    found = '' # apply your error handling

# found: 1234

edited Oct 08 '13 at 15:50

CDMP

answered Jan 12 '11 at 09:18

eumiro

37

The second solution is better if the pattern matches most of the time, because it's [Easier to ask for forgiveness than permission](http://docs.python.org/3/glossary.html#term-eafp). – Bengt Jan 14 '13 at 16:11
8

Doesn't the indexing start at 0? So you would need to use group(0) instead of group(1)? – Alexander Nov 08 '15 at 22:16
29

@Alexander, no, group(0) will return full matched string: AAA1234ZZZ, and group(1) will return only characters matched by first group: 1234 – Yurii K Nov 12 '15 at 13:46
2

@Bengt: Why is that? The first solution looks quite simple to me, and it has fewer lines of code. – HelloGoodbye Jul 07 '16 at 13:21
Why use `.+?` and not `.*`? Aren't they rather equivalent? – HelloGoodbye Jul 07 '16 at 13:27
@HelloGoodbye The link I gave hints it: "clean and fast". Basically, it makes exception handling obvious and saves one conditional jump. – Bengt Jul 07 '16 at 22:50
@HelloGoodbye `.+` **must** match something, e.g. the numbers in this example. `.*` **can** match something, but does not have to. E.g. it would match *gfgfdAAAZZZuijjk* too. – whirlwin Apr 27 '17 at 06:41
@whirlwin, yes, but I wrote `.+?`, not `.+`. Since the `.+` is followed by a `?`, doesn't that mean that `.+` is optional? I.e. the `.` can occur any number of times, including zero (which is why I think `.+?` should be equivalent to `.*`)? Otherwise, what role does the `?` play in this expression? – HelloGoodbye Jun 11 '17 at 12:52
9

In this expression the ? modifies the + to be non-greedy, ie. it will match any number of times from 1 upwards but as few as possible, only expanding as necessary. without the ?, the first group would match gfgfAAA2ZZZkeAAA43ZZZonife as 2ZZZkeAAA43, but with the ? it would only match the 2, then searching for multiple (or having it stripped out and search again) would match the 43. – Heather Jul 19 '17 at 08:31
@Bengt: If you go into the exception often, then the `if` would have been faster. Basically, if you think that the condition will be met most of the time, go with exception handling, but if you think it will not be met most of the time, go with `if`. While "try-catch" is considered pythonic by some, in general software engineering it is actually considered to be not great, because it makes functions non-pure. – Make42 Jul 02 '21 at 08:23
@eumiro thanks for your answer. If the interested part is sure to be decimals then one can use following: (\d+) instead of (.+?). It is working for me. – Vivek Agrawal Dec 08 '21 at 13:19
What is the approach if there are multiple matches in the string, something like 'gfgfdAAA1234ZZZuijjkAAA2672ZZZasdcdAAA862ZZZasxawcwcw'? I want to extract every in-between string. – Shubhank Gupta Mar 10 '22 at 18:18
@ShubhankGupta In that case, you can use `re.findall` to find every occurrence of the pattern in that string. Check [docs](https://docs.python.org/3/library/re.html#:~:text=an%20empty%20string.-,re.findall,-(pattern%2C) for more details – JonnyJack Jul 07 '22 at 04:05
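
To make the greedy versus non-greedy discussion and the `re.findall` suggestion in the comments above concrete, here is a minimal sketch. The sample strings are taken from the comments; the variable names are only illustrative.

```python
import re

# Sample string from the comments: two marker pairs in one string.
sample = 'gfgfAAA2ZZZkeAAA43ZZZonife'

# Greedy: .+ runs to the LAST 'ZZZ', swallowing the markers in between.
greedy = re.search(r'AAA(.+)ZZZ', sample).group(1)

# Non-greedy: .+? stops at the FIRST 'ZZZ' after 'AAA'.
lazy = re.search(r'AAA(.+?)ZZZ', sample).group(1)

# findall returns the captured group for every non-overlapping match.
text = 'gfgfdAAA1234ZZZuijjkAAA2672ZZZasdcdAAA862ZZZasxawcwcw'
all_matches = re.findall(r'AAA(.+?)ZZZ', text)

print(greedy)       # 2ZZZkeAAA43
print(lazy)         # 2
print(all_matches)  # ['1234', '2672', '862']
```

If the markers could contain regex metacharacters, passing them through `re.escape` first would be a sensible precaution.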

score 159 · Answer 2 · answered Jan 12 '11 at 09:17

>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> start = s.find('AAA') + 3
>>> end = s.find('ZZZ', start)
>>> s[start:end]
'1234'

Then you can use regexps with the re module as well, if you want, but that's not necessary in your case.

answered Jan 12 '11 at 09:17

Lennart Regebro

11

The question seems to imply that the input text will always contain both "AAA" and "ZZZ". If this is not the case, your answer fails horribly (by that I mean it returns something completely wrong instead of an empty string or throwing an exception; think "hello there" as input string). – tzot Feb 06 '11 at 23:46
@user225312 Is the `re` method not faster though? – confused00 Jul 21 '16 at 09:25
2

Voteup, but I would use "x = 'AAA' ; s.find(x) + len(x)" instead of "s.find('AAA') + 3" for maintainability. – Alex Jun 21 '17 at 08:47
1

If any of the tokens can't be found in the `s`, `s.find` will return `-1`. the slicing operator `s[begin:end]` will accept it as valid index, and return undesired substring. – ribamar Aug 28 '17 at 15:44
@confused00 find is much faster than re https://stackoverflow.com/questions/4901523/whats-a-faster-operation-re-match-search-or-str-find – Claudiu Creanga May 03 '20 at 19:30
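
The failure mode described in the comments above can be sketched as follows. `between_find` is a hypothetical helper name, not part of the answer, and returning `None` for a missing marker is just one possible convention.

```python
def between_find(s, start_marker, end_marker):
    """Return the text between the markers, or None if either is missing."""
    start = s.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)       # skip past the start marker itself
    end = s.find(end_marker, start)  # only look after the start marker
    if end == -1:
        return None
    return s[start:end]


# The unchecked slicing from the answer, applied to a string without markers:
s = 'hello there'
start = s.find('AAA') + 3            # -1 + 3 == 2
end = s.find('ZZZ', start)           # -1
unchecked = s[start:end]

print(repr(unchecked))                                     # 'llo ther'
print(between_find(s, 'AAA', 'ZZZ'))                       # None
print(between_find('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))  # 1234
```

Using `len(start_marker)` instead of a hard-coded `3` follows the maintainability suggestion made in the comments.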

score 128 · Answer 3 · answered Feb 06 '11 at 23:43

regular expression

import re

re.search(r"(?<=AAA).*?(?=ZZZ)", your_text).group(0)

The above as-is will fail with an AttributeError if there are no "AAA" and "ZZZ" in your_text.

string methods

your_text.partition("AAA")[2].partition("ZZZ")[0]

The above will return an empty string if "AAA" doesn't exist in your_text. If "AAA" exists but "ZZZ" doesn't, it returns everything after "AAA".

PS Python Challenge?

answered Feb 06 '11 at 23:43

tzot

8

This answer probably deserves more up votes. The string method is the most robust way. It does not need a try/except. – ChaimG Dec 03 '15 at 02:59
... nice, though limited. partition is not regex based, so it only works in this instance because the search string was bounded by fixed literals – GreenAsJade Feb 29 '16 at 02:07
Great, many thanks! - this works for strings and does not require regex – Alex Jun 08 '18 at 11:53
Upvoting for the string method, there is no need for regex in something this simple, most languages have a library function for this – Harry Jones Mar 15 '22 at 16:09
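
One detail worth checking with the `partition` chain: `str.partition` returns the separator it found as the middle element of its result, or an empty string when the separator is absent. The sketch below uses that to tell a missing marker apart from an empty match. `between_partition` is a hypothetical helper name, and returning `None` is an assumed convention.

```python
def between_partition(text, start_marker, end_marker):
    """Return the text between the markers, or None if either is missing."""
    _, start_sep, rest = text.partition(start_marker)
    if not start_sep:    # separator element is '' when the marker is absent
        return None
    middle, end_sep, _ = rest.partition(end_marker)
    if not end_sep:
        return None
    return middle


# Plain chained form on a string that has 'AAA' but no 'ZZZ':
plain = 'gfgfdAAA1234uijjk'.partition('AAA')[2].partition('ZZZ')[0]

print(repr(plain))                                              # '1234uijjk'
print(between_partition('gfgfdAAA1234uijjk', 'AAA', 'ZZZ'))     # None
print(between_partition('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))  # 1234
```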

score 43 · Answer 4 · answered Feb 09 '19 at 16:57

Surprised that nobody has mentioned this which is my quick version for one-off scripts:

>>> x = 'gfgfdAAA1234ZZZuijjk'
>>> x.split('AAA')[1].split('ZZZ')[0]
'1234'

answered Feb 09 '19 at 16:57

Uncle Long Hair

1

@user1810100 mentioned essentially that almost exactly 5 years to the day before you posted this... – John Mar 12 '19 at 18:50
1

Adding an `if s.find("ZZZ") > s.find("AAA"):` to it avoids issues if `'ZZZ'` isn't in the string, which would return `'1234uijjk'` – Rolf of Saxony Nov 14 '20 at 20:42
@tzot's answer (https://stackoverflow.com/a/4917004/358532) with `partition` instead of `split` seems more robust (depending on your needs), as it returns an empty string if one of the substrings isn't found. – Yann Dìnendal Sep 16 '21 at 15:33
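
As a rough illustration of how the `split` chain behaves when a marker is missing, here is a small sketch. The `maxsplit=1` argument is an addition not present in the answer; it stops splitting after the first occurrence of each marker.

```python
x = 'gfgfdAAA1234ZZZuijjk'
found = x.split('AAA', 1)[1].split('ZZZ', 1)[0]

# Start marker missing: split returns a one-element list, so [1] fails.
try:
    'gfgfd1234ZZZuijjk'.split('AAA', 1)[1]
    start_missing = False
except IndexError:
    start_missing = True

# End marker missing: everything after the start marker comes back.
no_end = 'gfgfdAAA1234uijjk'.split('AAA', 1)[1].split('ZZZ', 1)[0]

print(found)          # 1234
print(start_missing)  # True
print(no_end)         # 1234uijjk
```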

score 25 · Answer 5 · answered Jan 11 '18 at 11:39

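You can do it using just one line of code:

>>> import re
>>> re.findall(r'\d{1,5}', 'gfgfdAAA1234ZZZuijjk')
['1234']

The result is a list. Note that this pattern matches runs of one to five digits anywhere in the string and does not use the AAA/ZZZ markers.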

answered Jan 11 '18 at 11:39

Mahesh Gupta

score 18 · Answer 6 · answered Jan 12 '11 at 09:18

import re
print(re.search('AAA(.*?)ZZZ', 'gfgfdAAA1234ZZZuijjk').group(1))

answered Jan 12 '11 at 09:18

infrared

4

`AttributeError: 'NoneType' object has no attribute 'groups'` - if there is no AAA, ZZZ in the string... – eumiro Jan 12 '11 at 09:20

score 12 · Answer 7 · answered Jan 12 '11 at 09:19

You can use re module for that:

>>> import re
>>> re.compile(".*AAA(.*)ZZZ.*").match("gfgfdAAA1234ZZZuijjk").groups()
('1234',)

answered Jan 12 '11 at 09:19

andreypopp

score 11 · Answer 8 · answered Mar 14 '18 at 09:11

In Python, extracting a substring from a string can be done using the findall function in the regular expression (re) module.

>>> import re
>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> ss = re.findall('AAA(.+)ZZZ', s)
>>> print(ss)
['1234']

answered Mar 14 '18 at 09:11

rashok

score 7 · Answer 9 · answered Mar 04 '19 at 01:31

text = 'I want to find a string between two substrings'
left = 'find a '
right = 'between two'

print(text[text.index(left)+len(left):text.index(right)])

Gives

string

answered Mar 04 '19 at 01:31

Fernando Wittmann

If the text does not include the markers, this throws a ValueError: substring not found exception. That is good. – plpsanchez Apr 19 '22 at 03:46
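
Building on the comment above, a minimal sketch of the `index`-based approach wrapped in a function. `between_index` is a hypothetical name; unlike the one-liner in the answer, it searches for the right marker only after the left one, which is an added assumption about the desired behaviour.

```python
def between_index(text, left, right):
    """Return the text between the markers; raise ValueError if one is missing."""
    start = text.index(left) + len(left)
    end = text.index(right, start)   # look for the right marker after the left one
    return text[start:end]


text = 'I want to find a string between two substrings'
result = between_index(text, 'find a ', 'between two')
print(repr(result))  # 'string '

try:
    between_index('no markers here', 'AAA', 'ZZZ')
except ValueError as exc:
    print('markers not found:', exc)
```

Note that the result keeps the trailing space, because the right marker starts at `between`, not at the space before it.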

score 6 · Answer 10 · edited Feb 11 '14 at 09:23

>>> s = '/tmp/10508.constantstring'
>>> s.split('/tmp/')[1].split('constantstring')[0].strip('.')
'10508'

edited Feb 11 '14 at 09:23

Ashwini Chaudhary

answered Feb 08 '14 at 00:12

user1810100

score 6 · Answer 11 · answered Jan 31 '15 at 08:29

With sed it is possible to do something like this with a string:

echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"

And this will give me 1234 as a result.

You could do the same with the re.sub function, using the same regex.

>>> import re
>>> re.sub(r'.*AAA(.*)ZZZ.*', r'\1', 'gfgfdAAA1234ZZZuijjk')
'1234'

In basic sed, a capturing group is written as \(..\), but in Python it is written as (..).
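
One caveat with the `re.sub` approach, sketched below: when the pattern does not match, `re.sub` returns the input unchanged rather than signalling a failure, so it may be worth comparing the result against the input or testing for a match first. The variable names are illustrative only.

```python
import re

pattern = r'.*AAA(.*)ZZZ.*'

matched = re.sub(pattern, r'\1', 'gfgfdAAA1234ZZZuijjk')
unmatched = re.sub(pattern, r'\1', 'hello there')

print(matched)    # 1234
print(unmatched)  # hello there
```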

Saeed Zahedian Abroodi · Answer 12 · 2017-10-21T05:38:35.070

You can find the first occurrence of a substring with this function (by character index). You can also find what comes after a substring.

def FindSubString(strText, strSubString, Offset=None):
    try:
        Start = strText.find(strSubString)
        if Start == -1:
            return -1 # Not Found
        else:
            if Offset is None:
                Result = strText[Start+len(strSubString):]
            elif Offset == 0:
                return Start
            else:
                AfterSubString = Start+len(strSubString)
                Result = strText[AfterSubString:AfterSubString + int(Offset)]
            return Result
    except:
        return -1

# Example:

Text = "Thanks for contributing an answer to Stack Overflow!"
subText = "to"

print("Start of first substring in a text:")
start = FindSubString(Text, subText, 0)
print(start); print("")

print("Exact substring in a text:")
print(Text[start:start+len(subText)]); print("")

print("What is after substring \"%s\"?" %(subText))
print(FindSubString(Text, subText))

# Your answer:

Text = "gfgfdAAA1234ZZZuijjk"
subText1 = "AAA"
subText2 = "ZZZ"

AfterText1 = FindSubString(Text, subText1, 0) + len(subText1)
BeforeText2 = FindSubString(Text, subText2, 0)

print("\nYour answer:\n%s" %(Text[AfterText1:BeforeText2]))

score 5 · Answer 13 · answered Jan 08 '20 at 23:03

Using PyParsing

import pyparsing as pp

word = pp.Word(pp.alphanums)

s = 'gfgfdAAA1234ZZZuijjk'
rule = pp.nestedExpr('AAA', 'ZZZ')
for match in rule.searchString(s):
    print(match)

which yields:

[['1234']]

answered Jan 08 '20 at 23:03

Raphael

cookiemonster · Answer 14 · 2022-08-20T11:33:11.070

5

One liner with Python 3.8 if text is guaranteed to contain the substring:

text[text.find(start:='AAA')+len(start):text.find('ZZZ')]

edited Aug 20 '22 at 11:33

answered Jun 18 '21 at 19:20

cookiemonster

1

Does not work if the text does not contain the markers. – plpsanchez Apr 19 '22 at 03:39
Similar solution by fernando-wittmann using text.index throws exception, allowing detection and forgiveness. https://stackoverflow.com/a/54975532/2719980 – plpsanchez Apr 19 '22 at 03:45

score 4 · Answer 15 · edited May 23 '17 at 11:55

Just in case somebody has to do the same thing that I did: I had to extract everything inside parentheses in a line. For example, if I have a line like 'US president (Barack Obama) met with ...' and I want to get only 'Barack Obama', this is the solution:

import re

regex = r'.*\((.*?)\).*'
matches = re.search(regex, line)
line = matches.group(1) + '\n'

I.e. you need to escape the parentheses with a backslash (\). Though this is more a question about regular expressions than about Python.

Also, note the 'r' prefix before the regex definition, which marks a raw string. Without the r prefix, you need to escape backslashes as in C. Here is more discussion on that.
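
When the markers themselves may contain regex metacharacters such as parentheses, `re.escape` can do the escaping for you. A minimal sketch follows; `between_markers` is a hypothetical helper name, and returning `None` on no match is an assumed convention.

```python
import re


def between_markers(text, start_marker, end_marker):
    """Return the first non-greedy match between two literal markers, or None."""
    pattern = re.escape(start_marker) + r'(.*?)' + re.escape(end_marker)
    m = re.search(pattern, text)
    return m.group(1) if m else None


line = 'US president (Barack Obama) met with ...'
print(between_markers(line, '(', ')'))  # Barack Obama
```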

score 1 · Answer 16 · answered Oct 05 '21 at 19:02

You can also find all combinations with the function below:

s = 'Part 1. Part 2. Part 3 then more text'
def find_all_places(text, word):
    word_places = []
    i = 0
    while True:
        word_place = text.find(word, i)
        if word_place < 0:
            break
        word_places.append(word_place)
        # Resume the search just past this match
        i = word_place + len(word)
    return word_places
def find_all_combination(text,start,end):
    start_places = find_all_places(text,start)
    end_places = find_all_places(text,end)
    combination_list = []
    for start_place in start_places:
        for end_place in end_places:
            print(start_place)
            print(end_place)
            if start_place>=end_place:
                continue
            combination_list.append(text[start_place:end_place])
    return combination_list
find_all_combination(s,"Part","Part")

result:

['Part 1. ', 'Part 1. Part 2. ', 'Part 2. ']

score 1 · Answer 17 · answered Aug 02 '22 at 13:28

In case you want to look for multiple occurrences:

content ="Prefix_helloworld_Suffix_stuff_Prefix_42_Suffix_andsoon"
strings = []
for c in content.split('Prefix_'):
    spos = c.find('_Suffix')
    if spos!=-1:
        strings.append( c[:spos])
print( strings )

Or more concisely:

strings = [ c[:c.find('_Suffix')] for c in content.split('Prefix_') if c.find('_Suffix')!=-1 ]

score 0 · Answer 18 · answered Feb 23 '19 at 18:26

Here's a solution without regex that also accounts for scenarios where the first substring contains the second substring. This function will only find a substring if the second marker is after the first marker.

def find_substring(string, start, end):
    len_until_end_of_first_match = string.find(start) + len(start)
    after_start = string[len_until_end_of_first_match:]
    return string[string.find(start) + len(start):len_until_end_of_first_match + after_start.find(end)]

score 0 · Answer 19 · answered Oct 12 '19 at 00:30

Another way of doing it is using lists (supposing the substring you are looking for is made of numbers, only) :

string = 'gfgfdAAA1234ZZZuijjk'
numbersList = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
output = []

for char in string:
    if char in numbersList: output.append(char)

print(f"output: {''.join(output)}")
### output: 1234

score 0 · Answer 20 · answered Sep 04 '20 at 11:16

Typescript. Gets string in between two other strings.

Searches shortest string between prefixes and postfixes

prefixes - string / array of strings / null (means search from the start).

postfixes - string / array of strings / null (means search until the end).

public getStringInBetween(str: string, prefixes: string | string[] | null,
                          postfixes: string | string[] | null): string {

    if (typeof prefixes === 'string') {
        prefixes = [prefixes];
    }

    if (typeof postfixes === 'string') {
        postfixes = [postfixes];
    }

    if (!str || str.length < 1) {
        throw new Error(str + ' should contain ' + prefixes);
    }

    let start = prefixes === null ? { pos: 0, sub: '' } : this.indexOf(str, prefixes);
    const end = postfixes === null ? { pos: str.length, sub: '' } : this.indexOf(str, postfixes, start.pos + start.sub.length);

    let value = str.substring(start.pos + start.sub.length, end.pos);
    if (!value || value.length < 1) {
        throw new Error(str + ' should contain string in between ' + prefixes + ' and ' + postfixes);
    }

    while (true) {
        try {
            start = this.indexOf(value, prefixes);
        } catch (e) {
            break;
        }
        value = value.substring(start.pos + start.sub.length);
        if (!value || value.length < 1) {
            throw new Error(str + ' should contain string in between ' + prefixes + ' and ' + postfixes);
        }
    }

    return value;
}

score 0 · Answer 21 · answered Feb 20 '23 at 15:49

a simple approach could be the following:

string_to_search_in = 'could be anything'
start = string_to_search_in.find(str("sub string u want to identify"))
length = len("sub string u want to identify")
First_part_removed = string_to_search_in[start:]
end_coord = length
Extracted_substring=First_part_removed[:end_coord]

answered Feb 20 '23 at 15:49

Anonymous

could you explain your code, so that it would be more helpful to the readers? – Simas Joneliunas Feb 24 '23 at 14:58

score 0 · Answer 22 · answered May 28 '23 at 23:21

If you want to check whether the substrings exist and return an empty string if they don't:

def substr_between(str_all, first_string, last_string):
    pos1 = str_all.find(first_string)
    if pos1 < 0:
        return ""
    pos1 += len(first_string)
    pos2 = str_all[pos1:].find(last_string)
    if pos2 < 0:
        return ""
    return str_all[pos1:pos1 + pos2]

MaxLZ · Answer 23 · 2018-05-03T18:31:44.397

One-liners that return another string if there is no match. Edit: the improved version uses the next function; replace "not-found" with something else if needed:

import re
res = next( (m.group(1) for m in [re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk" ),] if m), "not-found" )

My other method to do this is less optimal because it uses a regex a second time. I still haven't found a shorter way:

import re
res = ( ( re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk") or re.search("()","") ).group(1) )
