518

Let's say I have a string 'gfgfdAAA1234ZZZuijjk' and I want to extract just the '1234' part.

I only know the few characters directly before (AAA) and directly after (ZZZ) the part I am interested in (1234).

With sed it is possible to do something like this with a string:

echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"

And this will give me 1234 as a result.

How can I do the same thing in Python?

Aran-Fey
  • 39,665
  • 11
  • 104
  • 149
ria
  • 7,198
  • 6
  • 29
  • 35

23 Answers

861

Using regular expressions - documentation for further reference

import re

text = 'gfgfdAAA1234ZZZuijjk'

m = re.search('AAA(.+?)ZZZ', text)
if m:
    found = m.group(1)

# found: 1234

or:

import re

text = 'gfgfdAAA1234ZZZuijjk'

try:
    found = re.search('AAA(.+?)ZZZ', text).group(1)
except AttributeError:
    # AAA, ZZZ not found in the original string
    found = '' # apply your error handling

# found: 1234
CDMP
  • 310
  • 4
  • 10
eumiro
  • 207,213
  • 34
  • 299
  • 261
  • 37
    The second solution is better if the pattern matches most of the time, because it's [Easier to ask for forgiveness than permission](http://docs.python.org/3/glossary.html#term-eafp). – Bengt Jan 14 '13 at 16:11
  • 8
    Doesn't the indexing start at 0? So you would need to use group(0) instead of group(1)? – Alexander Nov 08 '15 at 22:16
  • 29
    @Alexander, no, group(0) will return full matched string: AAA1234ZZZ, and group(1) will return only characters matched by first group: 1234 – Yurii K Nov 12 '15 at 13:46
  • 2
    @Bengt: Why is that? The first solution looks quite simple to me, and it has fewer lines of code. – HelloGoodbye Jul 07 '16 at 13:21
  • Why use `.+?` and not `.*`? Aren't they rather equivalent? – HelloGoodbye Jul 07 '16 at 13:27
  • @HelloGoodbye The link I gave hints it: "clean and fast". Basically, it makes exception handling obvious and saves one conditional jump. – Bengt Jul 07 '16 at 22:50
  • @HelloGoodbye `.+` **must** match something, e.g. the numbers in this example. `.*` **can** match something, but does not have to. E.g. it would match *gfgfdAAAZZZuijjk* too. – whirlwin Apr 27 '17 at 06:41
  • @whirlwin, yes, but I wrote `.+?`, not `.+`. Since the `.+` is followed by a `?`, doesn't that mean that `.+` is optional? I.e. the `.` can occur any number of times, including zero (which is why I think `.+?` should be equivalent to `.*`)? Otherwise, what role does the `?` play in this expression? – HelloGoodbye Jun 11 '17 at 12:52
  • 9
    In this expression the ? modifies the + to be non-greedy, ie. it will match any number of times from 1 upwards but as few as possible, only expanding as necessary. without the ?, the first group would match gfgfAAA2ZZZkeAAA43ZZZonife as 2ZZZkeAAA43, but with the ? it would only match the 2, then searching for multiple (or having it stripped out and search again) would match the 43. – Heather Jul 19 '17 at 08:31
  • @Bengt: If you go into the exception often, then the `if` would have been faster. Basically, if you think that the condition will be met most of the time, go with exception handling, but if you think it will not be met most of the time, go with `if`. While "try-catch" is considered pythonic by some, in general software engineering it is actually considered to be not great, because it makes functions non-pure. – Make42 Jul 02 '21 at 08:23
  • @eumiro thanks for your answer. If the interested part is sure to be decimals then one can use following: (\d+) instead of (.+?). It is working for me. – Vivek Agrawal Dec 08 '21 at 13:19
  • what is the approach if there are multiple matches in string something like 'gfgfdAAA1234ZZZuijjkAAA2672ZZZasdcdAAA862ZZZasxawcwcw'? I want to extract every in-between string – Shubhank Gupta Mar 10 '22 at 18:18
  • @ShubhankGupta In that case, you can use `re.findall` to find all pattern appearing in that string. Check [docs](https://docs.python.org/3/library/re.html#:~:text=an%20empty%20string.-,re.findall,-(pattern%2C) for more details – JonnyJack Jul 07 '22 at 04:05
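The greedy versus non-greedy distinction raised in the comments, and the `re.findall` suggestion for multiple matches, can be illustrated with a short sketch. The sample string with repeated markers is the one from the comment above; the variable names are only illustrative.

```python
import re

text = 'gfgfdAAA1234ZZZuijjkAAA2672ZZZasdcdAAA862ZZZasxawcwcw'

# Greedy: .+ runs to the LAST 'ZZZ', so the group swallows the inner markers.
greedy = re.search(r'AAA(.+)ZZZ', text).group(1)

# Non-greedy: .+? stops at the FIRST 'ZZZ' that follows 'AAA'.
lazy = re.search(r'AAA(.+?)ZZZ', text).group(1)

# findall returns the captured group of every non-overlapping match.
all_parts = re.findall(r'AAA(.+?)ZZZ', text)

print(greedy)     # 1234ZZZuijjkAAA2672ZZZasdcdAAA862
print(lazy)       # 1234
print(all_parts)  # ['1234', '2672', '862']
```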
159
>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> start = s.find('AAA') + 3
>>> end = s.find('ZZZ', start)
>>> s[start:end]
'1234'

Then you can use regexps with the re module as well, if you want, but that's not necessary in your case.

Lennart Regebro
  • 167,292
  • 41
  • 224
  • 251
  • 11
    The question seems to imply that the input text will always contain both "AAA" and "ZZZ". If this is not the case, your answer fails horribly (by that I mean it returns something completely wrong instead of an empty string or throwing an exception; think "hello there" as input string). – tzot Feb 06 '11 at 23:46
  • @user225312 Is the `re` method not faster though? – confused00 Jul 21 '16 at 09:25
  • 2
    Voteup, but I would use "x = 'AAA' ; s.find(x) + len(x)" instead of "s.find('AAA') + 3" for maintainability. – Alex Jun 21 '17 at 08:47
  • 1
    If any of the tokens can't be found in the `s`, `s.find` will return `-1`. the slicing operator `s[begin:end]` will accept it as valid index, and return undesired substring. – ribamar Aug 28 '17 at 15:44
  • @confused00 find is much faster than re https://stackoverflow.com/questions/4901523/whats-a-faster-operation-re-match-search-or-str-find – Claudiu Creanga May 03 '20 at 19:30
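As the comments note, `str.find` returns `-1` for a missing marker, and slicing silently accepts that value. A minimal sketch of a guarded version follows; the function name `between` and the choice to return `None` on failure are assumptions made for illustration, not part of the answer.

```python
def between(s, left, right):
    """Return the text between the first `left` and the next `right`, or None."""
    start = s.find(left)
    if start == -1:
        return None          # left marker missing
    start += len(left)       # skip past the marker itself, whatever its length
    end = s.find(right, start)
    if end == -1:
        return None          # no right marker after the left one
    return s[start:end]

print(between('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))  # 1234
print(between('hello there', 'AAA', 'ZZZ'))           # None
```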
128

regular expression

import re

re.search(r"(?<=AAA).*?(?=ZZZ)", your_text).group(0)

The above, as-is, fails with an AttributeError if "AAA" and "ZZZ" do not both occur (in that order) in your_text.

string methods

your_text.partition("AAA")[2].partition("ZZZ")[0]

The above returns an empty string if "AAA" doesn't exist in your_text. If "AAA" is present but "ZZZ" is not, it returns everything after "AAA".

PS Python Challenge?

tzot
  • 92,761
  • 29
  • 141
  • 204
  • 8
    This answer probably deserves more up votes. The string method is the most robust way. It does not need a try/except. – ChaimG Dec 03 '15 at 02:59
  • ... nice, though limited. partition is not regex based, so it only works in this instance because the search string was bounded by fixed literals – GreenAsJade Feb 29 '16 at 02:07
  • Great, many thanks! - this works for strings and does not require regex – Alex Jun 08 '18 at 11:53
  • Upvoting for the string method, there is no need for regex in something this simple, most languages have a library function for this – Harry Jones Mar 15 '22 at 16:09
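One caveat with chained `partition` calls is that the result alone does not say whether a marker was missing: a missing left marker gives an empty string, and a missing right marker gives the whole tail. `str.partition` also returns the separator it found, which is empty when nothing matched, so that can be checked. A minimal sketch, with the hypothetical helper name `between_partition` and `None` as an assumed "not found" value:

```python
def between_partition(text, left, right):
    """Return the text between the markers, or None if either marker is missing."""
    _, sep_left, rest = text.partition(left)
    if not sep_left:          # separator comes back empty when `left` is absent
        return None
    middle, sep_right, _ = rest.partition(right)
    if not sep_right:         # same check for `right`
        return None
    return middle

print(between_partition('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))    # 1234
print(repr(between_partition('gfgfdAAAZZZuijjk', 'AAA', 'ZZZ')))  # ''
print(between_partition('gfgfdAAA1234', 'AAA', 'ZZZ'))            # None
```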
43

Surprised that nobody has mentioned this which is my quick version for one-off scripts:

>>> x = 'gfgfdAAA1234ZZZuijjk'
>>> x.split('AAA')[1].split('ZZZ')[0]
'1234'
Uncle Long Hair
  • 2,719
  • 3
  • 23
  • 33
  • 1
    @user1810100 mentioned essentially that almost exactly 5 years to the day before you posted this... – John Mar 12 '19 at 18:50
  • 1
    Adding an `if s.find("ZZZ") > s.find("AAA"):` to it avoids issues if `'ZZZ'` isn't in the string, which would otherwise return `'1234uijjk'` – Rolf of Saxony Nov 14 '20 at 20:42
  • @tzot's answer (https://stackoverflow.com/a/4917004/358532) with `partition` instead of `split` seems more robust (depending on your needs), as it returns an empty string if one of the substrings isn't found. – Yann Dìnendal Sep 16 '21 at 15:33
25

You can do it using just one line of code:

>>> import re
>>> re.findall(r'\d{1,5}', 'gfgfdAAA1234ZZZuijjk')
['1234']

The result is a list. Note that this matches runs of up to five digits rather than looking for the AAA and ZZZ markers.

Mahesh Gupta
  • 1,882
  • 12
  • 16
18
import re
print(re.search('AAA(.*?)ZZZ', 'gfgfdAAA1234ZZZuijjk').group(1))
infrared
  • 3,566
  • 2
  • 25
  • 37
  • 4
    `AttributeError: 'NoneType' object has no attribute 'group'` - if there is no AAA, ZZZ in the string... – eumiro Jan 12 '11 at 09:20
12

You can use re module for that:

>>> import re
>>> re.compile(".*AAA(.*)ZZZ.*").match("gfgfdAAA1234ZZZuijjk").groups()
('1234',)
andreypopp
  • 6,887
  • 5
  • 26
  • 26
11

In Python, a substring can be extracted from a string using the findall method of the regular expression (re) module.

>>> import re
>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> ss = re.findall('AAA(.+)ZZZ', s)
>>> print(ss)
['1234']
rashok
  • 12,790
  • 16
  • 88
  • 100
7
text = 'I want to find a string between two substrings'
left = 'find a '
right = 'between two'

print(text[text.index(left)+len(left):text.index(right)])

Gives

string
Fernando Wittmann
  • 1,991
  • 20
  • 16
  • If the text does not include the markers, this throws a `ValueError: substring not found` exception. That is good. – plpsanchez Apr 19 '22 at 03:46
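A small variation on the `index` approach is sketched below. It starts the search for the right marker only after the left one, so an earlier occurrence of the right marker cannot produce a wrong slice, and it shows the `ValueError` being handled. The helper name `between_index` is hypothetical, and the right marker here includes a leading space so the result has no trailing space.

```python
def between_index(text, left, right):
    """Return the text between the markers; raise ValueError if one is missing."""
    start = text.index(left) + len(left)   # ValueError if `left` is absent
    end = text.index(right, start)         # only look after the left marker
    return text[start:end]

text = 'I want to find a string between two substrings'
print(between_index(text, 'find a ', ' between two'))  # string

try:
    between_index(text, 'missing', ' between two')
except ValueError:
    print('marker not found')
```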
6
>>> s = '/tmp/10508.constantstring'
>>> s.split('/tmp/')[1].split('constantstring')[0].strip('.')
'10508'
Ashwini Chaudhary
  • 244,495
  • 58
  • 464
  • 504
user1810100
  • 63
  • 1
  • 4
6

With sed it is possible to do something like this with a string:

echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"

And this will give me 1234 as a result.

You could do the same with the re.sub function, using the same regex.

>>> import re
>>> re.sub(r'.*AAA(.*)ZZZ.*', r'\1', 'gfgfdAAA1234ZZZuijjk')
'1234'

In basic sed, a capturing group is written as \(..\), but in Python it is written as (..).
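One behaviour worth keeping in mind: like the sed command, `re.sub` leaves the input unchanged when the pattern does not match, so a failed extraction is not signalled in any way. The sketch below shows that, plus one possible alternative using `re.fullmatch`; the fallback value `''` is an assumption chosen for illustration.

```python
import re

pattern = r'.*AAA(.*)ZZZ.*'

# Matching input: the whole string is replaced by the captured group.
print(re.sub(pattern, r'\1', 'gfgfdAAA1234ZZZuijjk'))  # 1234

# Non-matching input: nothing is substituted, the original string comes back.
print(re.sub(pattern, r'\1', 'no markers here'))       # no markers here

# fullmatch returns None on failure, which makes the no-match case explicit.
m = re.fullmatch(pattern, 'no markers here')
result = m.group(1) if m else ''
print(repr(result))  # ''
```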

Avinash Raj
  • 172,303
  • 28
  • 230
  • 274
5

You can find first substring with this function in your code (by character index). Also, you can find what is after a substring.

def FindSubString(strText, strSubString, Offset=None):
    try:
        Start = strText.find(strSubString)
        if Start == -1:
            return -1 # Not Found
        else:
            if Offset is None:
                Result = strText[Start+len(strSubString):]
            elif Offset == 0:
                return Start
            else:
                AfterSubString = Start+len(strSubString)
                Result = strText[AfterSubString:AfterSubString + int(Offset)]
            return Result
    except:
        return -1

# Example:

Text = "Thanks for contributing an answer to Stack Overflow!"
subText = "to"

print("Start of first substring in a text:")
start = FindSubString(Text, subText, 0)
print(start); print("")

print("Exact substring in a text:")
print(Text[start:start+len(subText)]); print("")

print("What is after substring \"%s\"?" %(subText))
print(FindSubString(Text, subText))

# Your answer:

Text = "gfgfdAAA1234ZZZuijjk"
subText1 = "AAA"
subText2 = "ZZZ"

AfterText1 = FindSubString(Text, subText1, 0) + len(subText1)
BeforeText2 = FindSubString(Text, subText2, 0)

print("\nYour answer:\n%s" %(Text[AfterText1:BeforeText2]))
5

Using PyParsing

import pyparsing as pp

word = pp.Word(pp.alphanums)

s = 'gfgfdAAA1234ZZZuijjk'
rule = pp.nestedExpr('AAA', 'ZZZ')
for match in rule.searchString(s):
    print(match)

which yields:

[['1234']]

Raphael
  • 959
  • 7
  • 21
5

One-liner for Python 3.8+ (it uses an assignment expression), if text is guaranteed to contain both substrings:

text[text.find(start:='AAA')+len(start):text.find('ZZZ')]
cookiemonster
  • 1,315
  • 12
  • 19
4

Just in case somebody has to do the same thing that I did: I had to extract everything inside parentheses in a line. For example, if I have a line like 'US president (Barack Obama) met with ...' and I want to get only 'Barack Obama', this is the solution:

import re

regex = r'.*\((.*?)\).*'
matches = re.search(regex, line)
line = matches.group(1) + '\n'

I.e. you need to escape the parentheses with a backslash (\). Though this is more a problem about regular expressions than about Python.

Also, in some cases you may see an 'r' prefix before the regex definition. If there is no r prefix, you need to use escape characters like in C. Here is more discussion on that.
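The point about escaping can be sketched as follows, using the example line from this answer. The first search writes the escapes by hand in a raw string; the second lets `re.escape` build them, which helps when the delimiters are not known in advance. Variable names are illustrative only.

```python
import re

line = 'US president (Barack Obama) met with ...'

# Raw string: the backslashes reach the regex engine unchanged,
# so \( and \) match literal parentheses.
m = re.search(r'\((.*?)\)', line)
print(m.group(1))  # Barack Obama

# re.escape produces the escaped form of arbitrary literal delimiters.
left, right = re.escape('('), re.escape(')')
m2 = re.search(left + r'(.*?)' + right, line)
print(m2.group(1))  # Barack Obama
```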

Community
  • 1
  • 1
Denis Kutlubaev
  • 15,320
  • 6
  • 84
  • 70
1

Also, you can find all combinations with the function below:

s = 'Part 1. Part 2. Part 3 then more text'
def find_all_places(text, word):
    """Return the start index of every non-overlapping occurrence of word."""
    word_places = []
    i = 0
    while True:
        word_place = text.find(word, i)
        if word_place < 0:
            break
        word_places.append(word_place)
        # Continue searching just past this occurrence.
        i = word_place + len(word)
    return word_places

def find_all_combination(text, start, end):
    """Return text[s:e] for every start index s that precedes an end index e."""
    start_places = find_all_places(text, start)
    end_places = find_all_places(text, end)
    combination_list = []
    for start_place in start_places:
        for end_place in end_places:
            if start_place >= end_place:
                continue
            combination_list.append(text[start_place:end_place])
    return combination_list

print(find_all_combination(s, "Part", "Part"))

result:

['Part 1. ', 'Part 1. Part 2. ', 'Part 2. ']
yunus
  • 33
  • 1
  • 9
1

In case you want to look for multiple occurrences:

content ="Prefix_helloworld_Suffix_stuff_Prefix_42_Suffix_andsoon"
strings = []
for c in content.split('Prefix_'):
    spos = c.find('_Suffix')
    if spos!=-1:
        strings.append( c[:spos])
print( strings )

Or, more concisely:

strings = [ c[:c.find('_Suffix')] for c in content.split('Prefix_') if c.find('_Suffix')!=-1 ]
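The same idea can also be written as a generator that walks the string with `str.find`, which avoids building the intermediate list from `split`. This is only a sketch; the name `iter_between` is hypothetical and the sample string is the one used above.

```python
def iter_between(content, prefix, suffix):
    """Yield each piece of text found between `prefix` and the next `suffix`."""
    pos = 0
    while True:
        start = content.find(prefix, pos)
        if start == -1:
            return                      # no further prefix
        start += len(prefix)
        end = content.find(suffix, start)
        if end == -1:
            return                      # prefix without a closing suffix
        yield content[start:end]
        pos = end + len(suffix)         # resume after this suffix

content = "Prefix_helloworld_Suffix_stuff_Prefix_42_Suffix_andsoon"
print(list(iter_between(content, 'Prefix_', '_Suffix')))  # ['helloworld', '42']
```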
Adrien Mau
  • 181
  • 5
0

Here's a solution without regex that also accounts for scenarios where the first substring contains the second substring. This function will only find a substring if the second marker is after the first marker.

def find_substring(string, start, end):
    len_until_end_of_first_match = string.find(start) + len(start)
    after_start = string[len_until_end_of_first_match:]
    return string[string.find(start) + len(start):len_until_end_of_first_match + after_start.find(end)]
Foobar
  • 7,458
  • 16
  • 81
  • 161
0

Another way of doing it is using lists (supposing the substring you are looking for consists only of digits):

string = 'gfgfdAAA1234ZZZuijjk'
numbersList = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
output = []

for char in string:
    if char in numbersList: output.append(char)

print(f"output: {''.join(output)}")
### output: 1234
Julio S.
  • 944
  • 1
  • 12
  • 26
0

Typescript. Gets string in between two other strings.

Searches shortest string between prefixes and postfixes

prefixes - string / array of strings / null (means search from the start).

postfixes - string / array of strings / null (means search until the end).

public getStringInBetween(str: string, prefixes: string | string[] | null,
                          postfixes: string | string[] | null): string {

    if (typeof prefixes === 'string') {
        prefixes = [prefixes];
    }

    if (typeof postfixes === 'string') {
        postfixes = [postfixes];
    }

    if (!str || str.length < 1) {
        throw new Error(str + ' should contain ' + prefixes);
    }

    let start = prefixes === null ? { pos: 0, sub: '' } : this.indexOf(str, prefixes);
    const end = postfixes === null ? { pos: str.length, sub: '' } : this.indexOf(str, postfixes, start.pos + start.sub.length);

    let value = str.substring(start.pos + start.sub.length, end.pos);
    if (!value || value.length < 1) {
        throw new Error(str + ' should contain string in between ' + prefixes + ' and ' + postfixes);
    }

    while (true) {
        try {
            start = this.indexOf(value, prefixes);
        } catch (e) {
            break;
        }
        value = value.substring(start.pos + start.sub.length);
        if (!value || value.length < 1) {
            throw new Error(str + ' should contain string in between ' + prefixes + ' and ' + postfixes);
        }
    }

    return value;
}
Sergey Gurin
  • 1,537
  • 15
  • 14
0

A simple approach could be the following:

string_to_search_in = 'could be anything'
start = string_to_search_in.find(str("sub string u want to identify"))
length = len("sub string u want to identify")
First_part_removed = string_to_search_in[start:]
end_coord = length
Extracted_substring=First_part_removed[:end_coord]
Anonymous
  • 61
  • 5
0

If you want to check whether the substrings exist and return an empty string if they don't:

def substr_between(str_all, first_string, last_string):
    pos1 = str_all.find(first_string)
    if pos1 < 0:
        return ""
    pos1 += len(first_string)
    pos2 = str_all[pos1:].find(last_string)
    if pos2 < 0:
        return ""
    return str_all[pos1:pos1 + pos2]
Feng Jiang
  • 1,776
  • 19
  • 25
-1

One-liners that return another string if there was no match. Edit: the improved version uses the next function; replace "not-found" with something else if needed:

import re
res = next( (m.group(1) for m in [re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk" ),] if m), "not-found" )

My other method, which is less optimal, runs a regex a second time; I still haven't found a shorter way:

import re
res = ( ( re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk") or re.search("()","") ).group(1) )
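On Python 3.8 or later, an assignment expression gives another compact form. This is a sketch that assumes the same "not-found" fallback as above; the second call uses a made-up input without markers to show the fallback path.

```python
import re

text = "gfgfdAAA1234ZZZuijjk"

# The condition is evaluated first, so `m` is bound before m.group(1) runs.
res = m.group(1) if (m := re.search(r"AAA(.*?)ZZZ", text)) else "not-found"
print(res)  # 1234

missing = m.group(1) if (m := re.search(r"AAA(.*?)ZZZ", "no markers")) else "not-found"
print(missing)  # not-found
```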
MaxLZ
  • 89
  • 1
  • 4