
Assume that we have the following strings:

  1. ABC_ANY_STRING_DEF
  2. ANY_STRING
  3. ANY_STRING_DEF
  4. ABC_CDE_ANY_STRING_DEF

"ABC_" or "CDE_" (or both) may appear as a prefix or be absent. Likewise, "_DEF" may appear as a suffix or be absent.

In this case, can I extract ANY_STRING (any sequence of characters) from between the prefix and the suffix with a single regular expression?

For example, given input = "ABC_CDE_I like an apple_DEF", the output must be "I like an apple".

I tried the following code, but it does not produce the output I expected.

re.compile(r"(?:ABC_|CDE_)*(\S+)(?:_DEF)?")

or

re.compile(r"(?:ABC_|CDE_)*(\S+)(?:_DEF)*")

Thanks a lot in advance for your advice.

  • What is your goal? Please provide an example of both input and output. – Tom Ron Jul 24 '18 at 07:41
  • Is `ANY_STRING` a chunk of two strings that are joined with one `_`? Or can it be just `anyStrIng`? A real life example would help. – Wiktor Stribiżew Jul 24 '18 at 07:43
  • @Wiktor ANY_STRING means just any string like expressed by \S+ – Joontae Kim Jul 24 '18 at 07:45
  • It's too general to call it "any string": any string could include unwanted parts, such as ABC_CDE at the end, so how can one tell what is required? A regex is requirement-specific. There has to be a pattern of some sort, e.g. it contains @ or always starts with _A. – Inder Jul 24 '18 at 07:46
  • @Tom Ron For example, input = "abc_cde_i_like_an_apple_def", then output = "i_like_an_apple" – Joontae Kim Jul 24 '18 at 07:47
  • @Inder "ABC_" or "CDE_" can be a prefix, and "_DEF" can be a suffix. I'd like to extract the string between the prefix and the suffix, but each of them may or may not be present. – Joontae Kim Jul 24 '18 at 07:48
  • That helps; please add that information to the question. – Inder Jul 24 '18 at 07:49

1 Answer


You may use

(?:ABC_|CDE_|^)+(\S*?)(?:_DEF|$)

See the regex demo

Details

  • (?: - start of a non-capturing group that matches any of the subpatterns separated with the alternation operator |:
    • ABC_ - a literal substring ABC_
    • | - or
    • CDE_ - a literal substring CDE_
    • | - or
    • ^ - start of string
  • )+ - one or more consecutive occurrences, as many as possible (+ is a greedy quantifier)
  • (\S*?) - Capturing group 1: zero or more chars other than whitespace, but as few as possible due to the *? lazy quantifier
  • (?:_DEF|$) - either _DEF or (|) end of string ($).
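As a quick check, here is a minimal Python sketch that applies the pattern above to the four input shapes listed in the question. The names `PATTERN` and `extract_core` are made up for this illustration, and the sketch assumes the core string contains no whitespace, since the capturing group is built from `\S`.

```python
import re

# Pattern from the answer above. The raw string (r"...") keeps \S from
# being interpreted as a Python string escape.
PATTERN = re.compile(r"(?:ABC_|CDE_|^)+(\S*?)(?:_DEF|$)")


def extract_core(text):
    """Return the text between the optional prefixes and the optional
    "_DEF" suffix, or None if the pattern does not match.

    Hypothetical helper written for this example only.
    """
    match = PATTERN.search(text)
    return match.group(1) if match else None


samples = [
    "ABC_ANY_STRING_DEF",
    "ANY_STRING",
    "ANY_STRING_DEF",
    "ABC_CDE_ANY_STRING_DEF",
]

for sample in samples:
    # Each of these inputs yields ANY_STRING.
    print(sample, "->", extract_core(sample))
```

Because the group is lazy, it stops at the first `_DEF` it meets, so a core string that itself contains `_DEF` would be cut short there.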
Wiktor Stribiżew
  • What does the | expression do, e.g. in ABC_|CDE_| ? – Inder Jul 24 '18 at 07:52
  • @Inder `|` is an [**alternation operator**](https://www.regular-expressions.info/alternation.html). – Wiktor Stribiżew Jul 24 '18 at 07:52
  • @Wiktor May I ask why (?:ABC_|CDE_|^)+(\S*)(?:_DEF|$) does not work properly? What does the "?" do to make it work? – Joontae Kim Jul 24 '18 at 08:00
  • @JoontaeKim `\S*` (*0+ non-whitespace chars*) is a greedily quantified pattern and `\S` matches `_`, `D`, `E` and `F`. So, `\S*` grabs all non-whitespace chars up to the end of string here, and checks if the `(?:_DEF|$)` can match there. Yes, the `$` matches the end of string, so Group 1 holds all the text it grabbed and the regex engine returns a valid match. See [this debugger page](https://regex101.com/r/CAsHTH/2/debugger). – Wiktor Stribiżew Jul 24 '18 at 08:03
  • @Wiktor I see. The "?" seems to give the engine a chance to check for "_DEF" before "\S*" consumes more characters. Thanks a lot!! – Joontae Kim Jul 24 '18 at 08:06
  • @JoontaeKim: Accept the answer if it helped you (green tick on the left). – Jan Jul 24 '18 at 08:32
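
The greedy-versus-lazy difference discussed in the comments above can be reproduced with a short Python sketch. The variable names are chosen for this illustration only; the two patterns are the ones quoted in the comments and the answer.

```python
import re

# Greedy variant asked about in the comments: \S* takes as much as it can.
GREEDY = re.compile(r"(?:ABC_|CDE_|^)+(\S*)(?:_DEF|$)")
# Lazy variant from the answer: \S*? takes as little as it can.
LAZY = re.compile(r"(?:ABC_|CDE_|^)+(\S*?)(?:_DEF|$)")

text = "ABC_CDE_ANY_STRING_DEF"

# Greedy: \S* runs to the end of the string, where $ matches,
# so the "_DEF" suffix ends up inside group 1.
greedy_result = GREEDY.search(text).group(1)

# Lazy: the group grows one character at a time and stops as soon as
# "_DEF" (or the end of the string) can match.
lazy_result = LAZY.search(text).group(1)

print("greedy:", greedy_result)  # ANY_STRING_DEF
print("lazy:  ", lazy_result)    # ANY_STRING
```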