Regular Expression to find strings between two tokens, while EXCLUDING the tokens, when the start token is the same as the end token

Question

This is an extension of the question "Regular Expression to find a string included between two characters, while EXCLUDING the delimiters".

The solution to that question, modified slightly:

(?<=\#)(.*?)(?=\#)

Given a string "The #iPhone 4# is made by #apple#." that solution returns:

["iPhone 4", " is made by ", "apple"]

Now I'm not sure if this is possible using only a regex, but in this case " is made by " is not supposed to be returned. It simply happens to sit between the two #-wrapped strings, and so looks wrapped itself.

Clarification: The regex needs to support a variable number of #foo# strings in the parent string. There will not always be only 2.

Update

Due to the varied responses, and the realization that this problem is more simply solved without regex, I'm voting to close the question. Answer: do this without regex, in the language of your choice.
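
As a rough illustration of the non-regex route, here is a minimal Python sketch based on splitting. The function name `between_hashes` and its `delimiter` parameter are hypothetical, chosen only for this example; it is one possible approach, not the one the asker ended up using.

```python
def between_hashes(text, delimiter="#"):
    """Return the substrings enclosed by successive delimiter pairs."""
    parts = text.split(delimiter)
    # An even number of parts means an odd number of delimiters: the final
    # delimiter has no partner, so the last part is not enclosed by anything.
    if len(parts) % 2 == 0:
        parts = parts[:-1]
    # After splitting, the odd-indexed parts are the ones that sit between
    # an opening and a closing delimiter.
    return parts[1::2]


example = "The #iPhone 4# is made by #apple#."
tokens = between_hashes(example)
print(tokens)
```

Because the pairing is done by position, this handles any number of #foo# strings in the parent string.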

the backslash should be removed since # isn't a special character like ) was before the edit. — neydroydrec, Aug 28 '11 at 18:38
you really don't need regular expressions for this. Just search for the indices of every `#`, then iterate over the result two at a time to pull out your data (ie: first pair of indices is the first match, the second pair the second match, and so on) — Bryan Oakley, Aug 28 '11 at 21:02
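
The index-based idea from the comment above could be sketched like this in Python. The name `extract_pairs` is hypothetical, and the sketch assumes a single-character delimiter.

```python
def extract_pairs(text, delimiter="#"):
    """Pull out the text between each consecutive pair of delimiters."""
    # Collect the index of every delimiter character.
    positions = [i for i, ch in enumerate(text) if ch == delimiter]
    # Walk the indices two at a time: (1st, 2nd), (3rd, 4th), and so on.
    # zip stops at the shorter slice, so a trailing unpaired delimiter is ignored.
    pairs = zip(positions[0::2], positions[1::2])
    return [text[start + 1:end] for start, end in pairs]


sample = "Some #string# with a #bunch# of random #enclosed# smaller strings #inside#"
found = extract_pairs(sample)
print(found)
```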

tripleee · Answer 1 · 2011-08-31T07:30:58.123

Because the zero-width assertions do not consume the delimiters, the regex matches the text between every pair of adjacent delimiters instead of continuing after each "consumed" closing delimiter. You have to change the code which does the matching so that it consumes the delimiters and extracts, for instance, the first capture group, rather than the whole matched expression. It would help if you posted the code you are using now so we could tell you how to modify it, but your example is formatted in a Pythonesque way, so something like this:

stringlist = re.findall("#([^#]*)#", string)

Sorry, not at my computer, and my Python is not very good, so I'll probably have to get back to you with corrections.

Update: fixed and substantially simplified the code
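
To make the difference concrete, here is a small Python comparison of the lookaround pattern from the question with the capture-group pattern from this answer. The variable names are illustrative only.

```python
import re

text = "The #iPhone 4# is made by #apple#."

# Lookarounds do not consume the delimiters, so the closing '#' of one
# token can also serve as the opening '#' of the next match.
with_lookarounds = re.findall(r"(?<=#)(.*?)(?=#)", text)

# Here each match consumes both delimiters, and findall returns only the
# contents of capture group 1.
with_capture_group = re.findall(r"#([^#]*)#", text)

print(with_lookarounds)
print(with_capture_group)
```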

erikH · Answer 2 · 2011-08-28T20:57:32.580

1

Very close to @Gerben's answer, but this one works for me. There must be an odd number of '#' characters before the token, including the '#' that starts the token:

(?<=^[^#]*#([^#]*#[^#]*#)*)([^#]*)(?=#)

Can't you just take (?<=\#)(.*?)(?=\#) and ignore every other match in the match list before further processing?
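
A minimal Python sketch of that "ignore every other match" idea follows. It assumes the enclosed tokens are non-empty; an empty pair such as `##` would shift the alternation and is not handled here.

```python
import re

text = "The #iPhone 4# is made by #apple#."

# The lookaround pattern returns wanted and unwanted pieces alternately:
# token, text between tokens, token, ...
all_matches = re.findall(r"(?<=#)(.*?)(?=#)", text)

# Keep the matches at even positions (0, 2, 4, ...), which are the tokens.
wanted = all_matches[::2]
print(wanted)
```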

edited Aug 28 '11 at 20:57

answered Aug 28 '11 at 20:48

erikH


Visualize with Silverlight: http://regexhero.net/tester/?id=cf8a96b7-01bc-4867-bf5d-cea8547b106b – erikH Aug 28 '11 at 20:52
+1 for getting this to work and providing a demo, but it still won't work in most flavors (see the section "Important Notes About Lookbehind" on [this page](http://www.regular-expressions.info/lookaround.html) for the reason). But if you move the extra check into the lookahead - `(?<=#)[^#]*(?=#[^#]*(?:#[^#]*#[^#]*)*$)` - it should work in any flavor that supports lookbehind. – Alan Moore Aug 28 '11 at 21:58
Good to know. Lookahead is preferable to lookbehind. – erikH Aug 28 '11 at 22:14
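
For reference, the lookahead variant from Alan Moore's comment can be tried in Python, whose `re` module only accepts fixed-width lookbehind. This is just a quick check under the assumption that the string contains an even number of `#` characters.

```python
import re

# Pattern copied from the comment: the fixed-width lookbehind checks for the
# opening '#', and the lookahead requires an odd number of '#' between the
# end of the match and the end of the string.
pattern = re.compile(r"(?<=#)[^#]*(?=#[^#]*(?:#[^#]*#[^#]*)*$)")

text = "Some #string# with a #bunch# of random #enclosed# smaller strings #inside#"
tokens = pattern.findall(text)
print(tokens)
```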

Dewfy · Answer 3 · 2011-08-28T18:45:20.467

0

Instead of .* use [^\]]* (in the case where ] is the delimiter).

EDITED

So you have a list like #text#,#text#,... and want to extract the items of the list:

(\#[^\#]*\#(?:,|$))+

edited Aug 28 '11 at 18:45

answered Aug 28 '11 at 18:18

Dewfy


Sorry, I forgot to update the question. # is the delimiter which makes it more difficult than when using a different delimiter for start and end []. Any idea how to work it when the delimiter for start and end are the same, as in the case of #foo#? – Marc Aug 28 '11 at 18:26
Nope, a string like "Some #string# with a #bunch# of random #enclosed# smaller strings #inside#". The regex on that string should return ["string", "bunch", "enclosed", "inside"]. – Marc Aug 28 '11 at 18:49
OK, in this case you should work with groups: indexed elements inside parentheses. Groups are numbered from outer to inner, starting at 1. So the final regex for your example is `([^\#]*(\#([^\#]*)\#[^\#]*))*` and you need to use group 3. – Dewfy Aug 29 '11 at 07:41
can we do something like this in Ruby? – Anusha Oct 10 '18 at 10:52
@Anusha as far as I know (and there is evidence at https://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines), Ruby fully supports this syntax – Dewfy Oct 10 '18 at 15:54

score 0 · Answer 4 · answered Aug 28 '11 at 18:18

0

The solution doesn't return what you say it does (it's working on square brackets rather than hash marks), but it's a question of what you put into parentheses; the parentheses are what direct the capturing.

#([^#]*)#[^#]*#([^#]*)#
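
In Python, a single-pass use of this pattern might look like the sketch below. Note that it only covers the case of exactly two enclosed strings, since the pattern contains exactly two capture groups.

```python
import re

text = "The #iPhone 4# is made by #apple#."

# One match attempt; groups 1 and 2 hold the two enclosed strings.
match = re.search(r"#([^#]*)#[^#]*#([^#]*)#", text)
tokens = list(match.groups()) if match else []
print(tokens)
```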

answered Aug 28 '11 at 18:18

tripleee


I'm sorry, I forgot to update the regex from the other question to what I have currently for the # delimiters. Updated. – Marc Aug 28 '11 at 18:21
My answer assumes there is a single pass, rather than a matching loop which collects all the collected groups. Perhaps you should update your question again if that is not acceptable (say, if you want this for a variable number of enclosed strings, rather than exactly two). – tripleee Aug 28 '11 at 18:25

neydroydrec · Answer 5 · 2011-08-28T19:10:01.223

0

I am not familiar enough with regular expressions to give you a regular expression answer. But it seems that every second item of your list is to be discarded. Why not iterate the list and do that?

This is how I would do it:

import re

text = "The #iPhone 4# is made by #apple#"
# Match each #...# pair with the hashes included, then strip the hashes.
cleanlist = [match.strip('#') for match in re.findall('#.*?#', text)]
print(cleanlist)
# ['iPhone 4', 'apple']

edited Aug 28 '11 at 19:10

answered Aug 28 '11 at 18:18

neydroydrec


This is fairly reasonable. If I can't find a pure regex solution I'll likely end up going this route. – Marc Aug 28 '11 at 18:26
You could just as well match with the hashes included and then strip them from each result. – neydroydrec Aug 28 '11 at 18:37
@Marc: see my edit (I used Python). – neydroydrec Aug 28 '11 at 19:11

score 0 · Answer 6 · answered Aug 28 '11 at 19:02

Not sure if this works, but the idea is that it only matches the opening # if there is an even number of # characters before it.

(?<=(?:^[^#]*#[^#]*#)*#)([^#]*)(?=#)

But what language are you using? It would be a lot easier to do without relying on regex alone.
