I'm trying to match all text (including special characters and markup tags) between two tags, but when there are two matches on the same line, the regex treats them as a single match.

I ended up with this expression:

(?<=<br><i>)[^<\/i>].*(?=<\/i><br>)

Beginning tag:

<br><i>

End tag:

</i><br>

It works with HTML containing this:

<br><i>"hello olá - ok@tchau"</i><br>  
<br><i>"another text"</i><br>

But it doesn't work with this HTML:

<br><i>"hello"</i><br><br><i>"ok"</i><br>

https://regex101.com/r/kHd2z2/1

danilo

2 Answers

(?<=<br><i>)[^<\/i>](.*?)(?=<\/i><br>)

Specifically, notice the (.*?): the ? makes the star lazy rather than greedy, so it matches the shortest possible text between the tags instead of running on to the last closing tag on the line.

See here: https://regex101.com/r/QKl0uN/1
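To see the difference on the sample line from the question, here is a minimal sketch comparing a greedy and a lazy pattern. The variable names are made up for illustration. It also leaves out the `[^<\/i>]` part: that is a character class excluding the single characters `<`, `/`, `i` and `>`, so it would reject content that happens to start with `i`. Without it, an empty tag pair yields an empty match.

```javascript
// Sample line from the question with two tagged segments.
const html = '<br><i>"hello"</i><br><br><i>"ok"</i><br>';

// Greedy: .* runs on to the last </i><br> on the line.
const greedy = /(?<=<br><i>).*(?=<\/i><br>)/g;
// Lazy: .*? stops at the first </i><br> it can reach.
const lazy = /(?<=<br><i>).*?(?=<\/i><br>)/g;

const greedyMatches = html.match(greedy);
const lazyMatches = html.match(lazy);

console.log(greedyMatches); // one long match spanning both segments
console.log(lazyMatches);   // two separate matches
```

This relies on lookbehind support in the JavaScript engine, which current Node.js and browser versions provide.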

coreyp_1

Why you should not use regex for parsing html / xml

Generally, it is not a good idea to use regex to parse HTML or XML. It is usually better to write a small script that approximates real parsing with built-in string functions, which is sufficient most of the time, or to do real parsing by walking through the characters one by one.

A regex is usually written for one narrow, specific use case, which makes it hard to alter or extend, and it is generally harder to read.

Besides, regex can perform poorly. If this code runs very frequently, I would suggest writing a simple script to do the job. JavaScript strings have indexOf and lastIndexOf methods that can help a ton.

Alternative solution

How about the following:

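// Returns the text between the first openDelimiter and the last closeDelimiter
// in input, then removes every string listed in stripArray from that text.
// Returns '' when either delimiter is missing.
function matchBetween(openDelimiter, closeDelimiter, input, stripArray = []) {
  const from = input.indexOf(openDelimiter);
  let result = '';
  if (from !== -1) {
    // lastIndexOf spans across all segments on the line.
    const to = input.lastIndexOf(closeDelimiter);
    if (to !== -1) {
      result = input.substring(from + openDelimiter.length, to);
      // Strip any delimiters left inside, e.g. between two adjacent segments.
      for (const strip of stripArray) {
        result = result.replaceAll(strip, '');
      }
    }
  }
  return result;
}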

const examples = [
  '<br><i>"hello olá - ok@tchau"</i><br>',
  '<br><i>"another text"</i><br>',
  '<br><i>"hello"</i><br><br><i>"ok"</i><br>'
];

const openDelimiter = '<br><i>';
const closeDelimiter = '</i><br>';
const stripArray = [openDelimiter, closeDelimiter];

for (let i = 0; i < examples.length; i++) {
  console.log('#' + i, matchBetween(openDelimiter, closeDelimiter, examples[i], stripArray));
}

It is just a very simple example, but most of the time a function like that is already sufficient. Also you can easily extend the functionality as you go.
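One possible extension: because matchBetween pairs the first opening delimiter with the last closing one, a line with two segments comes back as a single combined string (for the third example, `"hello""ok"`). If each segment is wanted separately, a loop over indexOf can collect them into an array. The following is only a sketch; findAllBetween is a hypothetical helper name, not part of the code above.

```javascript
// Hypothetical helper: returns every substring found between
// openDelimiter and the nearest following closeDelimiter.
function findAllBetween(openDelimiter, closeDelimiter, input) {
  const results = [];
  let searchFrom = 0;
  while (true) {
    const from = input.indexOf(openDelimiter, searchFrom);
    if (from === -1) break;
    const start = from + openDelimiter.length;
    const to = input.indexOf(closeDelimiter, start);
    if (to === -1) break;
    results.push(input.substring(start, to));
    // Resume at the closing delimiter (not after it), so that its <br>
    // can also serve as the start of the next opening delimiter.
    searchFrom = to;
  }
  return results;
}

const line = '<br><i>"hello"</i><br><br><i>"ok"</i><br>';
const found = findAllBetween('<br><i>', '</i><br>', line);
console.log(found);
```

Resuming the search at the closing delimiter is a design choice that mirrors how the lookaround-based regex behaves: the delimiters are not consumed, so `<br><i>a</i><br><i>b</i><br>` still yields both segments.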

F. Müller
  • Thanks. I'm just starting to learn regex and thought it would be useful in this case, but it really isn't the right tool for the job. I had already written a function, so I will go back to using it. – danilo Feb 09 '21 at 16:04
  • @danilo Yep. Well, it will work for small simple examples but you don't want to use it in a context-free language like (x)html. The biggest problem is that you have tons of use-cases to consider - and they can break the code quite easily - if you want to capture all of the exceptions and stuff in a regex you are screwed. (further reading: https://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not, https://medium.com/thecyberfibre/stop-parsing-x-html-with-regular-expression-2cf13215b411) – F. Müller Feb 09 '21 at 21:09