I want to parse the following string

void int g = <span id="sentenceColor">"c int void x is "</span> + 4;

I want to find the void and int that are outside the span element, not the ones inside it. I have the following regular expression.

(?<!<span id="sentenceColor">.*)((int)|(void))(?!.+(<\/span>))

I am assuming the above means: find int or void, with a lookbehind making sure the word is not preceded by <span id="sentenceColor"> followed by an indeterminate number of characters, and a lookahead making sure the word is not followed by an indeterminate number of characters and </span>. I've been using an online regex tester for a while and figured maybe someone has more experience with this than I do.

DaveK
  • What does regex have to do with HTML? – Gilles Quénot Feb 20 '18 at 16:22
  • Is this the whole input string, or does the input string contain this line? And what is the language? – revo Feb 20 '18 at 16:22
  • There is no language here. This is not HTML. I am creating HTML from information gained through regular expressions. The HTML is there because this is the resulting string I have at this particular point. Everything is handled in Java, but I am reading .txt files for this information. – DaveK Feb 20 '18 at 16:27
  • So the language is Java. Please tag your question with it. Regex questions should be tagged with a specific language or tool, otherwise we have no idea how to answer. – revo Feb 20 '18 at 16:30
  • No. The language is not Java. I am using an online tool called regex101 to build the regular expression. I am currently using Java, but what if someone wants to use the answer here in C++, Python, PHP, or anything else? The question is not geared to any specific language but to the fundamental usage of lookahead and lookbehind in regular expressions. I'm looking for a pattern that works for regular expressions outside of any library or language. Java is just the language I happen to be using at the moment. What if I want to use this in another language? – DaveK Feb 20 '18 at 16:48
  • Firstly - do you want to capture the `void int` that come **before** the tag? So why did you use look-behind? Secondly, it **is** important to know if you're using Javascript, because it doesn't allow lookbehind – GalAbra Feb 20 '18 at 18:56

2 Answers


Your regex suffers from a few mistakes:

  1. It uses look-behind with a dynamic length string, which is invalid:

Many regex flavors, including those used by Perl, Python, and Boost only allow fixed-length strings. You can use literal text, character escapes, Unicode escapes other than \X, and character classes. You cannot use quantifiers or backreferences...

  2. You mentioned you want to match void and int, but you use the OR operator: ((int)|(void)), so each individual match contains only one of them.
  3. Redundant parentheses, which create many groups (not crucial, but definitely not a great habit).
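
To make the first two points concrete, here is a minimal sketch using Python's `re` module. Python is only an assumption for illustration, since the question does not fix a flavor; other engines may report the lookbehind problem differently or accept bounded lookbehinds.

```python
import re

# The question's pattern, unchanged: the lookbehind contains ".*".
pattern = r'(?<!<span id="sentenceColor">.*)((int)|(void))(?!.+(<\/span>))'

try:
    re.compile(pattern)
    compiled = True
    message = ''
except re.error as exc:
    # Python's re rejects a lookbehind whose width is not fixed.
    compiled = False
    message = str(exc)

# An alternation yields one keyword per match; iterating over all
# matches is what returns both of them.
each = [m.group() for m in re.finditer(r'int|void', 'void int g = 4;')]
```

In this flavor the pattern never compiles, so the lookahead part is not even reached.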


If you want to match the void and int inside the tag, you can use this regex, which uses lookbehind properly (with a fixed-length string):

(?<=<span id="sentenceColor">).*(void int|int void)

Or if you want to match those before the tag, you should use lookahead; and this would be the regex you're after:

(void int|int void).*(?=<span id="sentenceColor">)
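
As a quick check, both patterns can be run against the string from the question. This is a sketch in Python's `re` (an assumed flavor; the variable names are only illustrative), and it captures the keyword pair in group 1:

```python
import re

line = 'void int g = <span id="sentenceColor">"c int void x is "</span> + 4;'

# Keyword pair found somewhere after the opening tag (fixed-length lookbehind).
inside = re.search(r'(?<=<span id="sentenceColor">).*(void int|int void)', line)

# Keyword pair found before the opening tag (lookahead).
before = re.search(r'(void int|int void).*(?=<span id="sentenceColor">)', line)

print(inside.group(1))  # int void
print(before.group(1))  # void int
```

Note that both patterns only find the two keywords when they appear next to each other, in either order.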
GalAbra

Well, as far as I know, most regex flavors do not allow quantifiers inside a lookbehind, so your '*' will not work and causes an error. I don't know how to solve your problem yet, but I will keep trying to find a solution, and at least you now know why it's not working.
[EDIT]:
Well, the following regex, (\".*?\"), matches each double-quoted string, quotes included.
So a solution I've come up with is to remove every match of this regex from the original string and then simply use (int|void) on the new string.
Hope this helps.
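
The two steps above can be sketched as follows, assuming Python's `re` module. The variable names are illustrative, and the `\b` word boundaries are an addition not present in the original suggestion:

```python
import re

line = 'void int g = <span id="sentenceColor">"c int void x is "</span> + 4;'

# Step 1: drop every double-quoted run. This also removes the attribute
# value "sentenceColor", which is harmless for the keyword search.
without_strings = re.sub(r'".*?"', '', line)

# Step 2: search what is left. The \b anchors are an added assumption so
# that identifiers such as "print" do not produce a false "int" match.
keywords = re.findall(r'\b(?:int|void)\b', without_strings)

print(keywords)  # ['void', 'int']
```

Running the removal before the span is wrapped around the string literal would avoid touching the attribute value at all.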
[EDIT 2]:
Below is the error Regex101 shows:
lookbehind assertion is not fixed length - offset: 31

Leonardo Maffei
  • Could you provide some documentation to substantiate the claim that one cannot use quantifiers with lookbehind? For which regex flavours does it hold? – Andrey Tyukin Feb 20 '18 at 18:58
  • @AndreyTyukin just pay attention to the "Match information" section at "Regex101" when you paste `(?<!.*)((int)|(void))(?!.+(<\/span>)) ` as a regex – Leonardo Maffei Feb 20 '18 at 19:05
  • Well, yes, thanks for the hint (that's really good advice, I'm going to use it). But your post literally starts with the words "I don't know how to solve your problem". Despite that, the OP's question is now shown in the queues as having at least one answer (therefore, it doesn't show up in the 'unanswered' queue). Do you have an answer, or do you have no answer? I'm not the OP, I'm just a random guy from the review queue... – Andrey Tyukin Feb 20 '18 at 19:17
  • My first comment was rather a hint that you should make a watertight argument for why what the OP wants is not possible at all. Sometimes, [this, too, counts as an answer, as in this example](https://stackoverflow.com/a/1732454/2707792). However, if you just say "I don't know how to solve it", and you have 11 rep, and you have images in your answer, then the system will most likely put your answer into the review queue. Otherwise, it would make the question appear as "answered", and prevent it from being *actually* answered. – Andrey Tyukin Feb 20 '18 at 19:28
  • @AndreyTyukin thanks for the tip. You're right, I will pay attention to this next time :) – Leonardo Maffei Feb 20 '18 at 19:36