
I am trying to use a RegEx to search through a long string, and I am having trouble coming up with an expression. I am trying to search through some HTML for a set of tags beginning with a tag containing a certain value and ending with a different tag containing another value. The code I am currently using to attempt this is as follows:

matcher = new RegExp(".*(<[^>]+" + startText + "((?!" + endText + ").)*" + endText + ")", 'g');

data.replace(matcher, "$1");

The strangeness around the middle ( ((?!endText).)* ) is borrowed from another thread, found here, that seems to describe my problem. The issue I am facing is that the expression matches the beginning tag, but it does not find the ending tag and instead includes the remainder of the data. Also, the lookaround in the middle slowed the expression down a lot. Any suggestions as to how I can get this working?

EDIT: I understand that parsing HTML in RegEx isn't the best option (makes me feel dirty), but I'm in a time-crunch and any other alternative I can think of will take too long. It's hard to say exactly what the markup I will be parsing will look like, as I am creating it on the fly. The best I can do is to say that I am looking at a large table of data that is collected for a range of items on a range of dates. Both of these ranges can vary, and I am trying to select a certain range of dates from a single row. The approximate values of startText and endText are @@ASSET_ID@@_<YYYY_MM_DD>. The idea is to find the code that corresponds to this range of cells. (This edit could quite possibly have made this even more confusing, but I'm not sure how much more information I could really give without explaining the entire application).

EDIT: Well, this was a stupid question. Apparently, I just forgot to add .* after the last paren. Can't believe I spent so long on this! Thanks to those of you that tried to help!

Crash
  • 2
    Aside from the flood of comments that are on their way about not parsing HTML with Regex (which you shouldn't do - it's not a Regular language), we are at the very least going to need to see sample data - what are you replacing, what is your start and end text, expected output, actual output, etc etc. – FrankieTheKneeMan Aug 12 '13 at 21:54
  • @FrankieTheKneeMan Perfect, I second ya. (On both waiting the flood and needing a sample data.) – acdcjunior Aug 12 '13 at 21:54
  • 3
    Don't listen to the trolls. Every tool has its time and place. I'll take a look at your question and try to help you out, give me a minute. – Suamere Aug 12 '13 at 22:07
  • 1
    @Suamere: *what* 'trolls'? The reason that posts asking about parsing HTML with regex get *lots* of (*valid*) comments about *not* parsing HTML with regex is because it's the wrong tool for the job, for precisely the reason given by Frankie. And, Crash: please post your solution as an *answer* to your question. That way it might be of benefit to other users in future (given the specificity of the regular expression this is, perhaps, unlikely, but it's never a bad thing to answer a question). – David Thomas Aug 12 '13 at 22:18
  • 1
    [Every tool has a place](http://suamere.com/Apps/Regex/ParsingHtml.aspx), (Click that) don't fall in with the trolls who blindly throw away parsing HTML with Regex. – Suamere Aug 12 '13 at 22:20
  • @Suamere what a great post, respect! – acdcjunior Aug 12 '13 at 22:28
  • @Suamere, I agree that every tool has a time and a place, but when dealing with HTML, Regular Expressions quickly get out of control. Additionally, they're difficult to maintain unless you have a very specific subset of information you're looking for. *Parsing* HTML is very different from *extracting information from it*. If you require your information to have knowledge of how HTML works (for instance, "Everything within a given DIV"), Regex isn't your tool. If you just need data (like "the SRC attribute of an IMG with id='foo'") then Regular Expressions are serviceable. – FrankieTheKneeMan Aug 12 '13 at 22:30
  • At the end of the day, my post was about asking for more information about the problem - I actually detest people who just write "Use an HTML Parser!" and leave. It's not productive. – FrankieTheKneeMan Aug 12 '13 at 22:31
  • 1
    Not sure why your last edit says you added `.*` after the last parenthesis. Your situation was obviously difficult to describe, but I still suggest taking a look at my answer to help clean up your data. I'd be curious to know more about what you were dealing with and why .* solved your issue. It really shouldn't ever be used and is very slow. But if it works, use it. – Suamere Aug 12 '13 at 22:31
  • He's using `.*`s because he's issuing a `replace` statement instead of a `match`. – FrankieTheKneeMan Aug 12 '13 at 22:32
  • @Suamere Well, if you know what you're doing then parsing HTML with regex in certain cases is ok. But most developers don't, just watch the regex tag and see how many crap questions come in: "how to match digits", "how to match a set of characters", "how does this regex work" (turns out it's really simple) and the list goes on. Most of them don't even know the basics, they don't know the difference between greedy and ungreedy patterns. So for the sake of everyone, you should avoid unmaintainable regexes. Because if the input changes a slight bit, the developer after you will need to edit it – HamZa Aug 12 '13 at 22:33
  • 1
    If you're doing `source = replace .*(xxx).* with $1`, you could just state that your `original source = (xxx)`. The .*'s are very slow. – Suamere Aug 12 '13 at 22:35
  • 1
    Also, I wasn't saying Frankie is a troll. His first Comment was well said. I was referring to the possible referenced onslaught of trolls who might come in. – Suamere Aug 12 '13 at 22:38
  • @Suamere well that's unfortunately true that some people became allergic when regex is mentioned with HTML. Don't get me wrong, I'm crazy to the extent of [parsing PHP](http://stackoverflow.com/a/17134110/) with regex :) – HamZa Aug 12 '13 at 22:41

1 Answer


First of all, why is there a .* (Dot Asterisk) at the beginning? If you have text like the following:

This is my Text

And you want "my Text" pulled out, you do my\sText. You don't have to do the .*.
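A quick sketch of that point in JavaScript, using the sample string above (the variable names are only for illustration):

```javascript
// Sample text from the example above.
const text = "This is my Text";

// Match only the part you need; no leading .* is required.
const found = text.match(/my\sText/);

// found[0] is the matched substring, found.index is where it starts.
console.log(found[0]);    // "my Text"
console.log(found.index); // 8
```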

That being said, since all you'll be matching now is what you need, you don't need the main Capture Group around "Everything". This: .*(xxx) is a huge no-no, and can almost always be replaced with this: xxx. In other words, your regex could be replaced with:

<[^>]+xxx((?!zzz).)*zzz
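To see the difference between the wrapped form used with replace and the bare form used with match, here is a small sketch. The sample markup is made up for illustration, and xxx and zzz stand in for startText and endText:

```javascript
// Made-up sample markup; xxx and zzz stand in for startText and endText.
const data = "<p>before</p><td id=xxx>1</td><td id=zzz>2</td><p>after</p>";

// Wrapped form with replace: only the text up to the end of the match is
// swapped for $1, so everything after zzz stays in the result.
const replaced = data.replace(/.*(<[^>]+xxx((?!zzz).)*zzz)/, "$1");

// Bare form with match: returns just the wanted span.
const matched = data.match(/<[^>]+xxx((?!zzz).)*zzz/)[0];

console.log(replaced); // "<td id=xxx>1</td><td id=zzz>2</td><p>after</p>"
console.log(matched);  // "<td id=xxx>1</td><td id=zzz"
```

The replace version drops the text before the match but keeps the remainder of the data, which is the symptom described in the question.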

From there I examine what it's doing.

  1. You are looking for an HTML opening Delimiter <. You consume it.
  2. You consume at least one character that is NOT a closing HTML Delimiter >, and possibly many. This is important, because if your tag is <table border=2>, then you have, at minimum, already consumed <t, if not more.
  3. You are now looking for the StartText. If that StartText is table, you'll never find it, because you have already consumed the t. So replace that + with a *.
  4. The group ((?!zzz).)* keeps consuming one character at a time as long as the closing text does NOT start at the current position. The Asterisk is Greedy, so it grabs as much as it can before giving anything back. I suggest making it lazy by adding a ?.
  5. Once the lookahead stops the group, the regex looks for the closing text and gathers it successfully.
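The effect of the + versus the * can be checked with a short sketch. The tag is the one from the list above, and "table" plays the role of the StartText:

```javascript
// The tag from the example above; "table" plays the role of startText.
const tag = "<table border=2>";

// With +, at least one character must sit between < and the start text,
// so a start text that begins right after < can never match.
const withPlus = /<[^>]+table/.test(tag);

// With *, zero characters are allowed there, so the match succeeds.
const withStar = /<[^>]*table/.test(tag);

console.log(withPlus); // false
console.log(withStar); // true
```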

The result of that logic:

<[^>]*xxx((?!zzz).)*?zzz

If you're going to use a dot anyway, which is okay for new Regex writers but not suggested for seasoned ones, I'd go with this:

<[^>]*xxx.*?zzz

So for Javascript, your code would say:

matcher = new RegExp("<[^>]*" + startText + ".*?" + endText, 'gi');

I put the IgnoreCase "i" in there for good measure, but you may or may not want that.
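Here is a minimal sketch of how that could be used with match instead of replace. The helper names escapeRegExp and extractRange, the sample row, and the asset ID A1 are all assumptions made for illustration, not something from the question. The escaping step is only there so that special characters in startText or endText are treated literally:

```javascript
// Hypothetical helper: escape regex metacharacters so the text matches literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: return the span from the tag containing startText
// through the end of endText, or null if nothing matches.
function extractRange(data, startText, endText) {
  const matcher = new RegExp(
    "<[^>]*" + escapeRegExp(startText) + ".*?" + escapeRegExp(endText),
    "i"
  );
  const m = data.match(matcher);
  return m ? m[0] : null;
}

// Made-up row; the real markup is not shown in the question.
const row =
  '<tr><td id="@@A1@@_2013_08_01">5</td>' +
  '<td id="@@A1@@_2013_08_02">6</td>' +
  '<td id="@@A1@@_2013_08_03">7</td></tr>';

const result = extractRange(row, "@@A1@@_2013_08_01", "@@A1@@_2013_08_02");
console.log(result);
// <td id="@@A1@@_2013_08_01">5</td><td id="@@A1@@_2013_08_02
```

Note that the match ends at the end of endText, not at the end of its tag, and that the dot does not match line breaks in JavaScript, so this sketch assumes the whole range sits on one line.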

Suamere
  • 1
    You may want to give a brief explanation of the difference between replace, match, and search statements in here - as I think that's what OP missed. – FrankieTheKneeMan Aug 12 '13 at 22:33
  • 1
    Right, after posting this I noticed he was doing a replace possibly incorrectly. I almost edited my answer for that, except that he didn't say the purpose of the replace or note it at all in his question. So if he used my answer, he could just say `OriginalSource = Regex Result` (Paracode). But I think he's left by now since he found his answer with `.*` – Suamere Aug 12 '13 at 22:37