
I'm using the following regular expression to pull out some html:

(?i)(?:\<tr\s*class='list'[^\>]*\>)[^$+]*\</tr\>

The problem is that it's not separating the TRs correctly. I'm trying to use $+ to reference the tag selector again, to ensure that the contents of the match don't include the start tag again. Here is the sample HTML:

http://www.pastie.org/1311827

There are multiple <tr>s in some matches. Please help.

Daniel Vandersluis
chief7

2 Answers


I don't know what you think [^$+]* means, but it defines a negated character class that matches zero or more times. In other words, it matches an empty string, or one or more characters that aren't a literal dollar sign or plus.
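To make the grouping behaviour concrete, here is a minimal sketch in Python (an assumption, since the question does not name a language). The sample rows are hypothetical and are not taken from the linked pastie. It shows the greedy negated class running across a row boundary, next to a lazy `.*?` alternative that only works if rows of this kind are never nested inside one another.

```python
import re

# Hypothetical sample rows; the markup from the linked pastie is not reproduced here.
html = (
    "<tr class='list'><td>a</td></tr>\n"
    "<tr class='list'><td>b</td></tr>\n"
    "<tr class='list'><td>c + d</td></tr>\n"
)

# Pattern from the question: [^$+]* is a greedy negated character class.
# It consumes any character except a literal '$' or '+', including '<' and newlines,
# so it runs past the first </tr> and only backtracks from the first '$' or '+'.
greedy = re.compile(r"(?i)(?:<tr\s*class='list'[^>]*>)[^$+]*</tr>")

# Lazy alternative (assumes matching rows are never nested):
# .*? expands one character at a time and stops at the first </tr>.
lazy = re.compile(r"(?is)<tr\s+class='list'[^>]*>.*?</tr>")

greedy_matches = greedy.findall(html)
lazy_matches = lazy.findall(html)

print(len(greedy_matches), "match(es) with the original pattern")
print(len(lazy_matches), "match(es) with the lazy pattern")
```

With this sample, the original pattern lumps the first two rows into one match and misses the third (its content contains a `+`), while the lazy pattern returns each row separately.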

HTML cannot be trivially parsed by regex (unless the structure is known ahead of time), because properly parsing a document requires recursion: elements can be nested within elements of the same type (for instance, a <div> can contain another <div>). While some languages (you didn't specify which one you're using) support recursive regular expressions (Perl and PHP, for instance), a proper DOM parser would likely be more efficient than a recursive regex, quite apart from the complexity of such a regex.
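As a rough illustration of the nesting problem, the sketch below (Python, with a made-up snippet of markup) applies a non-recursive pattern to a <div> that contains another <div>. The lazy quantifier stops at the first closing tag, which belongs to the inner element.

```python
import re

# Hypothetical markup: an outer div containing an inner div.
html = '<div id="1">outer <div id="2">inner</div> tail</div>'

# Non-recursive attempt to capture the whole outer div.
m = re.search(r'<div id="1">.*?</div>', html)
matched = m.group(0)

# The match ends at the inner </div>, so the outer element is cut short.
print(matched)
```

Switching to a greedy `.*` would happen to work on this one string, but it would overshoot as soon as a sibling <div> follows the outer one.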

Daniel Vandersluis
  • @Daniel, this is simply not true. [HTML certainly can be parsed with regular expressions!](http://stackoverflow.com/questions/4044946/regex-to-split-html-tags/4045840#4045840) It is just too tough in the general case to bother with when there are good parsing classes out there. However, those are ridiculous overkill when the HTML being parsed is not in fact general but specific, when you know exactly the limited possibilities. Also, this tired old refrain about HTML not being REGULAR is completely bogus and annoying. Even v7 grep parsed non-REGULAR languages: `/(.)\1/` isn’t REGULAR. So what? – tchrist Nov 19 '10 at 21:32
  • @tchrist In general, regular expressions can't handle infinite recursion (I am well aware of the fact that there are regex engines that support recursion but that is irrelevant). I am not really familiar with Perl, but it looks like you're defining a recursive expression there. Without recursion, there is no way to capture the entire div with id 1 here using regex: `
    `
    – Daniel Vandersluis Nov 19 '10 at 21:43
  • @Daniel: It is not irrelevant. It is completely relevant. But we have no idea what language the user is using, because he didn’t follow Andy’s advice about always adding a language tag to go with the `regex` tag. But even without a recursive regex, there remain many input sets amenable to any regular expressions from v7 onwards, all of which are non-REGULAR regular expressions. It might be too hard for the lightweights, but without knowing more about the problem, it isn’t decidable. – tchrist Nov 19 '10 at 21:48
  • I will concede that I jumped to the "HTML is not regular" defense without explanation. Yes, modern regular expressions are capable of parsing non-regular languages, as opposed to formal regular expression theory, which is only capable of expressing alternation, concatenation and the Kleene star. However, again, you're talking about Perl specifically, and without a language tag on the question, in my opinion the solution should be cross-language, in which case recursive regex isn't applicable. – Daniel Vandersluis Nov 19 '10 at 21:53
  • @Daniel: The greatest common factor between all languages using the patterns-formerly-known-as-regexes is the very most bare-boned, basic operations that Ken Thompson showed before most of this readership was even born. It has to be solvable with a DFA only, and you must not use backrefs or anything else that the original implementation disallowed. That is a stupid and unrealistic assumption, because there is not a single language out there that people regularly use that is so crippled. It is wholly the fault of the original poster if they require a specific language yet neglect to specify it. – tchrist Nov 19 '10 at 22:05
  • @Daniel: Thank you. I have reversed my vote because of your conscientious edit. – tchrist Nov 19 '10 at 22:07
  • @tchrist I don't disagree with your assertions, although it is possible to solve a lot of problems using DFA-able regexes that people use modern techniques as crutches for, usually to lesser efficiency. In any event, I've updated my answer to clarify. – Daniel Vandersluis Nov 19 '10 at 22:10
  • @tchrist thanks, appreciate it. It's nice to have a good discussion about regex, and to be completely honest, I find it a bit ridiculous how much that answer about not using regex for HTML is used without explanation. – Daniel Vandersluis Nov 19 '10 at 22:15
  • @Daniel: You’re perfectly welcome. People deserve more than slap-down answers that don’t explain anything. It’s not exactly intellectually dishonest, but it does seem like it’s the wrong sort of lazy, if you know what I mean. – tchrist Nov 19 '10 at 23:18
  • Wow, seems like I hit a nerve. Thanks for all the replies. I do know the format in advance. I was really hoping someone could take my expression and try it against the referenced HTML. I'm trying to get the TRs individually matched for further processing. For some reason, some rows are grouped into a single match. I have other similar expressions that work great. I must be missing something on this one. – chief7 Nov 20 '10 at 00:16

Use document.getElementsByTagName in your favorite DOM library, iterate over the resulting NodeList with a loop, and check each element's getAttribute('class').

I suggest not using regex, because it's only a matter of time before the regex breaks unless you're dealing with very trivial markup. Besides, the DOM is made for exactly this purpose.
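A minimal sketch of that approach, using Python's standard-library `xml.dom.minidom` as a stand-in for "your favorite DOM library" (an assumption; the question names no language). Note that minidom is an XML parser, so this only works when the markup is well formed; the sample table below is hypothetical, and real-world HTML may need an HTML-aware parser instead.

```python
from xml.dom import minidom

# Hypothetical, well-formed sample; minidom rejects markup that is not valid XML.
markup = (
    "<table>"
    "<tr class='list'><td>a</td></tr>"
    "<tr class='other'><td>x</td></tr>"
    "<tr class='list'><td>b</td><td>c</td></tr>"
    "</table>"
)

doc = minidom.parseString(markup)

rows = []
for tr in doc.getElementsByTagName("tr"):
    # getAttribute returns "" when the attribute is absent.
    if "list" in tr.getAttribute("class").split():
        # Assumes every cell holds a single text node.
        cells = [td.firstChild.data for td in tr.getElementsByTagName("td")]
        rows.append(cells)

print(rows)
```

Each matching row comes back as its own list of cell texts, so no row boundaries have to be inferred from the text.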

meder omuraliev