It's amazing how no one, when addressing the problem of using RegEx with HTML, confronts the problem of HTML often NOT being well-formed, which renders a lot of HTML parsers completely useless.
If you are developing tools to analyze web pages, and it's a fact that these pages are not well-formed HTML, then statements like "Regex should never be used to parse HTML" or "use an HTML parser" are just completely bogus. The fact is that in the real world, people write HTML however they feel like, and not necessarily in a form suited for parsers.
RegEx is a completely valid way to find elements in text, and thus in HTML. If there is any other reasonable way to confront the problems the Original Poster has, then post it instead of falling back on a "use a parser" or "RTFM" statement.
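To make that concrete, here is a minimal sketch in Python of pulling link targets out of markup that is clearly not well-formed. The sample HTML, the pattern, and the `find_hrefs` helper are all made up for illustration, and the pattern only targets this one narrow job (href values on anchor tags). It is not a general HTML parser and will miss edge cases such as a `>` inside an attribute value.

```python
import re

# Hypothetical, deliberately malformed input: unclosed <p> and <li> tags,
# an unquoted attribute, mixed-case tag names, and no closing </html>.
SAMPLE_HTML = """
<html><body>
<p>First paragraph
<a href="https://example.com/one">One</a>
<ul><li><a href=/two class=nav>Two
<li><A HREF='three.html'>Three</A>
</body>
"""

# Match an opening anchor tag and capture its href value, whether the value
# is double-quoted, single-quoted, or unquoted.
HREF_RE = re.compile(
    r"""<a\b[^>]*?\shref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
    re.IGNORECASE,
)


def find_hrefs(html):
    """Return the href value of every anchor tag found, in document order."""
    results = []
    for match in HREF_RE.finditer(html):
        # Exactly one of the three capture groups participates in each match.
        results.append(next(g for g in match.groups() if g is not None))
    return results


print(find_hrefs(SAMPLE_HTML))
```

The point of the sketch is only that the pattern does not care whether tags are closed or nested correctly: it looks at each `<a ...>` tag on its own, so the broken structure around it does not matter.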