
I am creating an application that will take a URL as input, retrieve the page's HTML content off the web and extract everything that isn't contained in a tag. In other words, the textual content of the page, as seen by a visitor to that page. That includes 'masking' out everything encapsulated in <script></script>, <style></style> and <!-- -->, since these portions contain text that is not enveloped within a tag (but is best left alone).

I have constructed this regex:

(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)

It correctly selects all the content that I want to ignore, and only leaves the page's text contents. However, that means that what I want to extract won't show up in the match collection (I am using VB.Net in Visual Studio 2010).

Is there a way to "invert" the matching of a whole document like this, so that I'd get matches on all the text strings that are left out by the matching in the above regex?

So far, what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group. This works, but I was wondering if it was possible to do it all through regex and just end up with matches on the plain text.
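A minimal sketch of that workaround (shown in Python for illustration, since the posted pattern already uses Python-style `(?P<name>)` group syntax; the actual code is VB.Net, and `extract_text` is an illustrative name). Using `+` rather than `*` in the text alternative avoids zero-length matches between adjacent tags:

```python
import re

# The question's masking pattern with the extra "text" alternative
# appended; `+` (not `*`) avoids zero-length text matches.
MASK = re.compile(
    r"(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)"  # script/style blocks
    r"|(?:<!--[\s\S]*?-->)"                           # comments
    r"|(?:<[\s\S]*?>)"                                # any other tag
    r"|(?P<text>[^<>]+)",                             # leftover plain text
    re.IGNORECASE,
)

def extract_text(html):
    """Collect only the matches where the 'text' group has content."""
    return [m.group("text") for m in MASK.finditer(html) if m.group("text")]
```

For example, `extract_text('<p>Hi</p>')` returns `['Hi']`.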

This is supposed to work generically, without knowing any specific tags in the html. It's supposed to extract all text. Additionally, I need to preserve the original html so the page retains all its links and scripts - I only need to be able to extract the text so that I can perform searches and replacements within it, without fear of "renaming" any tags, attributes or script variables etc (so I can't just do a "replace with nothing" on all the matches I get, because even though I am then left with what I need, it's a hassle to reinsert that back into the correct places of the fully functional document).
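That search-and-replace-in-place idea can be sketched as a single pass: run the combined pattern over the document and rewrite only the matches whose "text" group fired, emitting everything else verbatim (again Python for illustration; `replace_in_text` and its parameters are illustrative names):

```python
import re

MASK = re.compile(
    r"(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)"
    r"|(?:<!--[\s\S]*?-->)"
    r"|(?:<[\s\S]*?>)"
    r"|(?P<text>[^<>]+)",
    re.IGNORECASE,
)

def replace_in_text(html, old, new):
    """Rewrite words only inside plain-text runs; tags, scripts,
    styles and comments pass through untouched."""
    def repl(m):
        if m.group("text"):               # a plain-text run: safe to edit
            return m.group("text").replace(old, new)
        return m.group(0)                 # markup: emit verbatim
    return MASK.sub(repl, html)
```

Because markup matches are emitted unchanged, the output document keeps all its tags, attributes and script code in place.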

I want to know if this is at all possible using regex (I know about HTML Agility Pack and XPath, but I don't feel like using them).

Any suggestions?

Update: Here is the (regex-based) solution I ended up with: http://www.martinwardener.com/regex/, implemented in a demo web application that will show both the active regex strings along with a test engine which lets you run the parsing on any online html page, giving you parse times and extracted results (for link, url and text portions individually - as well as views where all the regex matches are highlighted in place in the complete HTML document).

General Grievance
d7samurai
  • You can surround short code blocks with `\`` characters. – SLaks Oct 17 '10 at 00:52
  • The regex does a beautiful job of selecting just the right portions - probably more elegantly than using any DOM et al method (or..?). So except for the "inversion" part, I'm pretty happy with using regex - it's very compact, code wise. I have two candidate methods to make this work: To add an extra piece to the regex (`|(?P<text>[^<>]*)`) that actually will select the leftover text as an isolated match - and since that capture group has a name, it can be tested for in the ensuing iteration. This works, except that I noticed it also picked up just a couple of other "matches" that baffled me. – d7samurai Oct 17 '10 at 01:13
  • The other possibility (I haven't thought this through, but it should work, although it's the cumbersomeness I wanted to avoid) is to use the regex in the main post above, that leaves out the text parts - and "manually" track the matches. Since they both let me know where in the page the match started, as well as the length of the matched string, I would then use the difference between index+length of one match and the index of the next match to determine what would then represent a pure text portion of the document, not caught up in any of the 'masks'. – d7samurai Oct 17 '10 at 01:16
  • This is so that I can do my word searching / replacement in the same operation as the match iteration, and the resulting document would be ready, including all the right html and script code. – d7samurai Oct 17 '10 at 01:19
  • Update: I'm not baffled any more. Turns out it was my fault for not converting RegexBuddy syntax correctly over to .Net regex syntax (they differ in how they name capture groups, and I overlooked one when I was changing them). So the routine works perfectly. And I can do it in a matter of 10 lines of code. – d7samurai Oct 17 '10 at 03:56
  • Update: My solution to this is posted as an answer below. – d7samurai Oct 17 '10 at 05:06

6 Answers


what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group.

That's what one would normally do. Or even simpler, replace every match of the markup pattern with an empty string, and what you've got left is the stuff you're looking for.
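A minimal sketch of this replace-with-empty-string approach (Python for illustration; the markup-only pattern is the question's, minus its text alternative, and `strip_markup` is an illustrative name):

```python
import re

# Markup-only version of the question's pattern: delete every match
# and whatever remains is the visible text.
MARKUP = re.compile(
    r"(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)"  # script/style blocks
    r"|(?:<!--[\s\S]*?-->)"                           # comments
    r"|(?:<[\s\S]*?>)",                               # any other tag
    re.IGNORECASE,
)

def strip_markup(html):
    return MARKUP.sub("", html)
```

For example, `strip_markup('<p>Hi <b>there</b></p><style>p{}</style>')` returns `'Hi there'`.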

It kind of works, but there seems to be a string here and there that gets picked up that shouldn't be.

Well yeah, that's because your expression—and regex in general—is inadequate to parse even valid HTML, let alone the horrors that are out there on the real web. First tip to look at, if you really want to chase this futile approach: attribute values (as well as text content in general) may contain an unescaped > character.

I would like to once again suggest the benefits of HTML Agility Pack.

ETA: since you seem to want it, here are some examples of markup that look like they'll trip up your expression.

<a href=link></a> - unquoted
<a href= link></a> - unquoted, space at front matched but then required at back
<a href="~/link"></a> - very common URL char missing in group
<a href="link$!*'link"></a> - more URL chars missing in group
<a href=lïnk></a> - IRI
<a href
    ="link"> - newline (or tab)
<div style="background-image: url(link);"> - unquoted
<div style="background-image: url( 'link' );"> - spaced
<div style="background-image: u&#114;l('link');"> - html escape
<div style="background-image: ur\l('link');"> - css escape
<div style="background-image: url('link\')link');"> - css escape
<div style="background-image: url(\
'link')"> - CSS folding
<div style="background-image: url
('link')"> - newline (or tab)

and that's just completely valid markup that won't match the right link, not any of the possible invalid markup, markup that shouldn't but does match a link, or any of the many problems with your other technique of splitting markup from text. This is the tip of the iceberg.
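One of the failure modes above can be demonstrated concretely: an unescaped `>` inside a quoted attribute value is valid HTML, but the generic "match any tag" alternative stops at the first `>` (a quick Python check; the sample markup is illustrative):

```python
import re

# The generic "match any tag" alternative from the question's pattern.
TAG = re.compile(r"<[\s\S]*?>")

# Valid HTML: '>' may appear unescaped inside a quoted attribute value.
html = '<a title="a > b" href="link">x</a>'

first_tag = TAG.search(html).group(0)
# The lazy match stops at the '>' inside the title attribute, splitting
# the tag in half; the rest of the tag then gets treated as page text.
print(first_tag)
```

Here `first_tag` comes out as `<a title="a >`, so ` b" href="link">` would be misclassified as text content.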

bobince
  • Regarding the "kind of works" thing: It baffles me because the regex *correctly* matches everything it's supposed to when I test it in RegexBuddy. It's only during testing in .Net that a couple of script / style strings show up. – d7samurai Oct 17 '10 at 01:22
  • Don't know what would cause a difference between the two regex engines for this—maybe default case sensitivity options?—but the expression as-is will fall over for common valid HTML constructs that might not be in your test set. To match a tag you really need at least to go in-depth on attribute value delimiters. At which point you end up with a big unwieldy regex that really isn't preferable to an HTML parser in any way. – bobince Oct 17 '10 at 01:39
  • But as I said, I'm not interested in the html itself - at all. In fact, it's essential that all the html is left untouched (except for the links, which are processed to be converted from relative to absolute where necessary - but that's easy). This is why it's a very clean and efficient way (especially code-wise) to just get all the html/tags/comments/scripts *masked out* in one fell swoop by the regex, and then just performing word replacement on the text during match iteration. – d7samurai Oct 17 '10 at 01:45
  • Here's an example: ``. This is part of the html document (as in the rendered source, retrieved through "view source" of the page in question). I'm using the exact same document in RegexBuddy as I am in my VS2010 project. This particular part of the document is correctly picked up by the regex in RegexBuddy. In my app, it shows up as not (although it's stripped down to `@import url(/layouts/Standard/styles/stylesGlobal-min.css);`. It's like you say, a discrepancy between the two engines. – d7samurai Oct 17 '10 at 02:08
  • I'm such an idiot! Since RegexBuddy and .Net have different syntax for declaring capture group names, I had overlooked one of the tags when I converted from RegexBuddy. That solved the problem - and the regex works like a charm now. So what I do is iterate through the match collection and only process matches that have content in their "text" capture group. And I can do this in almost no amount of code at all. – d7samurai Oct 17 '10 at 02:49
  • Ah, yeah... advanced features like this are often different across regex engines. Seriously, though, I don't think you realise what you're getting yourself into. You keep using words like “clean” and “elegant”, but this is really anything but that. Detecting `a href` attributes with regex is absolutely *not* simple, you would need to take apart a tag by quoted or unquoted attributes just to begin with. One piece of malformed markup or a `>` where you don't expect it and the results will fall apart. This can only ever work for *extremely* limited input (ie input you created yourself). – bobince Oct 17 '10 at 03:05
  • Hehe. Believe me, it's the right way for this. It's only for entertainment purposes, so an occasional hiccup in the resulting html page is totally acceptable. Here's the regex for pulling out links (href, src, action, url, background etc). The actual link (as well as the leading attribute) can be polled through their capture groups during match iteration: `\b(((?:(?Psrc|href|background|action|url) *(=|:) *(?P"|'| ))(?P[\w/.?=#&@:%+,();\-\[\]]+)(?P=mh))|(?Purl) *\((?P"|'| *)(?P[\w/.?=#&@:%+,();\-\[\]]+)(?P=mc)\))`. Try it. – d7samurai Oct 17 '10 at 03:14
  • Also, since I knew nothing about regex until today, when I finally decided to check out what it was all about earlier tonight, it's kind of a fun thing to play with, too :) And believe me, the code is clean and simple. In only a matter of about 10 lines of code, I can scan through an html document and reconstruct all its links to absolute (if they are relative) with a precision that is pretty high for something like this. I don't see that happening with DOM like models.. – d7samurai Oct 17 '10 at 03:21
  • And no, it's not for input that I create. In fact, this is what it does: You give it a link to some web page. Then you give it a list of words that you want replaced, along with corresponding replacement words. The app takes the html code, reconstruct all relative links to absolute, and then does the word replacement. In the end you have a document that looks perfectly like the original, but with an absurd twist, content wise. The page is stored as a compacted binary in a database, and can be served from a totally different server than it originated from, since all links are absolute. – d7samurai Oct 17 '10 at 03:24
  • @bobince Btw - here's a screenshot of how the regex for finding html links handles the source for the cnn.com frontpage: http://www.martinwardener.com/regex_links.jpg. So far I haven't seen it mess up once (except it won't find links that are "camouflaged" / constructed in scripts, or that are just parts of paths built later - but then again, what generic routine would?). The regex will find not only href, but src, url (in css references) and action, background and url attributes. And they can quote their links with ", ', or be unquoted. It seems to work quite well. – d7samurai Oct 17 '10 at 05:34
  • And finally - the way this regex finds links is independent of the tagging. So unescaped tags won't even affect it. If the link itself is well formed (which it would be, or it wouldn't link), it works. The rest of the html code can be as malformed it wants to :) – d7samurai Oct 17 '10 at 05:54
  • You consider the above long, messy link-attribute regex ‘clean’? Then I don't know what to say to you. It's a complicated and far-from-complete attempt to detect two completely different and incompatible syntaxes (CSS and HTML attributes) in one, that can be broken in about a hundred ways even by perfectly valid markup. HTML escapes, CSS escapes, `>` in attributes, IRI, false matches in attribute values, whitespace around quoted attributes, `url()` not detected properly... This sort of thing is really easy to do *correctly* with an HTML parser and hideously impossible in plain regex. – bobince Oct 17 '10 at 13:20
  • The regex is long only because I put literal names into it. Don't confuse the written line with 'mess' just because it's hard to read. All this will be *compiled* into symbols by the engine anyway. And no matter how you look at it, any parser needs to do traversing and string matching - just because it's hidden from you doesn't make it 'cleaner'. And regex is very efficient at string matching - which needs to be done in any implementation. If you wanted to, you could write it much 'cleaner' if the *look* of a regex string bothers you. – d7samurai Oct 17 '10 at 19:57
  • `\b((((src|href|action|url) *(=|:) *("|'| ))[CHARS]+\6)|url *\(("|'| *)[CHARS]+\7\))` – d7samurai Oct 17 '10 at 19:59
  • Now, this regex does the same job (sans the first "background" attribute, which is obsolete anyway). Just substitute CHARS for whatever characters you are allowing in the link. Since this is *not* a *generic* html parser, but just looking for links (of attributes you can decide), there is not much that will break it. It handles whitespace around attributes and their values. Remember, it won't pick up malformed links anyway, which is fine. And it's compiled before use in the application. It doesn't look for "css" sections, or specific tags, just links. It's actually *very* clean - and efficient. – d7samurai Oct 17 '10 at 20:04
  • The fact that you have to spell out the attribute names *somewhere* doesn't make it "messy". As I've been saying here, I'd be interested in seeing a "proper" HTML parser implementation with the same functionality, to compare amount of code, precision and execution speed. Its purpose is to list all links in a document, using the attributes src, href, action and url. – d7samurai Oct 17 '10 at 20:06
  • The regex isn't complicated because of the names in it, it's complicated because of the number of paths through the expression, making it hard to read, covering all the different cases it tries to handle... and fails, in many cases: seriously, that's not what URLs look like in CSS; and it really doesn't handle whitespace and attribute quoting anything like what the HTML grammar says... I don't know why you're so confident that “nothing” will break it. – bobince Oct 17 '10 at 22:22
  • I think you need to forget the DOM paradigm for a second. You're not seeing the forest for the trees here. When there is no tag parsing, there is nothing to "break". A `>` or `<` won't matter to the regex, because it's not looking for any. And unless you allow it to accept `<>` as valid url characters, it won't pick up any. An html file is really nothing more than a text file. Within that file you're looking for links. Not having to parse tags and nesting saves a lot of effort. And since links in html documents *are* "marked", it makes it even easier. – d7samurai Oct 17 '10 at 22:33
  • And regarding code paths. Seriously, if there's anything a DOM model does, it's pursuing bifurcating paths. The regex is simply a very compact way to describe what you're looking for. It doesn't have to first parse the text into DOM compliant entities and structure them - it simply looks for what you want, since it doesn't matter *where* in the structure it is located. The only thing one might want to filter for is "part of html code" and "not part of html code", but I made another regex for that, if that was a requirement. – d7samurai Oct 17 '10 at 22:36
  • Links in html are formatted like this: attribute="link", attribute='link', attribute=link, or even attribute: "link" etc, for the various attribute types allowed in the regex (all of these can also have spaces before or after the : or =, it will still pick them up). in css, it's url("link"), url('link'), url(link) (or in some cases with imports, url "link", but i haven't added that (which would be easy). Also with spaces before or after the parenthesis. It will still pick it up. So tell me, how can a valid link be "broken" or not picked up? And how is it more efficient to run this through DOM? – d7samurai Oct 17 '10 at 22:41
  • The reason there are two "paths" in the search is mainly because in css, the parenthesis used to mark the link uses a different character to mark the start `(` and the end `)` of the link. Please give me an example of a valid link in a piece of html code (otherwise compliant or not) that this regex won't pick up or mess up or be "broken" by. – d7samurai Oct 17 '10 at 22:46
  • What did you mean by it not handling whitespace? It does. Or attribute quoting? It handles all types of attribute quoting. And how hard the regex is to read to some person is totally irrelevant. It's made, and it works. No need to read it. Besides, it's not really that hard to read either, any more than regex generally is. But that's not the point of regex, is it. I used to do assembler programming on the 6502 and the 68000 processors years and years ago. Believe me, VB.Net is child's play compared to that, but it's far more efficient, even though it's not very reader friendly. – d7samurai Oct 17 '10 at 22:50
  • The regex is only interested in links that are coded so that they actually work as links, so "links" that are so malformed that they would not link to anything anyway are ignored. Which is good. No need to push DOM as if my point is to propose regex as an alternative to general html parsing models. But in this particular case, I have yet to see anyone show me an alternative, DOM-based method that does this easier, faster or more precise. – d7samurai Oct 17 '10 at 22:59
  • PS. How do you think DOM parsers work internally? By using a DOM parser? ;) – d7samurai Oct 17 '10 at 23:41
  • DOM-constructing parsers use a variety of string methods (yes, potentially including regex) to parse the low-level tokens of the basic grammar. (And yes, I've written one, and yes I'm also an assembler coder, thanks.) But regex really doesn't have the power to parse higher-level constructs. I am aware that CSS `url` tokens use parentheses, however, the expression pasted above does not contain any literal parentheses and won't match such a URL. It does seem to allow `url` as an attribute, which doesn't exist. It also can't cope with CSS-escaping or HTML-escaping inside the value. – bobince Oct 18 '10 at 00:11
  • No, it actually *does* include that: `url *\( *("|'|)[CHARS]+\7\)`, hence the last main alternation (at least my actual regex does - I now see that the `\`s before the `(` and `)` containing the link is missing in the ones i posted. I guess I must have been too eager when I was trimming it during post. Sorry). – d7samurai Oct 18 '10 at 00:20
  • Damn, now I see why - it even happened again - it must happen when "difficult" strings like that are posted here.. Well, at least you should be able to see that the parentheses needed for it are there, it's just the backslashes to make them literal that are missing. Yes, it allows url as an attribute, to also sweep up some url attributes in scripts etc, of which many have the same format. But regarding escaping - what escaping doesn't it cope with? – d7samurai Oct 18 '10 at 00:25
  • As for regex - of course it doesn't have the power to deal with higher level constructs. As I've been trying to say - I'm not posing regex as an alternative to DOM etc in that regard. I have repeatedly stressed that this is a specific task: to find links in an html document. No more no less. Doesn't matter where the link is, doesn't matter what it's for. Just find the links. And with that in mind, I see no reason to put this through some bloated, relatively speaking, higher level parsing mechanism, that will in fact make it clumsier, slower and require more code. – d7samurai Oct 18 '10 at 00:28
  • And the same goes for the original regex - for filtering out all code from a document to be left with only the text - and doing that efficiently while retaining the original structure of the document. Please please please show me a better way through DOM for any of those tasks. – d7samurai Oct 18 '10 at 00:29
  • And if you've written a DOM parser (from scratch), I'm even more baffled why you don't see that for a specific pattern search like this, it is an unnecessary detour to first have a parser parse for generic tokens, build an object hierarchy (including a lot of unneeded data and processing), just to have it then iterate through all its branches and leaves on my behalf to look for almost the same patterns I can get by scanning the flat source document itself directly. – d7samurai Oct 18 '10 at 00:38
  • Here's a link to a little page I set up: http://martinwardener.com/regex. It uses the aforementioned regex patterns to extract links and text from html pages. Simply listing the links or the text will give you a parse time (that includes building the string of html markup for displaying it in the browser. You can also see the full html markup of the page you entered, with the regex matches highlighted. It might give you an idea about how reliable and precise it actually is. A comparison with agility pack would have to include finding all the non-href-links, too – d7samurai Oct 18 '10 at 04:05
  • Hi! I saw your "trip-up-suggestions" :) Some of them are moot regarding my implementation, since it is slightly modified from the one you are basing this on. The missing URL characters are added - that's just an oversight, and I added those (thanks). But the character string can be whatever one wishes to allow, so it doesn't point out any flaws in the regex structure. Some codes (like tab and newline) can be stripped out in a pre-parse document trim, like my web example already uses (optionally) - or just added to the regex. The unquoted and spaced-at-front-but-missing-at-back ones are handled – d7samurai Oct 18 '10 at 10:21
  • Your input is helpful, since I'm not up to speed on (especially) the css escaping and folding issues. Then again, I'm not so sure it's a problem in this case - both because it to a certain extent can be stripped out before parsing, and secondly I am not so sure how much it's encountered in the wild, within a html file. Regarding the escaping in the middle of the css 'url' attributes - who would want to do that anyway? It's a tradeoff - and as I've mentioned, for these purposes, it's OK to slip a bit on semi-obscure notation. – d7samurai Oct 18 '10 at 10:37
  • Even without pre-processing the html and special escape handling in the regex, half of the examples are non-issues in the actual implementation, with missing characters being the problem with a handful alone (some of them were already in the regex in use). – d7samurai Oct 18 '10 at 10:43
  • Then it's the issue of detecting "unconventional" links within script blocks etc (ref comments under Vantomex' answer) - which I'm not sure Agility et al picks up on? Again - check out http://www.martinwardener.com/regex to easily see how the regex performs (both speed-wise and detection-wise) on real world html pages. – d7samurai Oct 18 '10 at 10:53
  • Also, remember that cases where the pattern wrongly detects a "link" are generally not a problem, either - because the point of the link pickup is to check whether they need rewriting (from relative to absolute), so they'll all go through some validation. If they don't qualify as links, they won't be touched anyway. – d7samurai Oct 18 '10 at 11:00

For Your Information,

Instead of regex, with jQuery it's possible to extract just the text from HTML markup. For that you can use the following pattern (where `htmlString` holds your markup):

$("<div/>").html(htmlString).text()

You can refer to this JSFiddle.

Birlla

Regex is not reliable for retrieving the textual contents of HTML documents. Regex cannot handle nested tags. Even supposing a document doesn't contain any nested tags, regex still requires every tag to be properly closed.

If you are using PHP, for simplicity, I strongly recommend using the DOM (Document Object Model) to parse/extract HTML documents. A DOM library exists in virtually every programming language.
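The answer names PHP, but the idea carries to any language with an HTML parser. As a rough illustration of the parser-based route, here is a sketch using Python's stdlib `html.parser` (the class and function names are made up for the example): it collects text nodes while skipping `<script>` and `<style>` content, and tolerates unclosed tags the way browsers do.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.skip = 0      # depth inside script/style elements
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)

def page_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return "".join(parser.parts)
```

For example, `page_text('<p>Hello <b>world</b></p><script>var x=1;</script>')` returns `'Hello world'`. Unlike the regex mask, the parser also decodes entities like `&lt;` for free.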

Vantomex
  • As I said in my post, I'm using VB.Net in Visual Studio 2010. Also, this is not a critical application - it's an entertainment utility, so if some pages have malformed html, that's just a minor scratch in the paint job. Also, I don't see why it wouldn't handle nested tags? As far as I can see, any tag is masked out - after the "problem tags" like SCRIPT, STYLE and comment are removed, it doesn't concern itself with the document structure at all. Which is good. The point here is to just have access to the raw text without messing with the structure. How would this be done in DOM? – d7samurai Oct 17 '10 at 01:29
  • Even if it is not a critical application, I bet regex will fail to parse and extract the bad HTML documents spread everywhere on the Internet. As I said, a DOM library should exist in every programming language. Also, regex cannot parse nested tags. – Vantomex Oct 17 '10 at 01:47
  • Could you guarantee that every tag in an HTML document is properly closed, for example the P, H, LI, etc. tags? Also, SCRIPT tags might contain tags inside them. – Vantomex Oct 17 '10 at 01:52
  • I don't see how a regex that blindly masks out anything enclosed in < and > can fail (apart from where there's no opening/closing bracket, in which case the document is to blame for the garbled-up results). And as you can see from the regex, it masks out whole blocks of code by not only searching for bracket markers, but whole sequences of code (as in, starting with `<script` and ending with `</script>`). – d7samurai Oct 17 '10 at 01:53
  • IMO, when browsers can display HTML documents with unclosed tags properly, our program should be able to extract them properly too. As the HTML 4 Specification says, many tags don't need to have their closing tags. It's an official specification. – Vantomex Oct 17 '10 at 02:00
  • There is an exception of course, regex can be used to parse particular HTML documents which the format is already known. – Vantomex Oct 17 '10 at 02:02
  • Well, reliability isn't so important here. Making it simple and clean is the priority (although, the time I've spent commenting here now probably cancels out any gain from that :) – d7samurai Oct 17 '10 at 02:30
  • One more example: supposing your HTML document contains a `P` tag like this: `<p>test test test test test</p>`. After you use `<p>(.*?)</p>`, you get the content of the `P` tag, so what next? The extracted result still contains many tags, and you don't get what you called "textual content". – Vantomex Oct 17 '10 at 02:53
  • Wrong. The regex I showed in the original posts correctly picks up *all* tags and separate them from plain text (which is also picked up, but as a separate match), due to the sequence of alternate patterns. It returns a collection of matches, effectively covering the complete document, but since I have created a capture group for the "just plain text" pattern - with a name reference - I can check whether that capture result has content. If it does, it means it was a "pure text string" match, and I can process the match string. – d7samurai Oct 17 '10 at 03:38
  • As you can see up top, it's not a simple `<p>(.*?)</p>` pattern. The regex looks like this: `(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?P<text>[^<>]*)`. – d7samurai Oct 17 '10 at 03:39
  • (and yes, I tested pasting your example string into my test documents, and it isolates all of it perfectly. Using RegexBuddy, you can see each match highlighted in alternating colours, so it's easy to see what each match contains. All tags are separate matches, and all the text strings are separate matches - identifiable during iteration because they are tagged with "text").. – d7samurai Oct 17 '10 at 03:44
  • OK, I missed your current regex part, so how to solve this: ` – Vantomex Oct 17 '10 at 04:10
  • Same thing there. Those tags are picked up and isolated from the text in between automatically, just as everything else. The text that is left is *only* plain *content*, which is the purpose. As you'll see in my answer above (with complete VB code), I have solved the actual goal, only not *completely* in regex. What you suggest would be too cumbersome, since I don't only need the text extracted, I need to *replace* a set of given words in it - and *put it back into the html* so that the page behaves exactly as it did. – d7samurai Oct 17 '10 at 04:45
  • Simply taking away the html means I'd have to build a mechanism for keeping track of which pieces of html/script etc goes where, and where to put the text back in. That's why this is preferable - since you can just hook onto the match iteration and replace as it goes along, preserving the document all the way. – d7samurai Oct 17 '10 at 04:46
  • In fact, I took a screenshot of the regex at work, in RegexBuddy, where you can clearly see how it picks out everything but the text with perfect consistency. I grabbed the source html from this page on newscientist.com: http://www.newscientist.com/article/mg20827826.200-a-3d-model-of-the-ultimate-ear.html. The picture is here: http://www.martinwardener.com/regex.jpg. As you can see, I pasted your examples in there, as well, so you can get a more visual impression of it. – d7samurai Oct 17 '10 at 04:55
  • No @Martin, the `<p>`, `<li>`, and `<table>` inside `The <p>, <li>, and <table> elements are very common` are textual content. – Vantomex Oct 17 '10 at 05:11
  • But in html that would be escaped to `The &lt;p&gt;, &lt;li&gt;, and &lt;table&gt; elements are very common`, so my regex would include them if they were part of the page's textual content. – d7samurai Oct 17 '10 at 05:38
  • Yes if your program only process HTML documents made by you. – Vantomex Oct 17 '10 at 07:31
  • Eh? All html - by definition - requires escaping < and > to &lt; and &gt; if it is to appear as plain text. If it isn't escaped, it is part of the html code and will represent tags - which are then, rightfully, masked out by the regex. It will also not show on the page as textual content to begin with, so you have to choose. Either it appears as part of the visible content, in which case it is escaped (and handled correctly by the regex), or it is coded literally as < and > in the document, in which case it is part of the code (and handled correctly by the regex). – d7samurai Oct 17 '10 at 07:42
  • Yes, by definition, but not by practice. I just give one popular website as a sample, have a look at all "try" pages in `www.w3schools.com`, e.g. `http://www.w3schools.com/html/tryit.asp?filename=tryhtml_intro` – Vantomex Oct 17 '10 at 07:54
  • I'm not sure I'm following. That's an html editor... Why don't you give me some example html code that you think would cause the regex to fail, and I'll test it out. – d7samurai Oct 17 '10 at 08:03
  • Have you done a survey of HTML documents out there? They are usually messy/bad HTML code, e.g. `The < sign is a less-than operator` or `The <> sign is an unequal operator`. All browsers render them correctly, even IE5, or maybe even IE4. – Vantomex Oct 17 '10 at 08:30
  • No I haven't LOL. But even so, number one, it's on them. Number two, for my purposes, it would not create a big problem, it would only mask a few words more (between the < sign and the next tag, whatever that would be). It's a problem when you're attempting to parse nesting etc, but as I've said, here the engine is just masking everything, so it doesn't attempt to "make sense" of the tag pairing. – d7samurai Oct 17 '10 at 08:37
  • Example number one works? Of course not; at first, your regex matches ``, then it matches `< is a less-than operator`, not ``, so your regex considers the problematic textual content as part of the last `` tag. No, I haven't LOL too here. – Vantomex Oct 17 '10 at 09:29
  • Finally, as I said before, if you insist on using regex (regardless of its weaknesses and inefficiency), you don't have to invert the matching; simply delete the matching pattern and store it in a variable, then you'll get the inverse of it. There is no way in Regex to invert the matching of a whole match. PowerGREP was based on the open source TPerlRegex library. Look into it. The inverse feature in PowerGREP is not done by a regex magic formula. So we need a dirty trick (as I mentioned above) to achieve the same result. – Vantomex Oct 17 '10 at 09:51
  • Well, as long as I have to do the final processing in code outside the regex itself, what I have done (and posted the code for as an answer to my own question here) is much more efficient. Deleting it and storing it in a variable would mean I'd have to keep track of which pieces of text belonged between what pieces of code etc. Instead, the regex matches *everything* in the code, except that it tags the plain text through its capture group. So during match iteration/regex replace, I just check for the tag and do the processing of the text *as it's traversing* the document. It works very well. – d7samurai Oct 17 '10 at 10:04
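The technique described here - adding a final alternative that captures runs of plain text into a named group, then checking for that group while iterating - can be sketched in a few lines of Python (sample markup is hypothetical; the asker's actual implementation is VB.Net):

```python
import re

# The question's masking pattern plus a final alternative that captures
# any run of characters containing neither < nor > into a "text" group.
pattern = re.compile(
    r"(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)"
    r"|(?:<!--[\s\S]*?-->)"
    r"|(?:<[\s\S]*?>)"
    r"|(?P<text>[^<>]+)"
)

html = "<p>Hello <b>world</b></p><script>var x = 1;</script>"

# Every match covers some slice of the document; only the plain-text
# slices populate the "text" group, so tag/script matches are skipped.
texts = [m.group("text") for m in pattern.finditer(html) if m.group("text")]
print(texts)  # ['Hello ', 'world']
```

Because the matches jointly cover the whole document in order, a replacement callback can rewrite only the `text` matches while emitting every other match unchanged - which is exactly the "process the text as it's traversing the document" behavior described above.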
  • Yes, of course it does consider the "< is a less than operator" as the closing tag for the preceding tag. Not really much to do about that, unless you want to take it upon yourself to clean up the web for people. But the problem is minuscule. But regarding the efficiency.. 1) have a look at the code I posted. 12 lines of code. 2) It's very fast - and the regex is compiled for performance. 3) It's very accurate. Show me the corresponding code with the technique you suggest - I'm curious to see how that looks and performs. – d7samurai Oct 17 '10 at 10:09
  • I must confess, I have never learned VB/VB.NET, but three weeks ago I started learning VBA. So I understand some of your code, but not all. Thus, I couldn't speak much about your challenge. Good Luck! – Vantomex Oct 17 '10 at 10:51
  • But VB.Net or not. Since you seem to be familiar with DOM (in whatever language), I'd be interested in seeing how you would implement a subroutine that would find *every* parseable link in a given html document and allow you to iterate through them.. – d7samurai Oct 17 '10 at 23:04
  • See the *very first example* at htmlagilitypack's examples page for a trivially simple and *correct* method to extract all `a href`s. Repeat for any other URL-containing attributes you want. If you need CSS you can extract that from inline styles and stylesheets, but again, using regex on it isn't reliable - though more reliable than using regex over CSS-in-HTML. – bobince Oct 18 '10 at 00:44
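For comparison, the parser-based approach bobince describes looks roughly like this. Html Agility Pack is a .NET library, so this is a hedged sketch using Python's standard-library `html.parser` instead (the markup sample is hypothetical), but the shape is the same: an event-driven parser hands you each start tag and its attributes, and you collect the `href` values:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href attribute values from <a> tags via a real parser."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

collector = LinkCollector()
collector.feed('<a href="/home">Home</a> <a href="http://example.com">Ext</a>')
print(collector.hrefs)  # ['/home', 'http://example.com']
```

Note this only sees links expressed as markup attributes; URLs buried inside script text (the case raised later in this thread) would still need separate handling.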
  • Well. Efficiency-wise, how much parsing do you think is going on under the hood in that operation? That parsing is being done, even if I'm not the one coding it. So efficiency-wise, I think the overhead is way way higher with agility pack (for this particular purpose). In addition, the regex doesn't just find href links, it finds a bunch of different ones. And it also finds links within scripts etc. The point here is to rebuild the links in a document (primarily from relative to absolute), so the document will function when served from another server. – d7samurai Oct 18 '10 at 03:51
  • Here's a link to a little page I set up: http://www.martinwardener.com/regex/. It uses the aforementioned regex patterns to extract links and text from html pages. Simply listing the links or the text will give you a parse time (that includes building the string of html markup for displaying it in the browser. You can also see the full html markup of the page you entered, with the regex matches highlighted. It might give you an idea about how reliable and precise it actually is. A comparison with agility pack would have to include finding all the non-href-links, too. – d7samurai Oct 18 '10 at 03:57
  • Question: How does Agility pack handle hrefs in tags like ``? – d7samurai Oct 18 '10 at 04:22
  • or ``? – d7samurai Oct 18 '10 at 04:38
  • `window.location.href = 'http://htmlagilitypack.codeplex.com/Wiki/Search.aspx' + '?tab=Home&SearchText=' + searchText;` – d7samurai Oct 18 '10 at 04:51
  • When I said inefficiency of Regex, I was comparing Regex with XPath (not DOM); XPath is faster than DOM in terms of execution speed. I said in my answer to use DOM for simplicity AND reliability, because you said your program is not a critical application. Do you guess Regex should be faster than XPath when parsing HTML documents? Regex was not designed for working with HTML docs (they are not regular). Efficiency here means efficiency in terms of processor burden. Regex often tries MANY possible permutations to match a pattern, especially when using the alternation operator `|`. – Vantomex Oct 18 '10 at 05:00
  • One more thing: even if for your specific cases you find that Regex is faster than XPath, that is NOT efficient, because efficiency means "a minimum effort to achieve reliability". I forgot to say, one line of code isn't always faster than 100 lines of code. – Vantomex Oct 18 '10 at 05:06
  • You have to distinguish between general HTML parsing and the parsing that is needed here. In this case, yes, I'm claiming regex is faster. All parsers need to perform something similar to regex at the lowest level, then use *that* information to build up a model of the document hierarchy. **Then** you can start doing your searches through the document - and as it turns out - you'll probably have to do many passes to pin down the same information as these regexes do. Check the link to a demo i made: http://www.martinwardener.com/regex/ – d7samurai Oct 18 '10 at 05:09
  • As I showed earlier, this takes a regex (that is *compiled* in itself before the application runs) and gets all the job done with a few lines of code. Even XPath would have to *somehow* search the whole text to pin down the tokens it is looking for - and probably even uses regex for that internally (!). Then you have a bunch of integrity checks and creating objects of the parsed data and organizing it. Then *you* can start actually searching through it. It's of course the price to pay for general capabilities. Specialization will always be better at some particular task. That's just how it is. – d7samurai Oct 18 '10 at 05:15
  • The "regular" in "regular expressions" is not in reference to the text you are searching *through* - it is to the fact that what you are searching *for* has a regular form - a pattern. – d7samurai Oct 18 '10 at 05:18
  • No, (X)HTML/XML parsers don't use permutations like Regex does. Sometimes Regex accidentally meets a worst case. Feel free to post your benchmark results here and the methods you used to benchmark them; hopefully you are willing to write a definitive conclusion. Sure, I will vote up if it turns out you are right. – Vantomex Oct 18 '10 at 05:26
  • I'm not talking about how the parsers work *after* they have scanned the raw document for tokens, I'm talking about how they scan it in the first place. As in, the source code for the parser itself, not the user interface. At some level, there will be plain text searching going on through the raw data. If the parser uses more time simply reading the document and creating the object model than the regex does on the whole job, it's already lost the efficiency test. If parsers somehow use some magic method for searching through text, I'd like to know how to use *that* directly. – d7samurai Oct 18 '10 at 05:33
  • As for "voting up".. Lol. This whole thing was a question about whether regex was able to provide a certain functionality, not a competition between regex and various DOM implementations regarding general html parsing. But everyone seems to be so set on promoting DOM-type models that they lose track of what this is about. As for benchmarking, the online demo I put up should give you some pointers, at least (although it's running as an asp.net application on a shared server at discountasp.net). – d7samurai Oct 18 '10 at 05:37
  • Well, for your last comment, I would say again that efficiency means "a minimum effort to achieve reliability with the fastest speed". I have looked at your given link; it is amazing and I appreciate your work, but since the links there refer to an online site, I couldn't see the real speed it offers. – Vantomex Oct 18 '10 at 05:41
  • The task at hand is this: Find and extract as many links as possible in a given html document (in any variety). Then see how fast it's done, and how much code you need to write to do it. – d7samurai Oct 18 '10 at 05:42
  • LOL! Didn't you start the competition by offering a challenge in your previous comment? – Vantomex Oct 18 '10 at 05:44
  • (The other task is, of course, to find and extract all plain text from a html page). As for the demo, it doesn't count the time it takes to download the html - it starts the clock when it has the text and starts doing the search, and stops it when it has created the highlighted list of links/the plain text extract. But I included the option to see the markup, too, so that you can go through it and see what it has missed (if anything) and if it picked out something wrongly. It rarely does, and if so, it's a special case that generally is easily detected in the post processing (if you need any). – d7samurai Oct 18 '10 at 05:46
  • Well, I'd be happy to see the two methods go head to head, but then someone would have to make a .Net implementation in DOM.. – d7samurai Oct 18 '10 at 05:47
  • No, what I meant was that no one here is saying that regex is a challenger to DOM when it comes to general html parsing. I'm only using it for these particular tasks. And that's why it surprises me that no one seems to want to admit that in a "flat file" parsing scenario like this, DOM might not be the best solution. – d7samurai Oct 18 '10 at 05:49
  • And Vantomex, I have something for you LOL: http://www.martinwardener.com/booyaa/?article=14 – d7samurai Oct 18 '10 at 06:04
  • LOL! Oh, damn good, thanks for publishing my personality and heroics to the public. :-) – Vantomex Oct 18 '10 at 07:20
  • See, this is what the routine is intended for :) It needs to isolate the text from the html so that it can perform search and replace on it. Then it detects all links and reconstructs them from relative to absolute where needed, before it packs the whole file into a database. It can then be retrieved like you just saw, and served from a different server with all links (and most functionality) intact.. :) BTW, I just uploaded a touched-up version of the online regex demo, you should check it out. After I removed the result color coding from the actual extraction loop, it's much faster. – d7samurai Oct 18 '10 at 08:11
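The relative-to-absolute link reconstruction described in this last comment is a well-defined operation; a minimal Python sketch (hypothetical base URL) shows the resolution rules it has to follow:

```python
from urllib.parse import urljoin

base = "http://www.example.com/articles/page.html"

# Path-relative URLs resolve against the page's directory:
print(urljoin(base, "images/logo.png"))    # http://www.example.com/articles/images/logo.png
# Root-relative URLs resolve against the site root:
print(urljoin(base, "/css/site.css"))      # http://www.example.com/css/site.css
# Already-absolute URLs pass through unchanged:
print(urljoin(base, "http://other.com/"))  # http://other.com/
```

In the approach described in this thread, each extracted link candidate would be rewritten this way and substituted back at its original match position, leaving the rest of the markup untouched.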