
I am working with regexes in Access VBA (VBScript Regular Expressions 5.5) and there's some behaviour I don't understand. Is this normal regex behaviour? Why?

The input is

some html ... id="devices_internal_table">Some interestingText</a>
< more html

I need to find different things here, but I am stuck with this:

pregexp.Pattern = "devices_table_internal([.]*?)\n<"        ' (A1)
pregexp.Pattern = "devices_table_internal([.\n]*?)<"        ' (A2)

pregexp.Pattern = "devices_table_internal(.*?)\n<"          ' (B1)
pregexp.Pattern = "devices_table_internal([.""<>\n]*?)<"    ' (B2)
pregexp.Pattern = "devices_table_internal([.""<>]*?)\n<"    ' (B3)
pregexp.Pattern = "devices_table_internal((.*\n)*?)<"       ' (B4)

Patterns A don't give any results, while patterns B do.

  • Isn't A1 equal to B1?
  • B1 suggests that <, > and " are matched by ., so why doesn't A2 work (while B2 does)?
  • The same goes for B4/A2: multiple lines followed by < work, but multiple [characters or line breaks] followed by < don't?

As I need several different regexes, I am more interested in explanations of these three oddities than in solutions for finding the "interesting Text" ;)

1 Answer


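[.] is a character class consisting of just a period; inside a character class the period loses its special meaning. . by itself matches any character except a line break.

[.]*, therefore, matches any number of periods (and would usually be written as \.* instead), while .* matches any number of characters other than line breaks. For the same reason, [.\n] in A2 matches only periods and line breaks, so it cannot get past the text that precedes the <.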
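
The difference can be checked with a small test routine. This is only a sketch: the procedure name DemoDotVersusClass, the shortened id "tbl" and the sample string are made up for illustration and are not taken from the question; the patterns are reduced versions of A1, B1, A2 and B4.

```vb
' Illustrative sketch; names and sample data are hypothetical.
Sub DemoDotVersusClass()
    Dim re As Object
    ' Late-bound VBScript regular expression object
    Set re = CreateObject("VBScript.RegExp")

    ' One line of markup, a line feed, then the "<" of the next line
    Dim s As String
    s = "id=""tbl"">Some words</a>" & vbLf & "< more"

    re.Pattern = "tbl([.]*?)\n<"     ' A1 shape: only literal periods allowed before the line break
    Debug.Print "A1", re.Test(s)     ' no match: the next character is a quote, not a period

    re.Pattern = "tbl(.*?)\n<"       ' B1 shape: any characters except a line break
    Debug.Print "B1", re.Test(s)     ' match

    re.Pattern = "tbl([.\n]*?)<"     ' A2 shape: only periods and line breaks allowed before "<"
    Debug.Print "A2", re.Test(s)     ' no match

    re.Pattern = "tbl((.*\n)*?)<"    ' B4 shape: whole lines, each ending in a line break
    Debug.Print "B4", re.Test(s)     ' match
End Sub
```

Running it from the Immediate window should report a match only for the B1 and B4 shapes, since those are the ones where . keeps its "any character" meaning outside a character class.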

Also, see the most highly upvoted answer on Stack Overflow for why you shouldn't try to parse HTML with a regular expression.

Wooble
  • ... and the last three times I checked it out there was no Interesting Text (as before) because something else went wrong. So that explains B2 & B3. Thanks. Sometimes it's embarrassingly easy to overlook one's own mistakes. As to the regex/HTML point, that is clear if you want to parse a whole page. If you only need one tag/end tag with a specified ID, it's still perfectly OK. –  May 08 '12 at 11:41