
I'm trying to use regular expressions for the first time and having some trouble, maybe with my syntax.

Here's a sample string contained in my source file I'd like to find:

Type = Creature / Animal / Elephant

"Type = " is static, however the three values between the forward slashes can change.

The search string I'm using is:

\bType = .*/.*/.*\b

My search string works fine, however my source file is HTML and some of the strings have HTML code embedded:

Type = Creature / Animal / Elephant 
Type = Creature / Animal / Elephant<br />
Type = Creature / Animal / Elephant</span></span></strong>

Stuff like that (it's not very good HTML, maybe copy-pasted from Microsoft Word?)

For my search expression, this is one of the results:

Type = Creature / Many&nbsp;Fish&nbsp;/ Tuna&nbsp; </span></span></li

I don't understand why the result isn't stopping at "&" or "<" after Tuna.

Any thoughts on how my expression has to be changed to handle these variants?

I'm working in VBA in Microsoft Excel, using the Microsoft VBScript Regular Expressions 5.5 library. Thank you.

SLeepdepD
  • this is non-trivial as you have `/` in your data as well as a delimiter. You need to work out what makes your delimiter unique - how do you distinguish it from another `/` somewhere else. – Boris the Spider Apr 22 '13 at 18:48
  • @OmarJackman thanks for the reply. unfortunately the HTML is sloppy and not XML-compliant. i think it's either regular expressions or InStr() and Mid() functions :) – SLeepdepD Apr 22 '13 at 18:49
  • @bmorris591 thanks for responding. the way i understand it forward slash isn't a special character in regular expressions--backslash is though. – SLeepdepD Apr 22 '13 at 18:51
  • What 'code'/'tag-soup' surrounds the string you are after (eg: `Type = Creature / Animal / Elephant`). Is it always contained in some element (like `span`)? – GitaarLAB Apr 22 '13 at 18:58
  • hi. what comes after seems to always be the start of a tag (so "<"), a space, or "&nbsp;". i thought using \b would return only what's before these three. – SLeepdepD Apr 22 '13 at 19:08

1 Answer

Your regex:

.*/.*/.*\b

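is consuming too much, since .* matches greedily: each .* grabs as much of the line as it can, so the match runs on to the last slashes in the line, including the ones inside closing tags such as </span>. You could make them all reluctant (.*?), but it's a bit unclear how to make that logic work here. So, instead, this specifies more precisely what should be matched: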

[^/]*/[^/]*/ \w+

Rather than .*, this uses [^/]*, meaning any run of characters other than a "/". That keeps each segment from consuming past a slash, which matters when there are further slashes later in the line, as in a couple of your examples. The final " \w+" is a space followed by 1 or more word characters (letters, digits, underscores). It will not consume whitespace or &, but it sounds like that is the intent.
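To make the difference concrete, here is a minimal sketch that runs both patterns over sample lines from the question. It uses Python's re module purely as a stand-in for the VBScript RegExp object, on the assumption that the two engines agree on the plain syntax used here (greedy *, negated character classes, \w, \b). The "\bType = " prefix is taken from the question, and the names GREEDY, NEGATED and first_match are made up for the illustration.

```python
import re

# Python's re is a stand-in for VBScript RegExp 5.5 here (assumption: the
# syntax used below behaves the same way in both engines).
GREEDY = re.compile(r"\bType = .*/.*/.*\b")         # pattern from the question
NEGATED = re.compile(r"\bType = [^/]*/[^/]*/ \w+")  # pattern from this answer

samples = [
    "Type = Creature / Animal / Elephant<br />",
    "Type = Creature / Animal / Elephant</span></span></strong>",
    "Type = Creature / Many&nbsp;Fish&nbsp;/ Tuna&nbsp; </span></span></li>",
]


def first_match(pattern, text):
    """Return the text of the first match, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None


greedy_results = [first_match(GREEDY, s) for s in samples]
negated_results = [first_match(NEGATED, s) for s in samples]

for s, g, n in zip(samples, greedy_results, negated_results):
    print("input:   ", s)
    print("greedy:  ", g)   # runs on into the trailing markup
    print("negated: ", n)   # stops after the first word of the third value
    print()
```

One thing to keep in mind with this sketch: \w+ stops at the first non-word character, so a third value made of several words (say "African Elephant") would be cut off after the first word.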

Really though, I suspect the better solution for you is to not use regex for this at all.

femtoRgon
  • VBScript supports non-greedy matches, so you could use `.*?` instead of `[^/]*`. – Ansgar Wiechers Apr 22 '13 at 22:19
  • @AnsgarWiechers Yes, you Could, as I said. However, this: `.*?/.*?/.*?` is clearly wrong, as is `.*?/.*?/.*?\b`. `.*?/.*?/...*?\b` might do the job, assuming we won't run into extra whitespace there, but I find using `[^/]*` states the intent a lot more clearly. – femtoRgon Apr 22 '13 at 22:29
  • In my experience more special characters tend to make expressions less readable, thus I'd clearly prefer something like `.*?/.*?/ \w+` over something like `[^/]*/[^/]*/[^&<]*`. – Ansgar Wiechers Apr 22 '13 at 23:24
  • To each their own, of course. However, when I read `.*?` all I know is it is to read some stuff. `[^/]*` I can read that it is to read anything but a `/`, which is my intent. It states its purpose. As far as `[^&<]`, I don't know where that came from. Of course, arguing readability of regexes is about the equivalent of comparing the fine bouquet of types of manure, so it doesn't really matter much. – femtoRgon Apr 22 '13 at 23:36
  • Thanks so much. This is what I ended up using: `\bType = [^/]*/[^/]*/[^<]*<` – SLeepdepD Apr 23 '13 at 16:51
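
For anyone comparing the variants from the comments, here is a rough side-by-side sketch. As before, Python's re stands in for the VBScript engine (assumed to agree on lazy quantifiers and negated classes), and every pattern is prefixed with the asker's "\bType = ". The capture groups and the extract_values helper are additions for illustration only and are not part of the pattern the asker posted.

```python
import re

line = "Type = Creature / Animal / Elephant<br />"

# Patterns quoted in the comments, each prefixed with "\bType = ".
variants = {
    "lazy":          r"\bType = .*?/.*?/.*?",
    "lazy + \\b":    r"\bType = .*?/.*?/.*?\b",
    "lazy + word":   r"\bType = .*?/.*?/ \w+",
    "asker's final": r"\bType = [^/]*/[^/]*/[^<]*<",
}

results = {name: re.search(p, line).group(0) for name, p in variants.items()}
for name, text in results.items():
    print(f"{name:14} -> {text!r}")

# Hypothetical extension of the final pattern: capture groups around the
# three values, so the trailing "<" and any &nbsp; padding can be dropped.
FINAL_GROUPED = re.compile(r"\bType = ([^/]*)/([^/]*)/([^<]*)<")


def extract_values(text):
    """Return the three values as a tuple, or None when the line has no match."""
    m = FINAL_GROUPED.search(text)
    if m is None:
        return None
    return tuple(g.replace("&nbsp;", " ").strip() for g in m.groups())


print(extract_values(
    "Type = Creature / Many&nbsp;Fish&nbsp;/ Tuna&nbsp; </span></span></li>"
))
```

The first two lazy variants stop before the third value, because a trailing `.*?` is satisfied by matching as little as possible. The final pattern includes the `<` in the match and needs a `<` somewhere after the third value, so a line with no tag following it would not match.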