
I have made this regex:

(?<=span class="ope">)?[a-z0-9]+?\.(pl|com|net\.pl|tk|org|org\.pl|eu)|$(?=<\/span>)$

It matches strings like example.pl, example12.com and something.eu, but it also matches dontwantthis.com.

How can I avoid matching a string if it contains the substring dontwantthis?

Scott
  • What's your client written in? – hd1 Dec 30 '12 at 04:35
  • @hd1 Oh sorry, it's `PHP 5.4`. – Scott Dec 30 '12 at 04:40
  • **Don't use regular expressions to parse HTML**. You cannot reliably parse HTML with regular expressions. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php.html for examples of how to properly parse HTML with PHP modules. – Andy Lester Dec 30 '12 at 05:07

2 Answers


You're probably following your regex with a loop to cycle through the matches. In that case, it's probably easiest to check each match for the dontwantthis substring and `continue` if it's there. Trying to implement the exclusion in the regex itself is just asking for trouble.
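
A minimal sketch of that loop, under stated assumptions: the sample `$html` is hypothetical (the question does not show the real markup), and `$pattern` is a simplified stand-in for the asker's regex, not a copy of it.

```php
<?php
// Hypothetical sample input; the real markup is not shown in the question.
$html = '<span class="ope">example.pl</span>'
      . '<span class="ope">dontwantthis.com</span>'
      . '<span class="ope">something.eu</span>';

// Simplified pattern (an assumption): capture a domain-like string
// between the opening and closing span tags.
$pattern = '~<span class="ope">([a-z0-9]+\.(?:net\.pl|org\.pl|pl|com|tk|org|eu))</span>~';

preg_match_all($pattern, $html, $matches);

$domains = array();
foreach ($matches[1] as $domain) {
    // Skip any match that contains the unwanted substring.
    if (strpos($domain, 'dontwantthis') !== false) {
        continue;
    }
    $domains[] = $domain;
}

print_r($domains);
```

With this input, `$domains` ends up holding example.pl and something.eu. The exclusion lives in ordinary PHP, so adding a second unwanted substring later means one more condition rather than a rewritten pattern.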

Niet the Dark Absol
  • Could you explain more about the "trouble" in this case? I'm learning regex and just curious. – Scott Dec 30 '12 at 04:51
  • Have you seen the regex for an email? [This](http://www.regular-expressions.info/email.html) is the kind of thing born when someone tries pedantically to do everything in regex. – Niet the Dark Absol Dec 30 '12 at 05:39

It seems that you are extracting content from span elements using a regular expression. Now, despite all the reasons why this is not such a good idea...

... just keep the expression you have. Then, if you have a match, filter out the matched entries that should be rejected.

$match = extractContentFromHtml($html);  // use regex here, return false if no match
if ($match && validMatch($match)) {
    // do something
}

where validMatch(string) should check if the value exists in some array, for example.
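
One possible reading of those two helpers, as a sketch only: the function names come from the snippet above, but their bodies, the simplified pattern, and the `$rejected` list are all assumptions made for illustration.

```php
<?php
// Returns the first domain-like string found in a span, or false if none.
// The pattern is a simplified assumption, not the asker's exact regex.
function extractContentFromHtml($html)
{
    $pattern = '~<span class="ope">([a-z0-9]+\.(?:net\.pl|org\.pl|pl|com|tk|org|eu))</span>~';
    if (preg_match($pattern, $html, $m) === 1) {
        return $m[1];
    }
    return false;
}

// Accepts a match unless it appears in a (hypothetical) list of rejected values.
function validMatch($match)
{
    $rejected = array('dontwantthis.com');
    return !in_array($match, $rejected, true);
}

$html = '<span class="ope">dontwantthis.com</span>';
$match = extractContentFromHtml($html);
if ($match && validMatch($match)) {
    echo "accepted: $match\n";
} else {
    echo "rejected or no match\n";
}
```

To reject by substring rather than by exact value, the `in_array` call could be swapped for a `strpos` check against each rejected string.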

Yanick Rochon
  • Well, I got your point, but in this case I don't really have to care about the [X]HTML formatting, because only the `` tags will change. I don't really see the reason for parsing hundreds of results with an additional function when regex (as far as I know) can exclude a result if the text contains a specified string. There are some conditions already, so why would another simple one be a threat? – Scott Dec 30 '12 at 05:03
  • I'm not sure what your raw input is. If it's just some simple `span` HTML strings (and not an entire HTML document), it's good enough. If it's a big HTML chunk, you should probably parse it, extract the text nodes from the spans, and collect them instead. Anyhow, AFAIK, regexes are meant to match stuff, not the opposite, and doing it in two steps will not only be mentally safer, it will be clearer and more maintainable :) – Yanick Rochon Dec 30 '12 at 05:08
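
For comparison, the in-pattern exclusion raised in the comments can be sketched with a negative lookahead. This is not taken from either answer; the pattern is a simplified assumption rather than the asker's exact regex, and the sample `$html` is hypothetical.

```php
<?php
// Hypothetical sample input; the real markup is not shown in the question.
$html = '<span class="ope">example.pl</span>'
      . '<span class="ope">dontwantthis.com</span>'
      . '<span class="ope">something.eu</span>';

// (?<=...)  fixed-length lookbehind: the match must start right after the opening tag.
// (?!...)   negative lookahead: fail if "dontwantthis" occurs anywhere in the name part.
// (?=...)   lookahead: the match must be followed by the closing tag.
$pattern = '~(?<=<span class="ope">)'
         . '(?![a-z0-9]*dontwantthis)'
         . '[a-z0-9]+\.(?:net\.pl|org\.pl|pl|com|tk|org|eu)'
         . '(?=</span>)~';

preg_match_all($pattern, $html, $matches);

print_r($matches[0]);
```

With this input, `$matches[0]` holds example.pl and something.eu. The lookbehind pins the start of the match to the tag boundary, which is what stops the engine from skipping ahead and matching a tail of the unwanted name. Each additional exclusion makes the pattern harder to read, which is the trade-off both answers warn about.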