Regex for matching one url but not the other

Question

Completely new programmer here, having trouble with regular expressions despite trying various online regex testers. I'm working in Eclipse on an Android project. I'm querying an OpenX ad server for a text ad and getting this in return:

var OX_abced445 = '';
OX_abced445 += "<"+"a href=\'http://the.server.url/openx/www/delivery/ck.php?oaparams=2__bannerid=29__zoneid=3__cb=e3efa8b703__oadest=http%3A%2F%2Fsomesite.com\'target=\'_blank\'>This is some sample text to test with!<"+"/a><"+"div id=\'beacon_e3efa8b703\'style=\'position: absolute; left: 0px; top: 0px; visibility:hidden;\'><"+"img src=\'http://the.server.url/openx/www/delivery/lg.php?bannerid=29&amp;campaignid=23&amp;zoneid=3&amp;loc=1&amp;cb=e3efa8b703\' width=\'0\'height=\'0\' alt=\'\' style=\'width: 0px; height: 0px;\' /><"+"/div>\n";
document.write(OX_abced445);

I need to extract the first href URL but not the img src URL, so I figure I need a regex that captures everything between href=\' and the closing \'. I also need to extract the link text, i.e. "This is some sample text to test with!", which sits between _blank\'> and <"+"/a>. I've found plenty of regexes for extracting URLs and the like, but have struggled to get one working in Eclipse for this particular case. Any assistance would be appreciated.

You seem to be using regex to parse HTML. Please see the first answer to this question: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/ — Aurand, May 27 '13 at 20:07
Regex is not a great tool for this unless you know that the string format is quite rigid. — Old Pro, May 27 '13 at 20:07
http://stackoverflow.com/questions/1667278/parsing-query-strings-in-java — IvanH, May 27 '13 at 20:25
I had read other suggestions to the effect that one should use jsoup or some other dedicated html parser in these cases. My thinking was that because this query to the ad server will always return exactly the same result as above with the only difference being the url and target text I could get away with using a regex. Would you still suggest using jsoup or something else? — Charlie NS, May 27 '13 at 21:29

score 0 · Accepted Answer · edited May 23 '17 at 11:43

It is a very bad idea to try to parse JavaScript that generates HTML with regex. Use something like JSoup or Validator.nu for Java or Nokogiri for Ruby instead. If you must use a regex:

Plain regex:
^.*? href=\\'([^']+)\'[^>]*>([^<]*)<

or, in Java:

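import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile("^.*? href=\\\\'([^']+)\\'[^>]*>([^<]*)<",
                            Pattern.MULTILINE);
Matcher m = p.matcher(hideousString);
if (m.find()) {
    // m.group(1) is the URL (plus a trailing backslash, since [^']+ also
    // matches the \ before the closing quote); m.group(2) is the link text
}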

Either form captures the href URL in capture group 1 and the link text in capture group 2, but it will break quickly if the site changes its response format.
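For anyone who wants to try the pattern outside Android, here is a minimal self-contained sketch. It assumes the server response contains a literal backslash before each single quote, as in the question. The class name `AdLinkExtractor`, the `extract` helper, and the shortened `SAMPLE` string are made up for illustration. Because `[^']+` also consumes the backslash in front of the closing quote, the sketch strips that trailing backslash from group 1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AdLinkExtractor {
    // Same pattern as in the answer, written as a Java string literal.
    static final Pattern LINK = Pattern.compile(
            "^.*? href=\\\\'([^']+)\\'[^>]*>([^<]*)<", Pattern.MULTILINE);

    // Shortened stand-in for the ad server response (assumption: the real
    // response has a literal backslash before every single quote).
    static final String SAMPLE =
            "var OX_abced445 = '';\n"
            + "OX_abced445 += \"<\"+\"a href=\\'http://the.server.url/openx/www/delivery/ck.php"
            + "?oaparams=2__bannerid=29__zoneid=3__cb=e3efa8b703__oadest=http%3A%2F%2Fsomesite.com\\'"
            + "target=\\'_blank\\'>This is some sample text to test with!<\"+\"/a>"
            + "<\"+\"img src=\\'http://the.server.url/openx/www/delivery/lg.php?bannerid=29\\' />\\n\";\n"
            + "document.write(OX_abced445);\n";

    /** Returns {url, text}, or null when the response has no matching link. */
    static String[] extract(String response) {
        Matcher m = LINK.matcher(response);
        if (!m.find()) {
            return null;
        }
        String url = m.group(1);
        // [^']+ also swallows the backslash before the closing quote; drop it.
        if (url.endsWith("\\")) {
            url = url.substring(0, url.length() - 1);
        }
        return new String[] { url, m.group(2) };
    }

    public static void main(String[] args) {
        String[] link = extract(SAMPLE);
        System.out.println(link[0]); // the ck.php click URL
        System.out.println(link[1]); // the link text
    }
}
```

The img src URL is never returned because the pattern only looks for ` href=`, and `find()` stops at the first match. Checking the result of `find()` before calling `group()` avoids an `IllegalStateException` when the response format changes.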

answered May 27 '13 at 21:29

Old Pro

Thanks very much. The linked rant is actually quite helpful as it makes me feel much less like an idiot for not getting it to work initially. Could I trouble you to suggest, in order of preference for a noob skillset, alternative parsers that I should familiarize myself with? – Charlie NS May 27 '13 at 21:41
Thanks again. I do appreciate the warnings against using regex. I don't want my soul fed to the Old Ones quite yet. – Charlie NS May 27 '13 at 21:59
One follow-up question, if I may? I've confirmed that the regex you provided works 100% as desired in Android, yet I've tried maybe four different online Java regex testers and have yet to find any that produce the same successful result. Did you arrive at that regex just from familiarity with the syntax, or did you use a specific tool? I was wondering if the multiline nature of it was messing up the online testers. – Charlie NS May 30 '13 at 02:40
I am very familiar with regex so it didn't take me long to get this one right, and I tested it with http://www.regexplanet.com/advanced/java/index.html which is specific to Java and has all the options. Most regular expression implementations have a special behavior with respect to line terminators inside the string. See the Java Pattern DOTALL and MULTILINE [options](http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#MULTILINE) for example. You need to use a tester that lets you specify non-default behavior or use a different pattern. See also `find()` vs. `matches()`. – Old Pro May 30 '13 at 03:21
I must be missing something still. When I paste your regex into the regexplanet link along with my string it still never finds anything. The other oddness is that it wants to add additional backslashes when displaying the regex as a java string. I understand that if you want to represent a \ you actually need \\ but when I paste your regex `^.*? href=\\\\'([^']+)\\'[^>]*>([^<]*)<` it says the java string is `"^.*? href=\\\\\\\\'([^']+)\\\\'[^>]*>([^<]*)<"`. Could I trouble you to explain exactly how you use that site to make the results appear in the find column? – Charlie NS May 30 '13 at 21:25
Paste the plain regex, not the Java quoted regex, and select the MULTILINE option – Old Pro May 30 '13 at 22:17
Thanks. Got it to work after a number of tries. Something was somehow screwed up in the pasting into the input field. – Charlie NS May 30 '13 at 22:41
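The points raised in these comments (flags, `find()` versus `matches()`, and backslash doubling) can be checked with a small sketch. The class `RegexFlagsDemo`, its helper methods, and the two-line `INPUT` with the `example.com` URL are hypothetical stand-ins, not part of the original answer.

```java
import java.util.regex.Pattern;

class RegexFlagsDemo {
    // Plain regex (what a tester expects):  ^.*? href=\\'([^']+)\'[^>]*>([^<]*)<
    // As a Java string literal every backslash is doubled:
    static final String REGEX = "^.*? href=\\\\'([^']+)\\'[^>]*>([^<]*)<";

    // Two-line input; the link sits on the second line.
    static final String INPUT =
            "var x = '';\n"
            + "x += \"<\"+\"a href=\\'http://example.com/\\'>hello<\"+\"/a>\";\n";

    // find() looks for the pattern anywhere in the input.
    static boolean findsWith(int flags) {
        return Pattern.compile(REGEX, flags).matcher(INPUT).find();
    }

    // matches() requires the pattern to cover the entire input.
    static boolean matchesWith(int flags) {
        return Pattern.compile(REGEX, flags).matcher(INPUT).matches();
    }

    public static void main(String[] args) {
        System.out.println(REGEX);                          // prints the plain regex
        System.out.println(findsWith(0));                   // false: ^ only matches at input start
        System.out.println(findsWith(Pattern.MULTILINE));   // true: ^ also matches at line 2
        System.out.println(findsWith(Pattern.DOTALL));      // true: . may cross the newline
        System.out.println(matchesWith(Pattern.MULTILINE)); // false: whole input must match
    }
}
```

Without any flag, `^` anchors only at the very start of the input and `.` cannot cross the newline, so the link on the second line is out of reach. That is consistent with testers failing until MULTILINE is selected. Pasting the Java-quoted string into a tester that expects a plain regex doubles the backslashes a second time, which is the effect described in the comment above.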
