I am trying to extract multiple URLs from an HTML file with a regex. The HTML code looks like this:

<h1 class="article"><a href="http://www.domain1.com/page-to-article1" onmousedown="return(...)
<h1 class="article"><a href="http://www.domain2.com/page-to-article2" onmousedown="return(...)
<h1 class="article"><a href="http://www.domain3.com/page-to-article3" onmousedown="return(...)
<h1 class="article"><a href="http://www.domain3.com/page-to-article4" onmousedown="return(...)

I would like to extract only the URLs between <h1 class="article"><a href=" and " onmousedown="return(...), e.g. http://www.domain1.com/page-to-article1, http://www.domain2.com/page-to-article2, http://www.domain3.com/page-to-article3, and so on.

Kris

1 Answer

As already answered and commented, you shouldn't use regexes for this task. However, if you really insist on it, you could use this regex:

/\<h1 class="article"\>\<a href="([^"]*)" onmousedown="return/

A walkthrough of the creation of this regex:

  1. Well, what are you actually looking for? Something like this line:

    <h1 class="article"><a href="http://www.domain1.com/page-to-article1" onmousedown="return
    
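  2. Characters that have a special meaning in regexes must be escaped with a backslash (\) to be matched literally. In PCRE, the flavor PHP uses, < and > are not actually special, so escaping them is optional; a backslash in front of a non-alphanumeric character is harmless, though, and this answer keeps it: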

    \<h1 class="article"\>\<a href="http://www.domain1.com/page-to-article1" onmousedown="return
    
  3. This would only match the URL that's already in the regex. We want to match any URL. What can a URL look like in this context? That's hard to say, as URLs come in many different forms.

    One easy description would be: a URL is a bunch of text that doesn't contain the " character (as that would end the href attribute of your <a> tag). In regex, this would be [^"]: it matches any character except for ".

    We are not done yet: a URL is not just one character other than ", but a whole run of them. Therefore we add an asterisk (*) after the pattern ([^"]), which repeats the preceding element zero or more times. This results in [^"]*. Now URLs of any length can be matched.

    We should not forget that we actually want to extract the URL from the text (and not only match/detect it). When you define a group, the text it matched is returned separately. You define a group by putting part of the pattern in parentheses. The result: ([^"]*).

    Now we can substitute this into the pattern we started with:

    \<h1 class="article"\>\<a href="([^"]*)" onmousedown="return
    
  4. PHP's preg functions expect the pattern to be wrapped in delimiters, most commonly slashes. The delimiters do not change what is matched: the pattern can still match anywhere within a line. To require a match on a whole line, you would anchor the pattern with ^ and $ (plus the m modifier for multi-line input), which we don't want here. So we only add the delimiters:

    /\<h1 class="article"\>\<a href="([^"]*)" onmousedown="return/
    
  5. In the last step, we can add modifiers. These act like options the regex engine applies when matching your pattern. We add the i modifier to make the pattern case-insensitive:

    /\<h1 class="article"\>\<a href="([^"]*)" onmousedown="return/i
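
The steps above can be tried out one at a time. The following is a minimal sketch, assuming PHP with the bundled PCRE functions; the variable names ($line, $sameWithHash, $wholeLine and so on) are made up for illustration, and the sample line is the first one from the question.

```php
<?php
// Sample line taken from the question.
$line = '<h1 class="article"><a href="http://www.domain1.com/page-to-article1" onmousedown="return(...)';

// Step 3: the group ([^"]*) captures everything up to the next double quote.
preg_match('/\<h1 class="article"\>\<a href="([^"]*)" onmousedown="return/', $line, $m);
$url = $m[1];

// Step 4: the slashes are only delimiters. Another delimiter such as #
// behaves the same and is handy when the pattern itself contains slashes.
$sameWithHash = preg_match('#\<h1 class="article"\>\<a href="([^"]*)"#', $line);

// Anchoring with ^ and $ demands that the pattern cover the whole string.
// That fails here, because more text follows after "return".
$wholeLine = preg_match('/^\<h1 class="article"\>\<a href="([^"]*)" onmousedown="return$/', $line);

// Step 5: without the i modifier an upper-cased copy no longer matches.
$upper    = strtoupper($line);
$withoutI = preg_match('/\<h1 class="article"\>/', $upper);
$withI    = preg_match('/\<h1 class="article"\>/i', $upper);

var_dump($url, $sameWithHash, $wholeLine, $withoutI, $withI);
```

preg_match returns 1 when the pattern matches and 0 when it does not, so the last four values show directly which variants succeed.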
    

I recommend taking a look at a regex cheat sheet and trying to understand what's happening in regexes. Add it to your bookmarks (or print it). Whenever you encounter a regex or need to write one yourself, consult it. Regexes seem like difficult magic if you're new to them, but they're very convenient once you learn to use them properly.


Example use:

<?php

$html = <<<EOF
<h1 class="article"><a href="http://www.domain1.com/page-to-article1" onmousedown="return(...)
<h1 class="article"><a href="http://www.domain2.com/page-to-article2" onmousedown="return(...)
<h1 class="article"><a href="http://www.domain3.com/page-to-article3" onmousedown="return(...)
<h1 class="article"><a href="http://www.domain3.com/page-to-article4" onmousedown="return(...)
EOF;

preg_match_all('/\<h1 class="article"\>\<a href="([^"]*)" onmousedown="return/i', $html, $matches);

print_r($matches[1]);
// Array
// (
//     [0] => http://www.domain1.com/page-to-article1
//     [1] => http://www.domain2.com/page-to-article2
//     [2] => http://www.domain3.com/page-to-article3
//     [3] => http://www.domain3.com/page-to-article4
// )

?>
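
Since the answer opens by saying regexes are not the right tool here, a parser-based variant may be useful for comparison. This is only a sketch, assuming the DOM extension (DOMDocument, DOMXPath) is enabled; the sample markup is a closed-off version of two lines from the question, and the link texts "One" and "Two" are invented so that the elements are complete.

```php
<?php
// Hypothetical sample: two lines from the question, completed with
// made-up link text and closing tags so they form whole elements.
$html = <<<EOF
<h1 class="article"><a href="http://www.domain1.com/page-to-article1" onmousedown="return(...)">One</a></h1>
<h1 class="article"><a href="http://www.domain2.com/page-to-article2" onmousedown="return(...)">Two</a></h1>
EOF;

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them;
// real-world HTML is rarely perfectly valid.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$urls = [];
// Select the href attribute of every <a> directly inside an <h1>
// whose class attribute is exactly "article".
foreach ($xpath->query('//h1[@class="article"]/a/@href') as $attr) {
    $urls[] = $attr->nodeValue;
}

print_r($urls);
```

Unlike the regex, this does not depend on the attribute order or on onmousedown directly following href. The XPath test @class="article" is an exact comparison, so it would need adjusting for elements that carry several classes.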
Jonathan