Regex to extract first link on page inside another tag

Question

I've been trying to set up a simple PHP API that will essentially retrieve information from another site in two steps. If a person were to do it, it would involve:

Searching the site
Clicking on the first result
Finding the information

The site is set up in a predictable way. I know what the format of searching the site is so I can create the search URL using PHP and the input to the API.

The link for steps 1/2 is formatted like this:

<h4><a href="somelinkhere" class="search_result_title" title="sometitle" data-followable="true">Some Text Here</a></h4>

I only want the somelinkhere, the hyperlink itself. I know that it is the first hyperlink on the page contained within an <h4>.

I tried a number of Regex expressions in combo with preg_match, but they have all been failing. For example, the following is one way of doing it that failed:

$url = "https://www.example.com/?query=somequery";
$input = @file_get_contents($url) or die("Could not access file: $url");
preg_match_all('/<h4><a [^>]*\bhref\s*=\s*"\K[^"]*[^"]*/', $text, $results);
echo "$results";
echo "$results[0]";
echo "$results[0][0]";

I did the last three echoes as I'm not terribly familiar with the format preg_match_all returns. I tried preg_match as well with the same result. I only care about the first such link, so I don't need preg_match_all, but if I could just get the first result, that would work also.

What is the best way to parse the page and get the first hyperlink in the h4 into a variable?

Your regex seems to work fine, use `preg_match` instead of `preg_match_all`, [`regex demo`](https://regex101.com/r/MkK1UE/1/) — Code Maniac, Sep 17 '19 at 01:18
@CodeManiac I pasted the page code into the regex site and it does seem to work there. But my PHP page doesn't, and when I do the echo I just see "Array[0]". — InterLinked, Sep 17 '19 at 01:21
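You're not echoing properly, i.e. `<?php $re = '/<h4><a [^>]*\bhref\s*=\s*"\K[^"]*[^"]*/m'; $str = 'Some text <h4><a href="somelinkhere" class="search_result_title" title="sometitle" data-followable="true">Some Text Here</a></h4>'; preg_match($re, $str, $matches, PREG_OFFSET_CAPTURE, 0); /* Print the entire match result */ echo $matches[0][0]; ?>` — Code Maniac, Sep 17 '19 at 01:27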

score 1 · Accepted Answer · answered Sep 17 '19 at 02:20

If you only want the link from the first h4, you could modify the pattern to:

(?i)<h4><a [^>]*\bhref\s*=\s*"\s*([^"]*)\s*".*

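The inline (?i) makes the match case-insensitive. In the PHP code below, the s flag lets the trailing .* run across newlines to the end of the input, so only the first h4 link is captured.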

$re = '/(?i)<h4><a [^>]*\bhref\s*=\s*"\s*([^"]*)\s*".*/s';
$str = '<h4><a href="somelinkhere" class="search_result_title" title="sometitle" data-followable="true">Some Text Here</a></h4><h4><a href="somelinkhere" class="search_result_title" title="sometitle" data-followable="true">Some Text Here</a></h4>
';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

foreach ($matches as $match) {
    print($match[1]);
}

Output

somelinkhere

If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com, where you can also watch how it matches against sample inputs.
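For comparison, here is a minimal sketch of the preg_match route suggested in the comments, keeping the \K pattern from the question. Two things in the question's snippet look like the likely culprits: the page is read into $input but the pattern is run against $text, and echoing an array only prints "Array". The helper name firstH4Link and the sample string are made up for illustration, and the sketch assumes the markup always has the shape shown in the question.

```php
<?php
// Hypothetical helper, not part of the original post.
// Returns the href of the first <h4><a ...> link, or null if none is found.
function firstH4Link(string $html): ?string
{
    // \K drops everything matched so far, so $m[0] is just the href value.
    $pattern = '/<h4><a [^>]*\bhref\s*=\s*"\K[^"]*/i';
    if (preg_match($pattern, $html, $m) === 1) {
        return $m[0];
    }
    return null;
}

// Sample input standing in for the fetched page (assumed shape).
$input = '<h4><a href="somelinkhere" class="search_result_title" title="sometitle" data-followable="true">Some Text Here</a></h4>'
       . '<h4><a href="otherlink">Other</a></h4>';

// Pass the same variable the page was read into.
$link = firstH4Link($input);
echo $link ?? 'no match', PHP_EOL; // prints "somelinkhere"
```

preg_match stops at the first match and fills a flat array, so $m[0] is a string. preg_match_all returns nested arrays, which is why the first hit there is $results[0][0]; print_r($results) shows the structure when in doubt.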

answered Sep 17 '19 at 02:20

Emma


@mickmackusa I specifically asked for a Regex solution, and it works. I didn't see you offering anything better. – InterLinked Sep 17 '19 at 10:33
@Inter That is where you are confused. I made three offerings that were better -- they are duplicate page links to educate you. Also, just because you _think_ you need a regex solution doesn't make it the right choice. I could have offered a superior regex solution but I don't make this mistake anymore. On multiple occasions I have helped Emma. This is not about bullying anyone. This is about doing what is right for researchers and Stack Overflow. – mickmackusa Sep 17 '19 at 11:33
@Inter if you are using Emma's solution instead of writing something very similar to https://stackoverflow.com/a/4703043/2943403 then you are making a mistake and you will waste time in the future repairing the regex pattern to suit fringe case scenarios. I will only be banned from this site if I downvote Emma's suboptimal answers; this is why I am asking her to raise her game so that people learn best practices. – mickmackusa Sep 17 '19 at 11:47
@mickmackusa There are no fringe cases. This is a very specific application and it will always work. The hyperlink is always formatted that way in an h4. Now, if this were more general, that's one thing, but I'm not trying to be general. – InterLinked Sep 17 '19 at 19:32
If your html is consistently formatted in the style that you have in your question and you have complete control over the structure of the html that you are parsing, you could use this superior regex https://3v4l.org/5aj5F . If the html has different casing, quoting, attribute order, spacing, etc., uh oh. My next point is that Stack Overflow is a site that aims to help many by helping one. A regex solution may work (and it might not) for you in your scope, but fail for the great majority of future researchers. This is why those of us who know/care insist on the use of practices that are stable. – mickmackusa Sep 17 '19 at 20:42
@mickmackusa I can understand where you're coming from, but it isn't anyone's job to force his personal preferences on others. If somebody wants to use Regex, I don't see why anyone should forbid him from doing so, just because it might not be his preference. To add to that, all the questions of which you marked this as a duplicate have nothing at all to do with the answer provided here; most provide little or no mention of Regex. Considering my question was *explicitly* about Regex, this is **not** a duplicate question. And it will definitely help those who are looking for a Regex solution. – InterLinked Sep 18 '19 at 01:01
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/199594/discussion-between-mickmackusa-and-interlinked). – mickmackusa Sep 18 '19 at 03:02
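
For readers weighing the concern raised in the comments about casing, quoting, and attribute order: one common non-regex route in PHP is DOMDocument with DOMXPath. The sketch below is only an illustration of that idea and is not necessarily what the linked answers do. The function name firstH4LinkDom and the sample markup are made up for the example.

```php
<?php
// Hypothetical helper, not part of the original thread.
// Returns the href of the first <a> found inside an <h4>, or null.
function firstH4LinkDom(string $html): ?string
{
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // First <a> carrying an href inside any <h4>, in document order.
    $nodes = $xpath->query('(//h4//a[@href])[1]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->getAttribute('href');
}

// Upper-case tags, single quotes, and a different attribute order.
$sample = "<H4><A class='search_result_title' HREF='somelinkhere'>Some Text Here</A></H4>";
echo firstH4LinkDom($sample) ?? 'no match', PHP_EOL; // prints "somelinkhere"
```

The parser normalizes tag and attribute names, so the variations listed above do not need separate handling. For markup that is guaranteed to keep one fixed shape, the regex in the accepted answer remains the shorter option.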
