How to extract href, title, and text data from an <a> tag with a specific class value from scraped HTML?

Question

I have this regular expression for preg_match_all() that matches correctly on regex101.com but not in my code.

The html element I'm trying to parse looks like this:

<a class="profile-link" href="CompanyProfile.aspx?PID=4813&country=211&practicearea=0&pagenum=" title="1-844-Iran-Law">Amin Alemohammad</a>

It is found in the full HTML returned by cURL. Each block looks like the following, e.g.:

<li style="opacity: 1;">
   <a class="profile-link" href="CompanyProfile.aspx?PID=4813&amp;country=211&amp;practicearea=0&amp;pagenum=" title="1-844-Iran-Law">Amin Alemohammad</a>
   <!--<a class="profile-link" href="javascript:void(0)" title="1-844-Iran-Law">Amin Alemohammad</a>-->
   <img src="/Images/Uploaded/Photos/4813_1844IranLaw.png" style="max-width:140px; max-height:140px">
   <div class="results-profile">
      <h2>Amin Alemohammad</h2>
      <p><strong>Firm:</strong> 1-844-Iran-Law <br> <strong>Country:</strong> USA</p>
   <p class="blue"><strong>Practice Area:</strong> Iranian Desk</p>
   <ul>
      <li class="tel-icon" style="opacity: 1;">Tel: +1-202-465-8692</li>
      <li class="fax-icon" style="opacity: 1;">Fax: +1-202-776-0136</li>
      <li class="email-icon">Email: <a style="position:relative; z-index:9999;" href="mailto:amin@1844iranlaw.com">amin@1844iranlaw.com</a></li>
   </ul>
   </div><!-- results profile -->
      <img class="practice-logo" src="/Images/Uploaded/Logos/4813_1844IranLaw.png" style="max-width:185px; max-height:70px;">
      <a class="results-btn contact-btn" href="CompanyProfile.aspx?PID=4813&amp;country=211&amp;practicearea=0&amp;pagenum=" title="View Full Profile">VIEW FULL PROFILE</a>
      <!--<a class="results-btn contact-btn" href="CompanyProfile.aspx?PID=4813&country=211&practicearea=0&pagenum=" title="1-844-Iran-Law">CONTACT</a>-->
      <a class="results-btn website-btn" href="http://www.1844iranlaw.com" title="www.1844iranlaw.com">VIEW WEBSITE</a>
   </li>
</li>

The regex result

Group 1.    54-58   `4813` // company profile
Group 2.    71-74   `211` // country id
Group 3.    92-93   `0` // practice area
Group 5.    115-129 `1-844-Iran-Law` // company name
Group 6.    131-147 `Amin Alemohammad` // Person's name

What I have is:

preg_match_all('/<a class="profile-link" href="CompanyProfile\.aspx\?PID=(.*?)&amp;country=([0-9]{1,}?)&amp;practicearea=([0-9]{1,10}?)&amp;pagenum=\?" title="(.*?)">(.*?)<\/a>/', $result, $match, PREG_PATTERN_ORDER);
dd($match);

which returns

array:6 [▼
   0 => []
   1 => []
   2 => []
   3 => []
   4 => []
   5 => []
]

The number of matches is correct (5 matches in the string pattern), but I can't figure out why it's returning empty values.

Thanks in advance for any help. I've tried many approaches but haven't found the correct one or seen what I'm missing.

There's a `\?` in your regex which doesn't belong in there, after `pagenum=`. When you remove it, it works fine. `/(.*?)<\/a>/` should be `/(.*?)<\/a>/` — Matt Raines, Apr 26 '18 at 15:18
It works for me using the above regex and your sample result. — Matt Raines, Apr 26 '18 at 15:28
Picking just that block of code, yes, it works. But I'm getting the whole content from the `(...)` of the curl result. Probably something might be breaking it and returning empty values — McRui, Apr 26 '18 at 15:31
I don't know what the whole content looks like, so I can't really help you debug it. Perhaps there are line feeds in the content in between `` and ``? If so, you could add the [`s` modifier](http://php.net/manual/en/reference.pcre.pattern.modifiers.php) at the end of the regex. Or try the DOMDocument answer. Regexes are [notoriously bad](https://stackoverflow.com/a/1732454/5024519) at parsing HTML. — Matt Raines, Apr 26 '18 at 15:39
Thanks Matt. I'll have to figure out which is the best way: preg_match_all or DOMDocument. — McRui, Apr 26 '18 at 15:42
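
The diagnosis in the comments above can be checked with a small sketch. It assumes the single sample anchor from the question is representative of the real page; the variable names `$broken` and `$fixed` are made up for this illustration. The only differences between the two patterns are the stray `\?` after `pagenum=` and the added `s` modifier.

```php
<?php
// Sample anchor copied from the question (entities still encoded as &amp;).
$html = '<a class="profile-link" href="CompanyProfile.aspx?PID=4813&amp;country=211&amp;practicearea=0&amp;pagenum=" title="1-844-Iran-Law">Amin Alemohammad</a>';

// Pattern from the question: the \? after pagenum= demands a literal "?" that is not in the markup.
$broken = '/<a class="profile-link" href="CompanyProfile\.aspx\?PID=(.*?)&amp;country=([0-9]{1,}?)&amp;practicearea=([0-9]{1,10}?)&amp;pagenum=\?" title="(.*?)">(.*?)<\/a>/';

// Same pattern without the stray \?, plus the s modifier so "." also matches line feeds.
$fixed = '/<a class="profile-link" href="CompanyProfile\.aspx\?PID=(.*?)&amp;country=([0-9]{1,}?)&amp;practicearea=([0-9]{1,10}?)&amp;pagenum=" title="(.*?)">(.*?)<\/a>/s';

$brokenCount = preg_match_all($broken, $html, $brokenMatch, PREG_SET_ORDER);
$fixedCount  = preg_match_all($fixed, $html, $fixedMatch, PREG_SET_ORDER);

var_dump($brokenCount); // int(0)
var_dump($fixedCount);  // int(1)
print_r($fixedMatch[0]); // full match followed by the five captured groups
```

If the fixed pattern still returns nothing on the full page, the markup there probably differs from the sample (extra whitespace, attribute order, decoded entities), which is the kind of fragility a DOM parser avoids.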

score 1 · Answer 1 · edited Apr 29 '18 at 09:02

Instead of using a regex you could use DOMDocument.

To get the values from the href attribute you could use explode and parse_str.

$html = <<<HTML
<a class="profile-link" href="CompanyProfile.aspx?PID=4813&amp;country=211&amp;practicearea=0&amp;pagenum=" title="1-844-Iran-Law">Amin Alemohammad</a>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
foreach($doc->getElementsByTagName('a') as $a) {
    if ($a->getAttribute('class') === 'profile-link') {
        $parts = explode('?', $a->getAttribute('href'), 2);
        parse_str($parts[1], $output);

        echo 'Title: ' . $a->getAttribute('title') . '<br>';
        echo 'Text: ' . $a->nodeValue . '<br>';
        echo 'PID: ' . $output['PID'];
        // etc..
    }
}

edited Apr 29 '18 at 09:02 by mickmackusa · answered Apr 26 '18 at 15:06 by The fourth bird

When using `explode()` to cut a string in half, please write the `limit` value as `2`. This ensures that `explode()` is never over performing, and it tells future developers that the intent of your code is to cut the string in half. Win-Win. Just a suggestion. – mickmackusa Apr 29 '18 at 07:59
@mickmackusa Thank you for the suggestion. Feel free to edit it! – The fourth bird Apr 29 '18 at 08:00
Thanks The fourth bird. It looks much cleaner, less expensive and more efficient. I must admit I've not worked too much with DOMDocument so far but will invest some time in it. Thanks! – McRui Apr 29 '18 at 11:38
You are welcome. You should also definitely look at the answer from @mickmackusa. – The fourth bird Apr 29 '18 at 11:44
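
To illustrate the `limit` suggestion from the comments, here is a minimal sketch. The `$href` value is hypothetical (it is not from the scraped page) and was chosen so that the query string itself contains a second `?`.

```php
<?php
// Hypothetical href whose query value itself contains a "?".
$href = 'CompanyProfile.aspx?PID=4813&next=page.aspx?x=1';

$unlimited = explode('?', $href);    // splits on every "?"
$limited   = explode('?', $href, 2); // splits on the first "?" only

// With the limit, everything after the first "?" stays together as the query string.
parse_str($limited[1], $query);

var_dump(count($unlimited)); // int(3)
var_dump(count($limited));   // int(2)
var_dump($query['next']);    // string(13) "page.aspx?x=1"
```

Without the limit, `$unlimited[1]` would hold only part of the query, so `parse_str()` would silently lose the tail of the `next` value.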

mickmackusa · Answer 2 · 2018-04-29T09:16:45.250

Code:

$dom = new DOMDocument; 
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$output = [];
foreach ($xpath->evaluate("//a[@class='profile-link']") as $node) {
    parse_str(parse_url($node->getAttribute('href'), PHP_URL_QUERY), $output);
    $output['title'] = $node->getAttribute('title');
    $output['text'] = $node->nodeValue;
}
var_export($output);

Output:

array (
  'PID' => '4813',
  'country' => '211',
  'practicearea' => '0',
  'pagenum' => '',
  'title' => '1-844-Iran-Law',
  'text' => 'Amin Alemohammad',
)

I believe this makes good use of what PHP offers: DOMDocument with XPath reliably and directly targets the qualifying tag/node, then parse_url() with parse_str() cleanly converts the query string data into the desired key-value pairs.

Now you'll have something stable with no hacky str_replace() calls or regex patterns.
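
If the scraped page contains several `profile-link` anchors, the loop above keeps only the last one, because `$output` is overwritten on each pass. The following is a sketch of one way to collect every match, not part of the original answer. The second list item (PID 9999, "Example Firm", "Jane Example") is invented for illustration, and `libxml_use_internal_errors()` is added on the assumption that real scraped HTML will trigger parser warnings.

```php
<?php
// Two-item fragment modelled on the question's markup; the second entry is invented.
$html = <<<HTML
<ul>
  <li><a class="profile-link" href="CompanyProfile.aspx?PID=4813&amp;country=211&amp;practicearea=0&amp;pagenum=" title="1-844-Iran-Law">Amin Alemohammad</a></li>
  <li><a class="profile-link" href="CompanyProfile.aspx?PID=9999&amp;country=211&amp;practicearea=0&amp;pagenum=" title="Example Firm">Jane Example</a></li>
</ul>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$rows = [];
foreach ($xpath->query("//a[@class='profile-link']") as $node) {
    $row = [];
    // The (string) cast covers hrefs without a query part, where parse_url() returns null.
    parse_str((string) parse_url($node->getAttribute('href'), PHP_URL_QUERY), $row);
    $row['title'] = $node->getAttribute('title');
    $row['text']  = trim($node->nodeValue);
    $rows[] = $row; // one associative array per matching anchor
}
var_export($rows);
```

Note that `@class='profile-link'` only matches anchors whose class attribute is exactly that string; an anchor carrying additional classes would need a different XPath test.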

thanks for using the proper functions for this, an XML parser to extract the url, parse_url to extract the query from the url, and parse_str to parse the query, you did it all **the way it should be done**, good job. :) — hanshenrik, Apr 29 '18 at 09:22

score 0 · Accepted Answer · answered Apr 26 '18 at 18:34

Well, after some time digging into the problem and analysing the whole HTML to be parsed by preg_match_all(), I got it working by adding a couple of lines that strip the \t, \r and \n characters from the HTML, since handling them in the regex didn't work.

So the solution was to add the following two lines before the preg_match_all():

(...)
$result = curl_exec($curl); // already there

$result = str_replace(["&amp;"], "&", $result); // new
$result = str_replace(["\t", "\r", "\n"], "", $result); // new
$regex = '/<a class="profile-link" href="CompanyProfile\.aspx\?PID=(.*?)&country=([0-9]{1,}?)&practicearea=([0-9]{1,}?)&pagenum=" title="(.*?)">(.*?)<\/a>/s';

preg_match_all($regex, $result, $match, PREG_SET_ORDER);

Then, instead of matching `&amp;` in the link, I forced the `&` character in the regex. It's working like a charm!

Thank you to all who gave a hand!

answered Apr 26 '18 at 18:34 by McRui

You are parsing HTML so you should be using an html parser like DomDocument for reliability. Instead of calling `str_replace()` twice, call it once and write an array of _search_ strings and an array of _replace_ strings. That said, if your answer has resolved your issue, please award your answer the green tick so that this page is deemed resolved in the system. – mickmackusa Apr 29 '18 at 03:55
Hi mickmackusa, thanks for the tips. When the answer was posted I wasn't yet able to award it the green tick. Thanks. – McRui Apr 29 '18 at 07:27
Is there only one profile-link class in the html? or are you possibly finding multiple matches in a single page? (I'm going to post a clever new approach) – mickmackusa Apr 29 '18 at 08:09
don't use str_replace, we have a proper function for decoding html, it's called [`html_entity_decode`](http://php.net/manual/en/function.html-entity-decode.php) - but yeah, don't use regex either, use a proper DOM parser. – hanshenrik Apr 29 '18 at 09:20
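
The two suggestions in these comments can be sketched as follows. The `$result` and `$encodedHref` strings are shortened stand-ins for the real cURL response, used only for illustration.

```php
<?php
// Shortened stand-in for the cURL response, with stray tab and line-break characters.
$result = "\t<a href=\"CompanyProfile.aspx?PID=4813&amp;country=211\">Amin Alemohammad</a>\r\n";

// One str_replace() call: an array of search strings and a matching array of replacements.
$cleaned = str_replace(['&amp;', "\t", "\r", "\n"], ['&', '', '', ''], $result);

// html_entity_decode() handles &amp; and every other named or numeric entity.
// It is applied to an attribute value here, not to the whole page.
$encodedHref = 'CompanyProfile.aspx?PID=4813&amp;country=211';
$decodedHref = html_entity_decode($encodedHref);

echo $cleaned, PHP_EOL;
echo $decodedHref, PHP_EOL;
```

Decoding entities across a whole HTML document before parsing it can turn escaped text such as `&lt;` into real markup, so decoding individual values is the safer choice. DOMDocument already returns decoded attribute values, which makes both steps unnecessary with the DOM-based answers.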
