
How can I retrieve the URL from an HTML link whose title begins with a specific string?

For example, given:

<a href="http://urltoretrieve.ext/" title="specific title rest of all title">something</a>
<a href="http://otherurl.ext/" title="a generic title">somethingelse</a>

and use PHP to retrieve:

http://urltoretrieve.ext/

Thanks!

Gordon
DavideCariani
  • http://php.net/manual/en/class.domxpath.php – Hannes Jan 09 '12 at 13:27
  • `my $url='http://urltoretrieve.ext/` – Toto Jan 09 '12 at 13:29
  • If this was tagged differently, say [tag:querypath], then `htmlqp($html)->find('a[title^="specific"]')->attr("href")` would be very easy. – mario Jan 09 '12 at 13:36
  • @mario put it as an answer. I exchanged the regex tag for html-parsing since the OP doesn't mention regex in the question at all, so I'd assume the OP just assumed regex is the right approach for it. – Gordon Jan 09 '12 at 14:00

1 Answer


You can use https://gist.github.com/1358174 and this XPath:

//a[starts-with(@title, "specific title")]/@href

This query means:

//a                      find all a elements in the html
[                        that
starts-with(             
    @title               has a title attribute
    "specific title"     starting with this value
)                        
]                        
/@href                   and return their href attribute

Example (demo):

$result = xpath_match_all(
    '//a[starts-with(@title, "specific title")]/@href', 
    $yourHtmlAsString
);

Output:

array(2) {
  [0]=>
  array(1) {
    [0]=>
    string(38) "<href>http://urltoretrieve.ext/</href>"
  }
  [1]=>
  array(1) {
    [0]=>
    string(25) "http://urltoretrieve.ext/"
  }
}

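The result is an array containing the serialized outerHTML (index 0) and innerHTML (index 1) of the found attribute nodes. For an attribute node, the innerHTML is simply the attribute value, so the URL you want is in $result[1]. If you don't understand what a node is, check the DOMDocument documentation in the PHP manual.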
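If you would rather not depend on the gist, roughly the same lookup can be sketched with PHP's built-in DOMDocument and DOMXPath classes. This is a minimal sketch, not the code from the gist: the variable names ($html, $dom, $urls) are my own, and the markup is the sample from the question.

```php
<?php
// Sample markup taken from the question.
$html = <<<'HTML'
<a href="http://urltoretrieve.ext/" title="specific title rest of all title">something</a>
<a href="http://otherurl.ext/" title="a generic title">somethingelse</a>
HTML;

// Collect libxml parse warnings internally instead of emitting them,
// since real-world HTML is rarely well-formed.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Same XPath as above: href attributes of all <a> elements whose
// title attribute starts with "specific title".
$urls = [];
foreach ($xpath->query('//a[starts-with(@title, "specific title")]/@href') as $attr) {
    $urls[] = $attr->nodeValue; // the attribute value, i.e. the URL
}

print_r($urls);
```

With the sample markup, $urls should end up holding the single string http://urltoretrieve.ext/, because only the first link's title starts with "specific title".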

Also see How do you parse and process HTML/XML in PHP?

Gordon
  • Ha, no need. This already answers it nicely. Cool mini function! – mario Jan 09 '12 at 14:01
  • Why does the result give me 2 arrays? – DavideCariani Jan 09 '12 at 14:11
  • @davelab serialized innerHTML and outerHTML of the found attribute nodes, because the function cannot know which one you want. If that is not what you want, you have to learn how to use DOM. The source code in the gist is a good starting point and there are also lots of examples on Stack Overflow. – Gordon Jan 09 '12 at 14:16
  • OK, for the XPath solution: how can I exclude the outerHTML array? – DavideCariani Jan 09 '12 at 14:28
  • @davelab it's an array, so just access the index for innerHTML if you are only interested in the innerHTML string/the value of the attribute, e.g. $result[1][0] – Gordon Jan 09 '12 at 14:38
  • OK, I solved it with XPath, thank you. – DavideCariani Jan 11 '12 at 08:28
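
For the follow-up in the comments about keeping only the URL, here is a small sketch of the index access. It assumes the result has exactly the shape shown in the answer's output; the $result literal below is copied from that output rather than produced by calling the gist function, and $firstUrl is just an illustrative name.

```php
<?php
// Result array copied from the output shown in the answer (assumed shape):
// index 0 holds the outerHTML strings, index 1 the innerHTML strings.
$result = [
    ['<href>http://urltoretrieve.ext/</href>'],
    ['http://urltoretrieve.ext/'],
];

// Keep only the bare attribute values (the URLs).
$urls = $result[1];

// First match, or null when the query matched nothing.
$firstUrl = $urls[0] ?? null;

echo $firstUrl, PHP_EOL;
```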