1

Possible Duplicates:
(PHP5) Extracting a title tag and RSS feed address from HTML using PHP DOM or Regex
Grabbing title of a website using DOM

I am trying to run through a hundred different html files on my server, and extract the titles for use in another php file.

For reference:

    <title>Generic Test Page</title>

What I need is a function that will return the string "Generic Test Page" and stick that into a global variable.

What I am doing right now is simply reading the file into an array called $lines. Foreach $lines as $line, I am testing for the string <title> ... but how do I extract only what's between the > and </title>?

My trouble is that sometimes the original developer decided to elaborate on the title: <title name=title class=title1>, or he put it on three lines instead of one. What in the world? So I can't just strip the first seven characters and the last eight characters. Which would be so nice...

Thank you!!

EllaJo
  • A solution can be found here - http://stackoverflow.com/questions/3054347/php5-extracting-a-title-tag-and-rss-feed-address-from-html-using-php-dom-or-reg – Jason McCreary May 10 '11 at 18:55
  • *(related)* [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) – Gordon May 10 '11 at 19:05
  • I would gladly use preg_match or preg_split, but I can't figure out where all the extra characters came from. For example, why doesn't preg_split(">", $line) return an array with two parts, the first before the > and the other after the >. It keeps telling me that it can't find the delimiter. Ugh... – EllaJo May 10 '11 at 19:34
  • Okay, apparently I'm not supposed to do that. I see lots of complaints, but why is it bad? – EllaJo May 10 '11 at 19:43
  • Have you not tried searching for an answer? This issue has been addressed several times: http://stackoverflow.com/questions/138313/how-to-extract-img-src-title-and-alt-from-html-using-php http://stackoverflow.com/questions/2988055/extract-title-tags-from-normal-text http://stackoverflow.com/questions/3195851/how-to-extract-a-page-title – Strong Like Bull May 10 '11 at 19:57
  • I searched, but apparently I used the wrong terms. Sorry to duplicate, will be more careful in the future. – EllaJo May 12 '11 at 02:01
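
On the preg_split question in the comments above: PHP's PCRE functions expect the pattern to be wrapped in delimiters, so a bare ">" is most likely read as an opening delimiter with nothing closing it, which would explain the delimiter complaint. A minimal sketch, assuming the sample line from the question:

```php
<?php
// Sample line taken from the question.
$line = '<title>Generic Test Page</title>';

// '/>/' is the regex ">" wrapped in "/" delimiters.
$parts = preg_split('/>/', $line);
print_r($parts);

// For a fixed string, explode() does the same job without any regex.
$same = explode('>', $line);
```

Note that the line contains two > characters, so both calls return three pieces rather than two, the last one being an empty string.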

3 Answers

4

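You need to use something like PHP Simple HTML DOM Parser:

    // Third-party library; adjust the path to wherever you saved it.
    require_once 'simple_html_dom.php';

    function get_page_title($html_file) {
        $html = file_get_html($html_file);
        if (!$html) {
            return null; // file could not be read or parsed
        }
        $title = $html->find('title', 0);
        return $title ? trim($title->plaintext) : null;
    }
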
scurker
  • Suggested third party alternatives to [SimpleHtmlDom](http://simplehtmldom.sourceforge.net/) that actually use [DOM](http://php.net/manual/en/book.dom.php) instead of String Parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html), [QueryPath](http://querypath.org/) and [FluentDom](http://www.fluentdom.org). – Gordon May 10 '11 at 19:05
  • Awesome! I wasn't aware of all of those alternatives, so that gives me something to compare against what I use currently. – scurker May 10 '11 at 19:16
  • You're welcome. Also see the related link given below the question. – Gordon May 10 '11 at 19:17
  • I figured out the DOM coding. Thank you so much for your help! – EllaJo May 11 '11 at 03:17
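
For readers landing here, a minimal sketch of the DOM approach referred to in the comments above, using PHP's bundled DOM extension rather than a third-party library. The function name `title_from_html` and the sample markup are assumptions made for illustration, not the asker's actual code:

```php
<?php
// Hypothetical helper: returns the <title> text of an HTML string, or null.
function title_from_html(string $html): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML is often invalid; keep parser warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = $doc->getElementsByTagName('title');
    if ($nodes->length === 0) {
        return null;
    }
    // Collapse the line breaks of a multi-line title into single spaces.
    return trim(preg_replace('/\s+/', ' ', $nodes->item(0)->textContent));
}

// A title with extra attributes, spread over three lines as in the question.
$html = "<html><head><title name=title class=title1>\nGeneric Test Page\n</title></head><body></body></html>";
echo title_from_html($html), "\n";
```

For files on disk, the string would come from something like `file_get_contents($path)`. Because the parser works on the whole document, attributes on the tag and line breaks inside it need no special handling.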
2

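Here $line is each line of the file in turn:

    function extract_title($line) {
        // [^>]* skips any attributes; "i" ignores case, "s" lets . match newlines
        $pattern = '/<title[^>]*>(.*?)<\/title>/is';
        return preg_match($pattern, $line, $match) ? trim($match[1]) : null; // your title!
    }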

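Or just use the pattern on the whole HTML (for example the string returned by file_get_contents()) and return the match. That also covers a title that is split across several lines, which a line-by-line test cannot see.

Or use something like what scurker has suggested.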

Paolo_Mulder
  • Please will you tell me what all the slashes and stars and parentheses mean? and do you need to define $match as an array, or is it automatically an array when it's stuck in as an argument? – EllaJo May 10 '11 at 19:52
  • Sure: * means zero or more. / is the delimiter that marks the start and end of the expression, so you put \ in front of a literal slash to match it ( \/ ). [^>]* means get all characters which are not >, so in <title sdgsdsdg sd..sdgsdgsd> the "sdgsdsdg sd..sdgsdgsd" part would get skipped. Check out some tutorials: http://www.regular-expressions.info/tutorial.html – Paolo_Mulder May 10 '11 at 20:06
  • $match is just the name I gave to the array to store the "matches". You can name it whatever you want in the function: preg_match($pattern,$source,$ARRAY WITH RESULTS); it is always good to define the array before ( $match=array() ). See http://nl3.php.net/manual/en/function.preg-match.php – Paolo_Mulder May 10 '11 at 20:09
  • This worked when the title was all on one line, but when the developer split it into three lines, the code broke. I think I just figured out why not to use regex. :-) Thank you for giving me a starting point and for telling me what everything means. You're very helpful. – EllaJo May 10 '11 at 20:13
  • Could you provide an example ? regex will do but the pattern probably has to change. – Paolo_Mulder May 10 '11 at 20:15
  • Great job this pattern works for me. Even titles split over multiple lines - like Technorati site. Thanks. – HGPB Oct 05 '11 at 00:52
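
The difference between the two reports above seems to come down to what the pattern is applied to. A minimal sketch, assuming the answer's pattern and a made-up three-line title; the helper name `title_by_regex` is hypothetical:

```php
<?php
// Hypothetical helper: applies the answer's pattern to a whole HTML string.
function title_by_regex(string $html): ?string
{
    // "s" lets . match newlines, so the capture can span several lines.
    $pattern = '/<title[^>]*>(.*?)<\/title>/is';
    if (preg_match($pattern, $html, $match) !== 1) {
        return null;
    }
    // Collapse line breaks and indentation inside the title.
    return trim(preg_replace('/\s+/', ' ', $match[1]));
}

// A title with attributes, split over several lines.
$html = "<head>\n<title name=title class=title1>\n  Generic\n  Test Page\n</title>\n</head>";

// Whole document at once: the opening and closing tags are both visible.
var_dump(title_by_regex($html));

// Line by line: no single line holds both tags, so nothing matches.
foreach (explode("\n", $html) as $line) {
    var_dump(title_by_regex($line));
}
```

Reading each file with `file_get_contents()` instead of `file()` gives the single string the first call needs. A regex like this still treats HTML as plain text, so a DOM parser remains the more robust choice for messy markup.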
0

You should use a regular expression to extract the inner part. More info here

Brian Geihsler