Extracting string between <title> and </title> using PHP

Question

Possible Duplicates:
(PHP5) Extracting a title tag and RSS feed address from HTML using PHP DOM or Regex
Grabbing title of a website using DOM

I am trying to run through a hundred different html files on my server, and extract the titles for use in another php file.

For reference:

    <title>Generic Test Page</title>

What I need is a function that will return the string "Generic Test Page" and stick that into a global variable.

What I am doing right now is simply reading the file into an array called $lines. For each $line in $lines, I test for the string <title> ... but how do I extract only what's between the > and the </title>?

My trouble is that sometimes the original developer decided to elaborate on the title: <title name=title class=title1>, or put it on three lines instead of one. What in the world? So I can't just strip the first seven characters and the last eight characters, which would be so nice...

Thank you!!

A solution can be found here - http://stackoverflow.com/questions/3054347/php5-extracting-a-title-tag-and-rss-feed-address-from-html-using-php-dom-or-reg — Jason McCreary, May 10 '11 at 18:55
*(related)* [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) — Gordon, May 10 '11 at 19:05
I would gladly use preg_match or preg_split, but I can't figure out where all the extra characters came from. For example, why doesn't preg_split(">", $line) return an array with two parts, the first before the > and the other after the >. It keeps telling me that it can't find the delimiter. Ugh... — EllaJo, May 10 '11 at 19:34
Okay, apparently I'm not supposed to do that. I see lots of complaints, but why is it bad? — EllaJo, May 10 '11 at 19:43
Have you not tried searching for an answer? This issue has been addressed several times: http://stackoverflow.com/questions/138313/how-to-extract-img-src-title-and-alt-from-html-using-php http://stackoverflow.com/questions/2988055/extract-title-tags-from-normal-text http://stackoverflow.com/questions/3195851/how-to-extract-a-page-title — Strong Like Bull, May 10 '11 at 19:57
I searched, but apparently I used the wrong terms. Sorry to duplicate, will be more careful in the future. — EllaJo, May 12 '11 at 02:01
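
On the preg_split question in the comments above: PCRE functions in PHP expect the pattern to be wrapped in delimiters, so a bare ">" is read as an opening delimiter with nothing after it. A minimal sketch, using a sample line made up for illustration:

```php
<?php
// Sample line, invented for this illustration.
$line = '<title class=title1>Generic Test Page</title>';

// preg_split(">", $line) fails because ">" is taken as the opening
// delimiter and no closing one follows. Wrap the pattern, e.g. '/>/'.
// The third argument (limit = 2) splits on the first ">" only.
$parts = preg_split('/>/', $line, 2);

// For a fixed string no regex is needed; explode() does the same job.
$same = explode('>', $line, 2);

print_r($parts);
```

Both calls return the text before the first > and everything after it. This still breaks down once the title spans several lines, which is one reason the answers below point to a parser.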

score 4 · Accepted Answer · answered May 10 '11 at 18:57

You need to use something like PHP Simple HTML DOM Parser:

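    // file_get_html() comes from the third-party simple_html_dom.php library;
    // the include path below is an assumption about where you saved it.
    require_once 'simple_html_dom.php';

    function get_page_title($html_file) {
      $html = file_get_html($html_file);
      $title = $html->find('title', 0)->plaintext;
      return $title;
    }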

scurker

Suggested third party alternatives to [SimpleHtmlDom](http://simplehtmldom.sourceforge.net/) that actually use [DOM](http://php.net/manual/en/book.dom.php) instead of String Parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html), [QueryPath](http://querypath.org/) and [FluentDom](http://www.fluentdom.org). – Gordon May 10 '11 at 19:05
Awesome! I wasn't aware of all of those alternatives, so that gives me something to compare against what I use currently. – scurker May 10 '11 at 19:16
You're welcome. Also see the related link given below the question. – Gordon May 10 '11 at 19:17
I figured out the DOM coding. Thank you so much for your help! – EllaJo May 11 '11 at 03:17
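
For comparison, the same lookup can be sketched with PHP's built-in DOM extension, which the comments above recommend over string parsing. This is only an illustration: the function name extract_title_dom is hypothetical, the sample markup is made up, and the code assumes PHP 7.1+ with the dom extension enabled and a non-empty HTML string.

```php
<?php
// Hypothetical helper: returns the trimmed <title> text of an HTML string,
// or null when the document has no <title> element.
// Assumes $html is non-empty (loadHTML rejects an empty string).
function extract_title_dom(string $html): ?string
{
    $doc = new DOMDocument();

    // Collect libxml warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $node = $doc->getElementsByTagName('title')->item(0);
    if ($node === null) {
        return null;
    }

    // Collapse line breaks and runs of whitespace inside the title.
    return trim(preg_replace('/\s+/', ' ', $node->textContent));
}

// Title with extra attributes, split over several lines.
$html = "<html><head><title name=title class=title1>\nGeneric\nTest Page\n</title></head><body></body></html>";
echo extract_title_dom($html), "\n";
```

For files on disk, the string would typically come from file_get_contents($path). Because the parser works on the whole document, attributes on the tag and line breaks inside it need no special handling.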

score 2 · Answer 2 · answered May 10 '11 at 19:33

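Here $line is each line of the file:

    // Meant to run inside a function that receives one line of the file as $line.
    $pattern = '/<title[^>]*>(.*?)<\/title>/is';
    if (preg_match($pattern, $line, $match)) {
        return trim($match[1]); // your title!
    }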

Or just run the pattern on the whole HTML and return the match.

Or use something like what scurker has suggested.

Paolo_Mulder

Please will you tell me what all the slashes and stars and parentheses mean? and do you need to define $match as an array, or is it automatically an array when it's stuck in as an argument? – EllaJo May 10 '11 at 19:52
Sure: * means zero or more of the preceding item; / is the pattern delimiter, so you put \ in front of it ( \/ ) to match a literal slash; [^>]* matches any run of characters that are not >, so extra text inside the opening tag (like "sdgsdsdg sd..sdgsdgsd") is skipped over. Check out some tutorials: http://www.regular-expressions.info/tutorial.html – Paolo_Mulder May 10 '11 at 20:06
$match is just the name I gave to the array that stores the "matches". You can name it whatever you want in the function: preg_match($pattern,$source,$ARRAY WITH RESULTS); it is always good to define the array beforehand ( $match=array() ). See http://nl3.php.net/manual/en/function.preg-match.php – Paolo_Mulder May 10 '11 at 20:09
This worked when the title was all on one line, but when the developer split it into three lines, the code broke. I think I just figured out why not to use regex. :-) Thank you for giving me a starting point and for telling me what everything means. You're very helpful. – EllaJo May 10 '11 at 20:13
Could you provide an example? Regex will do, but the pattern probably has to change. – Paolo_Mulder May 10 '11 at 20:15
Great job this pattern works for me. Even titles split over multiple lines - like Technorati site. Thanks. – HGPB Oct 05 '11 at 00:52
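
A likely reason the line-by-line version breaks on a three-line title is that no single line contains both the opening and the closing tag, so the pattern never gets a chance to match. A minimal sketch of running the same pattern over the whole document at once; the function name extract_title_regex and the sample markup are hypothetical, and PHP 7.1+ is assumed:

```php
<?php
// Hypothetical helper: applies the pattern from the answer to a whole
// HTML string rather than to one line at a time.
function extract_title_regex(string $html): ?string
{
    // i = case-insensitive, s = "." also matches newlines,
    // so a title spread over several lines is still captured.
    $pattern = '/<title[^>]*>(.*?)<\/title>/is';
    if (preg_match($pattern, $html, $match)) {
        return trim($match[1]);
    }
    return null;
}

// Title with attributes, split over three lines.
$html = "<head>\n<title name=title class=title1>\nGeneric Test Page\n</title>\n</head>";
echo extract_title_regex($html), "\n";
```

With files on disk, file_get_contents($path) returns the whole file as one string, unlike file($path), which returns an array of lines.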

score 0 · Answer 3 · answered May 10 '11 at 18:57

You should use a regular expression to extract the inner part. More info here

Brian Geihsler
