
I wrote some code that uses cURL to fetch a website's content, the way Facebook and Google+ do when you share a link:

    // file_get_contents_curl() is my own cURL helper (not shown here).
    $html = file_get_contents_curl($url);

    if ($html) {
        // Parsing begins here.
        $doc = new DOMDocument();
        @$doc->loadHTML($html);
        $nodes = $doc->getElementsByTagName('title');

        // Get and display what you need.
        // Guard against pages that have no <title> element at all.
        $title = $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;
        $metas = $doc->getElementsByTagName('meta');

        for ($i = 0; $i < $metas->length; $i++) {
            $meta = $metas->item($i);
            if ($meta->getAttribute('name') == 'description') {
                $description = $meta->getAttribute('content');
            }
        }

........

I get the page title with `$title = $nodes->item(0)->nodeValue;`, but what I need is the title of the news item or content itself (my URLs are not always news websites). I don't want to restrict myself to particular sites; I want to get the content title from any website.

For example, it should return:

"Guantanamo must close during Obama's term - Russian Foreign Ministry" for http://rt.com/news/guantanamo-closure-russia-dolgov-245/

"France to shed ‘Amélie’ image after 50 years of China ties" for http://www.france24.com/en/20140127-france-seeks-shed-amelie-image-50-years-after-opening-ties-with-china/

"Defining Moments: Capturing our changing world" for http://edition.cnn.com/2013/05/01/world/defining-moments/index.html?hpt=hp_bn3

I know the usual way is to fetch the H1 or H2 tags, but I also need the title from sites that do not mark up the news title with those and use a <div> tag instead, for example:

http://www.mehrnews.com/detail/News/2222373

Update: I tested this link, and some other URLs whose headline is not in <h?> tags, in Google+, and Google returned the title correctly. Does anybody know how it works?

Yuseferi
  • Do you mean the title of the page as it appears in the browser's title bar, or the headline of the page? The one you can get by grabbing the `<title>` element, and I imagine that the other is probably the first `<h1>` tag in the page, though that's less reliable – andrewsi Jan 27 '14 at 18:38
  • `I need to fetch the title of news or content title(my url always not news websites) ,I dont want to restrict myself to some site, I want get the title of content of websites.` The title on different sites may be in different places, you can't guarantee its location. Although it'll most likely be in the `h1` tag on each page, so just extract it from there? – SubjectCurio Jan 27 '14 at 18:39
  • @andrewsi I mean the headline – Yuseferi Jan 27 '14 at 18:42
  • @MLeFevre I think a good way may exist, because Facebook and G+ do this correctly – Yuseferi Jan 27 '14 at 18:43
  • @zhilevan looks like this other question is doing what you're after http://stackoverflow.com/questions/11733876/how-to-get-content-ot-remote-html-page – SubjectCurio Jan 27 '14 at 18:46
  • @zhilevan no, they don't just take any random site and parse it perfectly. I assure you, there's no magic way of doing that. You can only take a good guess unless the sites tell you exactly what to expect. – m59 Jan 27 '14 at 18:54
  • @MLeFevre I updated the question, please look at it again – Yuseferi Jan 27 '14 at 20:10
  • @m59 Google+ does it correctly. How? – Yuseferi Jan 27 '14 at 20:10
  • @zhilevan I don't know what you're talking about. Do you mean when you post a link in a comment? – m59 Jan 27 '14 at 20:15
  • @zhilevan the way they do it accurately is they use the `title` meta tag. All 4 websites you linked have the title meta tag as is the correct practice. You asked how to do it otherwise (which means guessing) and I showed you. Your question seems to be about a problem that doesn't exist. – m59 Jan 27 '14 at 20:21
  • @m59 No. Go to Google+ and, in `Share what's new...`, paste one of the links whose headline is in a `<div>` tag and you will see what I mean: Google+ gets the headline correctly – Yuseferi Jan 28 '14 at 05:23
  • @m59 Look at this link: `http://www.mehrnews.com/detail/News/2222373`. I inspected the HTML and I don't see any title meta tag that is exactly the news headline! – Yuseferi Jan 28 '14 at 05:26
  • @zhilevan I just tested that out in Google+ and it pulls EXACTLY what is in the `<title>` tag into the post. Every site you linked is displaying the title according to the `title` tag. – m59 Jan 28 '14 at 05:30
  • @m59 I tested it just now, and it returns the title tag. My client told me that it returns the headline correctly, and when I tested some links it did (but I am not sure those didn't have `<h1>` tags). You are right, thanks for your attention and replies. :) – Yuseferi Jan 28 '14 at 05:43

1 Answer


Just note that the correct and reliable way is to use the <title> tag. As you asked how you could get the title from other elements, my answer will deal with that.

I don't think there will be any completely reliable way of doing this, but you could have your script take a good guess about what the title is.

The titles you want from each link you posted share a common pattern:

  1. The title is within an <h1> tag.
  2. That <h1> tag is the first <h1> on the page that contains ONLY text.

Therefore, a reliable script for those examples would find all the <h1> tags on the page, then consider the "title" to be the first one that contains only text (no nested html elements).
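The heuristic just described can be sketched roughly as follows. This is only an illustration, not a tested scraper: `guess_headline()` is a hypothetical name, and the XPath `//h1[not(*)]` is my reading of "contains only text" (an `<h1>` with no child elements).

```php
<?php
// Hypothetical helper: returns the text of the first <h1> that has no child
// elements and is not empty, or null when the page has no such <h1>.
function guess_headline(string $html): ?string
{
    if (trim($html) === '') {
        return null; // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    // Collect libxml parse warnings quietly instead of using the @ operator.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // h1[not(*)] matches <h1> elements that have no element children.
    foreach ($xpath->query('//h1[not(*)]') as $h1) {
        $text = trim($h1->textContent);
        if ($text !== '') {
            return $text;
        }
    }
    return null;
}
```

When it returns null, the caller can fall back to the `<title>` value that the code in the question already extracts.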

As you are now adding more links in the comments: don't expect an answer that will work for everything. The best anyone can do is inspect the elements, search through the page source, and try to find a pattern that identifies the titles. For the links in your post, I demonstrated this. If you want to find titles that are displayed in other ways, you'll have to keep adding checks for each scenario and hope for the best.
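One way to keep those per-scenario checks manageable is an ordered list of XPath queries that are tried until one yields text. The sketch below is only an assumption-laden illustration: `first_match()` and `TITLE_CHECKS` are hypothetical names, the order of the checks is my own choice, and the two `<meta>` queries (`og:title` and `name="title"`) are guesses at what a page might expose, not something confirmed for the sites in the question.

```php
<?php
// Hypothetical, ordered list of checks. Add site-specific queries as needed.
const TITLE_CHECKS = [
    '//meta[@property="og:title"]/@content', // assumed Open Graph title
    '//meta[@name="title"]/@content',        // assumed "title" meta tag
    '//h1[not(*)]',                          // first text-only <h1>
    '//title',                               // last resort: the page title
];

// Returns the first non-empty text produced by any query, or null.
function first_match(string $html, array $queries = TITLE_CHECKS): ?string
{
    if (trim($html) === '') {
        return null; // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($queries as $query) {
        $nodes = $xpath->query($query);
        if ($nodes === false) {
            continue; // skip a malformed query rather than abort
        }
        foreach ($nodes as $node) {
            $text = trim($node->textContent);
            if ($text !== '') {
                return $text;
            }
        }
    }
    return null;
}
```

A site that keeps its headline in a `<div>` would need its own entry, for example a query on that site's particular class name, placed ahead of the generic checks.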

m59
  • Yes, I think so, but most of the time they use css to big the title instead of using `<h1>` tags. – Yuseferi Jan 27 '14 at 18:41
  • @zhilevan no, most of the time they are using the `h1` tag. It's beneficial for the website owner to do it that way. Chances are, if the website is using `css to big` the main title instead of placing it in an `h1` tag, it's a website not worth mentioning. – SubjectCurio Jan 27 '14 at 18:44
  • @MLeFevre See this one, where the news title is in a `<div>` tag: http://www.mehrnews.com/detail/News/2222373 – Yuseferi Jan 27 '14 at 18:49
  • @zhilevan Literally in the websites you linked in your question, that's what the titles are in. They are the first `<h1>` tags on the page that only contain text. If you wanted a more specific answer, you should include the relevant links in your post. It's unfair to keep adding demands. – m59 Jan 27 '14 at 18:51
  • @zhilevan It's probably an unfair remark, but I believe that's an example of a website not worth wasting your time on. @m59's answer is correct, *good* websites will keep to conventions and make use of the `h1` tag or within `<title>`. – SubjectCurio Jan 27 '14 at 19:00
  • @MLeFevre I agree with you, but my client insists on doing it for some sites that, when I check them, do not put the title in an `h1`, and he asks me how Facebook and Google+ fetch them correctly. – Yuseferi Jan 27 '14 at 19:06