I'm trying to get the titles of articles from websites. It works for some websites but not for all of them.

I've tried it with BBC News and it works. When I tried it with a NYTimes article, it doesn't return the right title, even though I can see in the page source that the title tag exists with the correct title.

Here is the code

$titre = preg_match('/<title>(.+)<\/title>/',file_get_contents($url),$matches);
echo $matches[1];

When I try http://www.bbc.com/news/business-30512079 it works.

When I try a New York Times article URL, it gives "Log In - The New York Times".

klark
  • Look at the output from file_get_contents. – Matt K Dec 17 '14 at 18:29
  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Mike Dec 17 '14 at 18:42

1 Answer

New York Times uses a paygate which will redirect you to a login / sign up page after a certain number of requests. I'm guessing your scraper is hitting this paygate.
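One way to check this guess is to look at where the request actually ends up. The following is only a minimal sketch, assuming the cURL extension is available and that the login page is reached through an ordinary HTTP redirect (the site could also do this another way). The names fetchPage and wasRedirected are made up for illustration.

```php
<?php
// Hypothetical helper: fetch a URL with cURL and report where the request ended up.
function fetchPage(string $url): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 10,
        CURLOPT_TIMEOUT        => 15,
    ]);
    $body = curl_exec($ch);

    return [
        'body'      => $body === false ? null : $body,
        'status'    => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        // URL of the last request made, after any redirects were followed
        'final_url' => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
    ];
}

// Hypothetical helper: true when the final URL differs from the requested one
// (a trailing slash is ignored; an http -> https hop also counts as a redirect).
function wasRedirected(string $requestedUrl, string $finalUrl): bool
{
    return rtrim($requestedUrl, '/') !== rtrim($finalUrl, '/');
}
```

If wasRedirected($url, fetchPage($url)['final_url']) is true and the final URL points at a login page, the regex is working correctly; it is simply being run against the login page's HTML rather than the article's.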

AlpineCoder
  • I'm just reading the HTML page, and when I go to the site 'manually', it does ask me to log in. Do you think they automatically detect how the page is being accessed and redirect me? – klark Dec 17 '14 at 18:31
  • I think their paygate implementation is based on IP / UA string identification to redirect clients to the paygate page after some number (5 maybe?) of requests to the article pages. Your scraper is getting redirected to the login page on the server, and never sees the article page / title. – AlpineCoder Dec 17 '14 at 18:33
  • I'm not trying to scrape; I just want to fetch part of an article, the way Facebook does. I've seen code for that, but it doesn't work for all kinds of websites. – klark Dec 18 '14 at 18:19
  • Facebook also uses a scraper to generate the metadata for the links it displays. However, it doesn't work with all websites either. It identifies itself with a particular user agent string, and most sites (who want to have content shared on FB) allow access to their pages from the FB bot. – AlpineCoder Dec 18 '14 at 18:30
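
The idea in the last comment can be sketched roughly as follows. This is an illustration under assumptions, not Facebook's actual implementation: it sends a user agent string of your own choosing and prefers the Open Graph og:title meta tag, falling back to the title tag when it is absent. The function names and the user agent value are hypothetical, and sites with a paywall may still return a login page.

```php
<?php
// Hypothetical helper: download a page while identifying the client
// with its own User-Agent string.
function fetchHtml(string $url, string $userAgent): ?string
{
    $context = stream_context_create([
        'http' => [
            'user_agent'      => $userAgent,
            'follow_location' => 1,  // follow redirects (the default)
            'timeout'         => 15, // seconds
        ],
    ]);
    $html = @file_get_contents($url, false, $context);
    return $html === false ? null : $html;
}

// Hypothetical helper: prefer <meta property="og:title">, fall back to <title>.
function extractPreviewTitle(string $html): ?string
{
    if (trim($html) === '') {
        return null; // DOMDocument::loadHTML rejects empty input
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on real-world HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('property')) === 'og:title') {
            $content = trim($meta->getAttribute('content'));
            if ($content !== '') {
                return $content;
            }
        }
    }
    $titles = $doc->getElementsByTagName('title');
    return $titles->length > 0 ? trim($titles->item(0)->textContent) : null;
}
```

A call might look like extractPreviewTitle(fetchHtml($url, 'MyLinkPreview/1.0') ?? ''). Using a DOM parser instead of a regex also copes with a title tag that has attributes or spans several lines.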