I have a newsfeed link from an Indian newspaper as follows:

https://www.hindustantimes.com/rss/cities/delhi/rssfeed.xml

I am trying to extract some information from it using PHP and SimpleXML:

            $feedURL="https://www.hindustantimes.com/rss/cities/delhi/rssfeed.xml";

            $array = get_headers($feedURL);
            $statusCode = $array[0];
            echo('<br>'.$statusCode.'<br>');

            if (strpos($statusCode, "404") === false) {
                echo('Reading <a href="' . $feedURL . '">' . $feedURL . '</a><br>');
                $out = htmlspecialchars(file_get_contents($feedURL), ENT_QUOTES);
                echo($out);
                if (stripos($out, "&lt;feed ") !== false) {
                    $feedType = 'ATOM';
                    $countATOM += 1;
                } else if (stripos($out, "&lt;rss") !== false) {
                    $feedType = 'RSS';
                    $countRSS += 1;
                } else {
                    $feedType = 'UNREADABLE';
                    $countUNREADABLE += 1;
                }

                echo('<br>' . $feedType . '<br>');
                echo('<br>-------------------------------------------------------------------------<br>');
                if ($feedType == 'ATOM') {
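                    libxml_use_internal_errors(true); // collect parse errors so libxml_get_errors() below can report them
                    $xmlOut = simplexml_load_string(file_get_contents($feedURL));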
                    echo($xmlOut.'<br>-------------------------------------------------------------------------<br>');
                    if ($xmlOut === false) {
                        echo("Failed loading XML: ");
                        foreach (libxml_get_errors() as $error) {
                            echo ("<br>" . $error->message);
                        }
                    } else {
                        foreach ($xmlOut->entry as $entry) {
                            if (isset($entry->title) && isset($entry->link) && isset($entry->updated) && isset($entry->summary)){
                                $title=$entry->title;
                                $link=$entry->link['href'];
                                $updated=$entry->updated;
                                $summary=$entry->summary;
                                if(isImportantNews($title) || isImportantNews($summary)){
                                    $insertNewsCmd=$insertNewsCmd
                                            ."('".$link."',"
                                            ."'".stripSpecialChars($title)."',"
                                            ."'".setDate($updated)."'),";
                                }
                            }
                            echo($entry->updated . "<br>");
                        }
                        
                    }
                } elseif ($feedType == 'RSS') {
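                    libxml_use_internal_errors(true); // collect parse errors so libxml_get_errors() below can report them
                    $xmlOut = simplexml_load_string(file_get_contents($feedURL));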
                    print_r($xmlOut);
                    echo('<br>-------------------------------------------------------------------------<br>');
                    if ($xmlOut === false) {
                        echo("Failed loading XML: ");
                        foreach (libxml_get_errors() as $error) {
                            echo ("<br>" . $error->message);
                        }
                    } else {
                        foreach ($xmlOut->channel->item as $item) {
                            if (isset($item->title) && isset($item->link) && isset($item->description) && isset($item->pubDate)) {
                                $title = $item->title;
                                $link = $item->link;
                                $descr = $item->description;
                                $pubDate = $item->pubDate;
                                echo($title.'<br>'.$link.'<br>'.$descr.'<br>');
                                echo('<br>-------------------------------------------------------------------------<br>');
                                if(isImportantNews($title) || isImportantNews($descr)){
                                    $insertNewsCmd=$insertNewsCmd
                                            ."('".$link."',"
                                            ."'".stripSpecialChars($title)."',"
                                            ."'".setDate($pubDate)."'),";
                                }
                                
                                echo($item->pubDate . "<br>");
                            }
                        }
                    }
                } else {
                    continue;
                }
                break;
            } else {
                echo($feedURL . ' encountered problems being read...' . '<br>');
            }

In short, once the program has determined whether the feed is ATOM or RSS, it extracts each item's title and summary (or description) and uses the isImportantNews() function to decide whether the item is important news. If so, I store it in a database.
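As an aside on the feed-type step described above, here is a minimal sketch, assuming the feed body has already been fetched into a string, of classifying the feed from its parsed root element rather than searching the escaped text. The function name `detectFeedType` is hypothetical and not part of the program above.

```php
<?php
// Hypothetical helper: classify a feed by the name of its XML root element.
function detectFeedType(string $xml): string
{
    // Collect parse errors quietly instead of emitting warnings,
    // then restore whatever setting was active before.
    $previous = libxml_use_internal_errors(true);
    $doc = simplexml_load_string($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($doc === false) {
        return 'UNREADABLE';
    }

    // getName() gives the name of the root element.
    switch (strtolower($doc->getName())) {
        case 'feed':
            return 'ATOM';
        case 'rss':
            return 'RSS';
        default:
            return 'UNREADABLE';
    }
}
```

This mirrors the three outcomes used above (ATOM, RSS, UNREADABLE) and also lets the parsed document be reused instead of downloading the feed a second time.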

My problem is that if I open the above link directly in a browser, I can see the content without any issues, but reading it with the above code returns an HTTP 403 Forbidden status code.

Why is this happening, and is there a way to get around it? Being able to open the link directly suggests that the 403 may be triggered by the programmatic access attempt, but I am not certain. I also tried the following ways of reading it, all of which fail in the same way:

    echo('read file ####################################################################################################');
    echo readfile("https://www.hindustantimes.com/rss/cities/delhi/rssfeed.xml");            //needs "allow_url_fopen" enabled
    echo('<br>include ####################################################################################################');
    echo include("https://www.hindustantimes.com/rss/cities/delhi/rssfeed.xml");             //needs "Allow_url_include" enabled
    echo('<br>file get contents ####################################################################################################');
    echo file_get_contents("https://www.hindustantimes.com/rss/cities/delhi/rssfeed.xml");
    echo('<br>stream get contents####################################################################################################');
    echo stream_get_contents(fopen('https://www.hindustantimes.com/rss/cities/delhi/rssfeed.xml', "r")); //you may use "r" instead of "rb"  //needs "Allow_url_fopen" enabled
    echo('<br>get remote data ####################################################################################################');
    echo get_remote_data('https://www.hindustantimes.com/rss/cities/delhi/rssfeed.xml');
    $feedURL = "https://www.hindustantimes.com/rss/cities/delhi/rssfeed.xml";
    $out = htmlspecialchars(file_get_contents($feedURL), ENT_QUOTES);
    echo($out);

Any help or insight would be most appreciated.

  • Send a User-Agent header that mimics a current browser. – CBroe Dec 09 '20 at 07:25
  • @CBroe could you kindly elaborate? – Siddhartha Bhuyan Dec 09 '20 at 07:43
  • Setting extra request headers when using `file_get_contents` can be done using stream context options, https://stackoverflow.com/a/13969212/1427878 – CBroe Dec 09 '20 at 07:48
  • Thanks @CBroe will have a look at it and let you know how it goes – Siddhartha Bhuyan Dec 09 '20 at 08:21
  • Rule #1 of debugging: Break The Problem Down. You've shown us a whole bunch of code involving reading XML, formatting output, etc. But your actual question seems to be about fetching the URL, so you only needed to show us the line `file_get_contents($feedURL)`. You then say something about a 403 response, but don't show us where you're seeing this, or if it gives you any other output, or logs any errors. The best we can do at this point is ignore most of the details you have posted and guess at what you've left out, which isn't a particularly good use of our time (or yours!). – IMSoP Dec 11 '20 at 09:02
  • Thanks for the suggestions @IMSop . I am still learning to post questions on this forum in a way which helps in reaching a resolution ASAP. I shall incorporate the points you mentioned at the earliest – Siddhartha Bhuyan Dec 14 '20 at 06:59
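The suggestion in the comments above (send a browser-like User-Agent through stream context options) could be sketched roughly as follows. This is an illustration under assumptions, not a confirmed fix: the helper names `buildFeedContext` and `fetchFeed` are hypothetical, the User-Agent string is only a placeholder, and the server may also be filtering on criteria other than the User-Agent.

```php
<?php
// Hypothetical helper: build a stream context that sends a browser-like User-Agent.
function buildFeedContext(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'method'        => 'GET',
            'user_agent'    => $userAgent,
            'header'        => "Accept: application/rss+xml, application/xml;q=0.9, */*;q=0.8\r\n",
            'timeout'       => 10,
            // Return the body even on 4xx/5xx so the status line can be inspected.
            'ignore_errors' => true,
        ],
    ]);
}

// Hypothetical helper: fetch a feed and return [statusLine, body].
function fetchFeed(string $url, string $userAgent): array
{
    $body = @file_get_contents($url, false, buildFeedContext($userAgent));
    // The HTTP wrapper fills $http_response_header in the local scope.
    // Index 0 is the first status line; if redirects were followed,
    // later status lines appear further down the array.
    $statusLine = $http_response_header[0] ?? 'no response';
    return [$statusLine, $body];
}

// Example usage (placeholder User-Agent, shown commented out because it needs network access):
// [$status, $body] = fetchFeed(
//     'https://www.hindustantimes.com/rss/cities/delhi/rssfeed.xml',
//     'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0 Safari/537.36'
// );
// echo $status;
```

If the status line still shows 403 with a context like this, the block is probably based on something other than the User-Agent. Note that `get_headers()` also accepts a context as its third argument, so the initial status check would need the same context to see the same response.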
