I am new to data scraping and am working on scraping a page's title from a URL. I want to write a function that takes a URL/link as the request and returns the <title> </title>, og:title, og:description, and all other meta properties.

I am trying this function to scrape only the title:

/**
     * @param Request $request
     * @return \Illuminate\Http\JsonResponse
     *
     * @throws ValidationException
     */
    public function getTitle(Request $request)
    {
        $this->validate($request, [
            'link' => 'required',
        ]);

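        $link = $request->input('link');

        // Default to null so 'data' is defined even when nothing is found.
        $result = null;

        $str = @file_get_contents($link);
        if ($str !== false && strlen($str) > 0) {
            $str = trim(preg_replace('/\s+/', ' ', $str));
            // Only read the capture group if the pattern actually matched.
            if (preg_match("/\<title\>(.*)\<\/title\>/i", $str, $title)) {
                $result = $title[1];
            }
        }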

        return Response::json([
            'message' => 'Get title',
            'data'    => $result,
        ], \Symfony\Component\HttpFoundation\Response::HTTP_OK);
    }

Route:

Route::post('request-title', 'BuyShipRequestController@getTitle');

Example of what I send in the input field:

Amazon-url

and what I want in my return response:

<title>Amazon.com: Seagate Portable 2TB External Hard Drive Portable HDD – USB 3.0 for PC, Mac, PS4, &amp; Xbox - 1-Year Rescue Service (STGX2000400): Computers &amp; Accessories</title>

and

<meta name="description"/>, <meta name="title"/>, <meta name="keywords"/>, link

In the response I want only the content or value of those meta properties.

Koushik Saha
  • Does this answer your question? [How to parse HTML in PHP?](https://stackoverflow.com/questions/18349130/how-to-parse-html-in-php) It's not clear what your question is. Can you describe what `getTitle()` is currently doing incorrectly? Where do you need help? See https://stackoverflow.com/help/how-to-ask for more information. – WOUNDEDStevenJones Apr 15 '21 at 17:38
  • No, I want to get the title and all meta properties for any web link. – Koushik Saha Apr 15 '21 at 17:41
  • `getTitle()` is not incorrect; it only returns the title of a link, and sometimes it doesn't return anything. I want all meta properties as well as `link` and `title`. – Koushik Saha Apr 15 '21 at 17:43
  • I want help finding the `content` of the meta properties of any link, along with `title` and `link`. – Koushik Saha Apr 15 '21 at 17:48
  • "Is there a PHP function for this?" No. Instead, like others have said, you'll have to parse the HTML. – Chris Haas Apr 15 '21 at 17:54
  • yes parse the HTML – Koushik Saha Apr 15 '21 at 17:56
  • https://www.php.net/manual/en/function.get-meta-tags.php might work for you. Though it may not do 100% what you're attempting, so you might end up needing to parse the HTML as mentioned above anyway. – WOUNDEDStevenJones Apr 15 '21 at 18:02
  • Since your `file_get_contents()` is working and your json return is irrelevant to your task, two thirds of your snippet can be safely removed. Please create a [mcve] which provides realistic sample input, your coding attempt, and your exact desired output from the sample input (we don't want to link-chase, just give us static input data in the question). This isn't a laravel task. We don't need the route. Not only should you not use regex for html parsing, `preg_replace('/\s+/', ' ', $str)` will potentially allow `.*` to match FAR more than it should. – mickmackusa Apr 15 '21 at 21:16
  • Hope this package helps: PHP Simple HTML DOM Parser. Using this package you will be able to parse HTML simply, like jQuery. See the documentation: https://packagist.org/packages/sunra/php-simple-html-dom-parser https://simplehtmldom.sourceforge.io/ – ibra Apr 15 '21 at 22:19

1 Answer

A pretty straightforward way without the need to use external libraries would be to use XPath to query an HTML document:

| XPath expression | Result |
| --- | --- |
| //div | Returns all div tags |
| //meta | Returns all meta tags |
| //meta[@name] | Returns all meta tags having a 'name' attribute |

In PHP, XPath is available via DOMXPath. Since XPath works on a DOM tree, we'd need a DOMDocument first:

$dom = new DomDocument;
$dom->loadHTML($some_html);

$xpath = new DomXPath($dom);
$xpath->query(".//meta"); 
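As a quick check of the three expressions listed earlier, here is a minimal sketch that runs them against a small inline document and counts the matches. The sample markup and the variable names are made up for illustration.

```php
<?php
// Hypothetical sample markup, made up for this illustration.
$sample_html = '<html><head><title>Demo</title>'
    . '<meta name="description" content="A demo page">'
    . '<meta property="og:title" content="Demo OG">'
    . '</head><body><div>one</div><div>two</div></body></html>';

$dom = new DOMDocument;
$dom->loadHTML($sample_html);
$xpath = new DOMXPath($dom);

// Count how many nodes each expression matches.
$counts = [];
foreach (['//div', '//meta', '//meta[@name]'] as $expression) {
    $counts[$expression] = $xpath->query($expression)->length;
}

print_r($counts);
```

Only one of the two meta tags carries a `name` attribute, so `//meta[@name]` matches fewer nodes than `//meta`.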

So, given the document you've provided...

$html = file_get_contents('amazon.html'); 

...we could write up a basic function to query it for a set of tags:

function get_from_html(string $html, array $tags) {

    $collect = [];

    // Turn off default error reporting so we're not drowning  
    // in errors when the HTML is malformed. We can get a 
    // hold of them anytime via libxml_get_errors().
    // Cf. https://www.php.net/libxml_use_internal_errors
    libxml_use_internal_errors(true);
    
    // Turn HTML string into a DOM tree.    
    $dom = new DomDocument;
    $dom->loadHTML($html);
    
    // Set up XPath
    $xpath = new DomXPath($dom);

    // Query the DOM tree for the given set of tags.
    foreach ($tags as $tag) {

        // You can do *a lot* more with XPath, cf. this cheat sheet:
        // https://gist.github.com/LeCoupa/8c305ec8c713aad07b14 
        $result = $xpath->query("//{$tag}"); 

        if ($result instanceof DOMNodeList) {

            $collect[$tag] = $result;
        }
    }

    // Clear errors to free up memory, cf.
    // https://www.php.net/manual/de/function.libxml-use-internal-errors.php#78236
    libxml_clear_errors();

    return $collect;
}

When invoking it ...

$results = get_from_html($html, ['title', 'meta']);

...it returns an array of iterable DOMNodeList objects, which you could easily evaluate further (for example, to examine the attributes of all nodes in the list):

// For demonstration purposes, just walk the results and turn each found node
// back to its HTML representation.
// 
// For real world stuff, cf.:
// - https://www.php.net/manual/en/class.domnodelist.php
// - https://www.php.net/manual/en/class.domnode.php
// - https://www.php.net/manual/en/class.domelement.php
if (!empty($results)) {

    foreach ($results as $key => $nodes) {

        if ($key == 'title') {

            $node = $nodes->item(0);
            
            // Get HTML, cf. https://stackoverflow.com/a/12909924/3323348
            // Output: <title>Amazon.com: Seagate (...)</title>
            var_dump($node->ownerDocument->saveHTML($node));
        }

        if ($key == 'meta') {

            foreach ($nodes as $node) {

                // Get HTML, cf. https://stackoverflow.com/a/12909924/3323348
                // Output: <meta (...)>
                var_dump($node->ownerDocument->saveHTML($node));

                // Or get an attribute
                if ($node->hasAttribute('name')) {
 
                    // Output: "keywords", or "description", or...  
                    var_dump($node->getAttribute('name'));
                }
            }
        }
    }
}
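Since the question asks for the content values rather than the nodes themselves, the node lists can be reduced to a plain array. The following is a minimal sketch under stated assumptions, not a definitive implementation: the function name `extract_page_metadata` and the sample markup are hypothetical, and it assumes that only meta tags with a `content` attribute and either a `name` or a `property` attribute are of interest.

```php
<?php
// Hypothetical helper (not part of PHP or any library): returns the <title>
// text plus the content of every <meta> tag, keyed by its name or property.
function extract_page_metadata(string $html): array
{
    // Collect parser errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    // item(0) is null when the document has no <title>.
    $title_node = $xpath->query('//title')->item(0);
    $data = [
        'title' => $title_node ? trim($title_node->textContent) : null,
        'meta'  => [],
    ];

    // <meta name="..."> covers description/keywords; <meta property="...">
    // covers Open Graph tags such as og:title.
    foreach ($xpath->query('//meta[@content][@name or @property]') as $node) {
        $key = $node->getAttribute('name') ?: $node->getAttribute('property');
        $data['meta'][$key] = $node->getAttribute('content');
    }

    // Free the collected errors and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $data;
}

// Made-up sample input.
$sample = '<html><head><title> Demo page </title>'
    . '<meta name="description" content="A demo">'
    . '<meta property="og:title" content="Demo OG title">'
    . '<meta charset="utf-8">'
    . '</head><body></body></html>';

print_r(extract_page_metadata($sample));
```

The returned array could then be passed as the `data` value of the JSON response in the question's controller. If two meta tags share the same key, the later one overwrites the earlier one in this sketch.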

nosurs