
I want to scrape all pages of a website and get each page's meta description, like:

<meta name="description" content="I want to get this description of this meta tag" />

Similarly, for all other pages, I want to get their individual meta descriptions.

Here is my code:

add_action('woocommerce_before_single_product', 'my_function_get_description');

function my_function_get_description() {
    $the_html = file_get_contents('https://tipodense.dk/');
    print_r($the_html);
}

This `print_r($the_html)` gives me the whole page's HTML; I don't know how to get the meta description of each page.

Kindly guide me, thanks.

JoSSte
mehmood khan

2 Answers


You have to look at preg_match and regular expressions. Here it's quite simple:

function my_function_get_description($url) {
    $the_html = file_get_contents('https://tipodense.dk/');
    preg_match('/<meta name="description" content="([^"]+)"/i', $the_html, $matches);
    print_r($matches);
}

https://regex101.com/r/JMcaUh/1

The description is captured by the capturing group () and saved in $matches[1].

EDIT: DOMDocument is a great solution too, but assuming you only want the description, using a regex looks easier to me!

Camille
  • Just be aware of the [possible pitfalls of using a RegEx to parse HTML](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454) – Professor Abronsius Dec 01 '22 at 10:10
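As a small, hypothetical illustration of that pitfall (this snippet is not from the original answer): HTML does not fix attribute order, so a regex that assumes `name` comes before `content` silently misses a perfectly valid tag that a DOM parser handles fine.

```php
<?php
// Valid HTML, but with content before name – the regex assumes the
// opposite attribute order and therefore finds nothing.
$html = '<meta content="My description" name="description" />';

$found = preg_match('/<meta name="description" content="([^"]+)"/i', $html, $m);
var_dump($found); // int(0) – no match, even though a description exists

// DOMDocument/XPath matches on the attribute itself, so order is irrelevant.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();
$xp = new DOMXPath($dom);
echo $xp->query('//meta[@name="description"]')->item(0)->getAttribute('content');
// My description
```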

The better way to parse an HTML file is to use DOMDocument and, in many cases, combine that with DOMXPath to run queries on the DOM to find elements of interest.

For instance, in your case to extract the meta description you could do:

$url='https://tipodense.dk/';

# create the DOMDocument and load url
libxml_use_internal_errors( true );
$dom=new DOMDocument;
$dom->validateOnParse=false;
$dom->strictErrorChecking=false;
$dom->recover=true;
$dom->loadHTMLFile( $url );
libxml_clear_errors();

# load XPath
$xp=new DOMXPath( $dom );
$expr='//meta[@name="description"]';

$col=$xp->query($expr);
if( $col && $col->length > 0 ){
    foreach( $col as $node ){
        echo $node->getAttribute('content');
    }
}

Which yields:

Har du brug for at vide hvad der sker i Odense? Vores fokuspunkter er især events, mad, musik, kultur og nyheder. Hvis du vil vide mere så læs med på sitet.
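A possible refinement (a sketch, not part of the original answer; the helper name `extract_meta_description` is made up): wrap the lookup in a function that takes raw HTML and returns the description, or null when the page has none. Taking a string rather than a URL also makes it easy to test without network access.

```php
<?php
// Hypothetical helper: extract the meta description from an HTML string.
function extract_meta_description(string $html): ?string {
    libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->recover = true;
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xp = new DOMXPath($dom);
    $node = $xp->query('//meta[@name="description"]')->item(0);

    // item(0) is null when the page has no description meta tag.
    return $node !== null ? $node->getAttribute('content') : null;
}

// Usage: pair it with file_get_contents() per page.
echo extract_meta_description(
    '<html><head><meta name="description" content="Hello"></head></html>'
); // Hello
```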

Using the sitemap (or part of it), you could do something like this:

$url='https://tipodense.dk/';
$sitemap='https://tipodense.dk/sitemap-pages.xml';

$urls=array();

# create the DOMDocument and load url
libxml_use_internal_errors( true );
$dom=new DOMDocument;
$dom->validateOnParse=false;
$dom->strictErrorChecking=false;
$dom->recover=true;

# read the sitemap & store urls
$dom->load( $sitemap );
libxml_clear_errors();

$col=$dom->getElementsByTagName('loc');
foreach( $col as $node )$urls[]=$node->nodeValue;

foreach( $urls as $url ){
    
    $dom->loadHTMLFile( $url );
    libxml_clear_errors();
    
    # load XPath
    $xp=new DOMXPath( $dom );
    $expr='//meta[@name="description"]';

    $col=$xp->query( $expr );
    if( $col && $col->length > 0 ){
        foreach( $col as $node ){
            printf('<div>%s: %s</div>', $url, $node->getAttribute('content') );
        }
    }
}
Professor Abronsius
  • I need to get the description of all pages, not a single page; how can I do this? – mehmood khan Dec 01 '22 at 09:56
  • Like, if the website contains 10 pages, I need the description of each of the 10 pages – mehmood khan Dec 01 '22 at 09:58
  • If you have a pre-compiled list of page urls you iterate through that list and use the above to read the meta tag. If you do **not** have a pre-compiled list you would need to try to identify other pages, perhaps by reading `robots.txt` file if it exists ( it does but is not very useful ) or by scanning the initial page for hyperlinks that relate to the same domain and then scanning each page using the list you just compiled. You **might** be able to read the sitemap file and use that as the basis of the scan – Professor Abronsius Dec 01 '22 at 09:59
  • How can I get the other pages' descriptions? I don't have a pre-compiled list of page urls – mehmood khan Dec 01 '22 at 10:02
  • 1
    You can use the website sitemap as a reference to get all the available url : https://tipodense.dk/sitemap.xml – Camille Dec 01 '22 at 10:18
  • I have 11 `urls` but only got 7; what could be the reason? I'm using the above code – mehmood khan Dec 02 '22 at 08:45