I have created a simple PHP script that parses an HTML document and returns meta tags using getElementsByTagName and getAttribute. It works perfectly apart from one thing: if the HTML tag is not in lower case, it does not return the content of the tag. For example:

<title>My Title</title>

Will return "My Title" but

<Title>My Title</Title>

or

<TITLE>My Title</TITLE> 

will return nothing. Is there an easy way to get it to match the tag regardless of case? I'm guessing that it might involve regex.

Sample of code below:

$nodes = $doc->getElementsByTagName('title');
$heading = $doc->getElementsByTagName('h1');
$title = $nodes->item(0)->nodeValue;
$h1 = $heading->item(0)->nodeValue;

$metas = $doc->getElementsByTagName('meta');

for ($i = 0; $i < $metas->length; $i++) {
    $meta = $metas->item($i);
    if ($meta->getAttribute('name') == 'description') {
        $description = $meta->getAttribute('content');
    }
    if ($meta->getAttribute('name') == 'keywords') {
        $keywords = $meta->getAttribute('content');
    }
    if ($meta->getAttribute('name') == 'robots') {
        $robots = $meta->getAttribute('content');
    }
}
heptadecagram
zen_mind
  • Can't you just strtolower the html before you use regex? – Enijar May 02 '14 at 11:47
  • @Enijar: That would change the whole content, not just tags. – Amal Murali May 02 '14 at 11:50
  • Why would you have tags any other than lowercase? It is good practice to only use lowercase (with some exception like doctype). – putvande May 02 '14 at 13:09
  • Where does the value of `$doc` come from? I'd be surprised if whatever is building your DOM doesn't have a (case-insensitive) HTML option. – Quentin May 02 '14 at 13:09
  • @putvande The script is used to scan external sites for the tags. Personally, I always use lower case, but one of my colleagues who was using the tool came across a few sites on which the tool did not work because it is case sensitive. – zen_mind May 02 '14 at 13:36
  • @Quentin the value of $doc comes from earlier in the script, I didn't post the whole thing as it is quite long. This is where the value comes from: $html = file_get_contents_curl("$url"); $doc = new DOMDocument(); @$doc->loadHTML($html); – zen_mind May 02 '14 at 13:39

3 Answers

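DOMDocument::loadHTML() converts all element names to lowercase (and removes namespaces), so the case used in the source markup should not matter once the document has been loaded as HTML. Here is a small demo: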

$html = <<<'HTML'
<html><Body><Title>My Title</Title></Body></html>
HTML;

$dom = new DOMDocument();
$dom->loadHtml($html);
echo $dom->saveHtml();

Output: https://eval.in/145538

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><title>My Title</title></body></html>
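Applied to the tags from the question, a minimal sketch could look like the following. The sample markup and the `$wanted` and `$found` names are made up for illustration. One assumption worth stating: the parser normalizes element and attribute names, but attribute values such as `Description` are left as written, so the sketch lowercases the value of `name` before comparing it.

```php
<?php
// Illustrative input with mixed-case tags, attribute names and attribute values.
$html = <<<'HTML'
<HTML><HEAD>
<TITLE>My Title</TITLE>
<META NAME="Description" CONTENT="A test page">
<Meta Name="KEYWORDS" Content="php, dom">
</HEAD><BODY><H1>Heading</H1></BODY></HTML>
HTML;

// Collect parser warnings instead of silencing them with @.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Element names are lowercase after loadHTML(), so 'title' matches <TITLE>.
$title = $doc->getElementsByTagName('title')->item(0)->nodeValue;

// Attribute values keep their original case, so normalize before comparing.
$wanted = ['description', 'keywords', 'robots'];
$found = [];
foreach ($doc->getElementsByTagName('meta') as $meta) {
    $name = strtolower(trim($meta->getAttribute('name')));
    if (in_array($name, $wanted, true)) {
        $found[$name] = $meta->getAttribute('content');
    }
}

var_dump($title, $found);
```

If a title still comes back empty with this approach, the cause is more likely the fetched markup itself than the case of the tag name.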

XML is case-sensitive, so if you load something as XML it will keep the element names exactly as they are:

$html = <<<'HTML'
<html><Body><Title>My Title</Title></Body></html>
HTML;

$dom = new DOMDocument();
$dom->loadXml($html);
echo $dom->saveXml();

Output: https://eval.in/145539

<?xml version="1.0"?>
<html><Body><Title>My Title</Title></Body></html>

This affects the DOM methods and XPath. With the HTML loader, the lowercase name matches:

$html = <<<'HTML'
<html><Body><Title>My Title</Title></Body></html>
HTML;

$dom = new DOMDocument();
$dom->loadHtml($html);

var_dump(
  // One element "title"
  $dom->getElementsByTagName('title')->length
);

$xpath = new DOMXpath($dom);
var_dump(
  // "title" as string
  $xpath->evaluate('string(//title)')
);

Output: https://eval.in/145541

int(1)
string(8) "My Title"
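For comparison, here is a sketch of what happens when the same markup is loaded as XML, and one possible way to match the name regardless of case anyway. It relies on the XPath 1.0 `translate()` function; the `$upper`, `$lower` and `$expr` variables are illustrative names, and only ASCII letters are handled.

```php
<?php
$xml = '<html><Body><Title>My Title</Title></Body></html>';

$dom = new DOMDocument();
$dom->loadXML($xml);

// Loaded as XML, the exact-case lookup finds nothing for 'title'.
$exact = $dom->getElementsByTagName('title')->length;

// XPath 1.0 has no lower-case(), so translate() maps A-Z to a-z instead.
$upper = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
$lower = 'abcdefghijklmnopqrstuvwxyz';
$expr  = "string(//*[translate(local-name(), '$upper', '$lower') = 'title'])";

$xpath = new DOMXPath($dom);
$title = $xpath->evaluate($expr);

var_dump($exact, $title);
```

This should only be needed when the input really has to be treated as XML; for HTML input, the HTML loader already does the normalization.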
ThW

The answer is no, given what you are using. getElementsByTagName is used for parsing an XML DOM, and XML allows for case-sensitive tag names.

You can go the super-slow route by trying each iteration of Title, tItle, tiTle, etc., but you'll generally just see the three options you mentioned (all-lower, initial-caps, and all-caps), which makes your job a bit easier.
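A rough sketch of that approach, limited to the three common spellings, might look like this. The helper name `firstByTagVariants` is hypothetical, and the document is loaded as XML here so that the case of the tag name actually matters.

```php
<?php
/**
 * Hypothetical helper: return the first element whose name matches $tag
 * in all-lower, initial-caps or all-caps form, or null if none is found.
 */
function firstByTagVariants(DOMDocument $doc, string $tag): ?DOMElement
{
    $lower = strtolower($tag);
    $variants = array_unique([$lower, ucfirst($lower), strtoupper($tag)]);

    foreach ($variants as $variant) {
        $node = $doc->getElementsByTagName($variant)->item(0);
        if ($node instanceof DOMElement) {
            return $node;
        }
    }

    return null;
}

$doc = new DOMDocument();
$doc->loadXML('<html><Body><TITLE>My Title</TITLE></Body></html>');

$node = firstByTagVariants($doc, 'title');
echo $node === null ? 'not found' : $node->nodeValue, PHP_EOL;
```

Spellings such as `tItle` are deliberately not covered; handling every combination would need a different technique.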

heptadecagram
  • XML doesn't just "allow" for case-sensitive tag names - it *requires* them. Tag names and attribute names are *always* case-sensitive in XML. – BoltClock May 02 '14 at 13:06

An XML document can have two different elements, named Title and title respectively, that are intended to be different. Treating them as the same name is an error that can have serious consequences.

In your case, though, you can make use of XSLT to translate all uppercase characters to lowercase characters as described in this answer.

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:variable name="vUpper" select=
 "'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>

 <xsl:variable name="vLower" select=
 "'abcdefghijklmnopqrstuvwxyz'"/>

 <xsl:template match="node()|@*">
     <xsl:copy>
       <xsl:apply-templates select="node()|@*"/>
     </xsl:copy>
 </xsl:template>

 <xsl:template match="*[name()=local-name()]" priority="2">
  <xsl:element name="{translate(name(), $vUpper, $vLower)}"
   namespace="{namespace-uri()}">
       <xsl:apply-templates select="node()|@*"/>
  </xsl:element>
 </xsl:template>

 <xsl:template match="*" priority="1">
  <xsl:element name=
   "{substring-before(name(), ':')}:{translate(local-name(), $vUpper, $vLower)}"
   namespace="{namespace-uri()}">
       <xsl:apply-templates select="node()|@*"/>
  </xsl:element>
 </xsl:template>

 <xsl:template match="@*[name()=local-name()]" priority="2">
  <xsl:attribute name="{translate(name(), $vUpper, $vLower)}"
   namespace="{namespace-uri()}">
       <xsl:value-of select="."/>
  </xsl:attribute>
 </xsl:template>

 <xsl:template match="@*" priority="1">
  <xsl:attribute name=
   "{substring-before(name(), ':')}:{translate(local-name(), $vUpper, $vLower)}"
   namespace="{namespace-uri()}">
     <xsl:value-of select="."/>
  </xsl:attribute>
 </xsl:template>
</xsl:stylesheet>
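To run a stylesheet like this from PHP, something along these lines may work. It assumes the `xsl` extension is enabled (it provides `XSLTProcessor`). To stay short, the embedded stylesheet is a reduced version of the one above: it only lowercases element names and is not meant for prefixed names. The variable names are illustrative.

```php
<?php
// Reduced stylesheet (assumption: unprefixed element names only).
$xsl = <<<'XSL'
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:variable name="vUpper" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
 <xsl:variable name="vLower" select="'abcdefghijklmnopqrstuvwxyz'"/>

 <!-- Identity rule: copy everything that has no more specific rule. -->
 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <!-- Recreate each element under its lowercased name. -->
 <xsl:template match="*" priority="1">
  <xsl:element name="{translate(name(), $vUpper, $vLower)}"
   namespace="{namespace-uri()}">
   <xsl:apply-templates select="node()|@*"/>
  </xsl:element>
 </xsl:template>
</xsl:stylesheet>
XSL;

$source = new DOMDocument();
$source->loadXML('<html><Body><Title>My Title</Title></Body></html>');

$style = new DOMDocument();
$style->loadXML($xsl);

$processor = new XSLTProcessor();
$processor->importStylesheet($style);

// The transformed copy has lowercase element names; $source is unchanged.
$result = $processor->transformToDoc($source);
$title = $result->getElementsByTagName('title')->item(0)->nodeValue;

echo $title, PHP_EOL;
```

The transformation produces a new document, so the original case is still available in `$source` if it matters later.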
Community
Aeveus