Case sensitivity with getElementsByTagName and getAttribute - PHP

Question

I have created a simple PHP script that parses an HTML document and returns meta tags using getElementsByTagName and getAttribute. It works perfectly apart from one thing: if an HTML tag is not in lower case, the script does not return the content of the tag. For example:

<title>My Title</title>

Will return "My Title" but

<Title>My Title</Title>

or

<TITLE>My Title</TITLE>

will return nothing. Is there an easy way to get it to match the tag regardless of case? I'm guessing that it might involve regex.

Sample of code below:

$nodes = $doc->getElementsByTagName('title');
$heading = $doc->getElementsByTagName('h1');
$title = $nodes->item(0)->nodeValue;
$h1 = $heading->item(0)->nodeValue;

$metas = $doc->getElementsByTagName('meta');

for ($i = 0; $i < $metas->length; $i++) {
    $meta = $metas->item($i);
    if ($meta->getAttribute('name') == 'description') {
        $description = $meta->getAttribute('content');
    }
    if ($meta->getAttribute('name') == 'keywords') {
        $keywords = $meta->getAttribute('content');
    }
    if ($meta->getAttribute('name') == 'robots') {
        $robots = $meta->getAttribute('content');
    }
}

@Enijar: That would change the whole content, not just tags. — Amal Murali, May 02 '14 at 11:50
Why would you have tags any other than lowercase? It is good practice to only use lowercase (with some exception like doctype). — putvande, May 02 '14 at 13:09
Where does the value of `$doc` come from? I'd be surprised if whatever is building your DOM doesn't have a (case-insensitive) HTML option. — Quentin, May 02 '14 at 13:09
@putvande The script is used to scan external sites for the tags. Personally, I always use lower case, but one of my colleagues who was using the tool came across a few sites on which the tool did not work because it is case sensitive. — zen_mind, May 02 '14 at 13:36
@Quentin the value of $doc comes from earlier in the script, I didn't post the whole thing as it is quite long. This is where the value comes from: $html = file_get_contents_curl("$url"); $doc = new DOMDocument(); @$doc->loadHTML($html); — zen_mind, May 02 '14 at 13:39

score 2 · Answer 1 · answered May 02 '14 at 14:10

DOMDocument::loadHtml() converts all element names to lowercase (and removes namespaces). Here is a small demo:

$html = <<<'HTML'
<html><Body><Title>My Title</Title></Body></html>
HTML;

$dom = new DOMDocument();
$dom->loadHtml($html);
echo $dom->saveHtml();

Output: https://eval.in/145538

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><title>My Title</title></body></html>

XML is case-sensitive, so if you load something as XML it will keep the element names the way they are:

$html = <<<'HTML'
<html><Body><Title>My Title</Title></Body></html>
HTML;

$dom = new DOMDocument();
$dom->loadXml($html);
echo $dom->saveXml();

Output: https://eval.in/145539

<?xml version="1.0"?>
<html><Body><Title>My Title</Title></Body></html>

This affects the DOM methods and XPath. Because loadHtml() lowercases the element names, both find the element under its lowercase name:

$html = <<<'HTML'
<html><Body><Title>My Title</Title></Body></html>
HTML;

$dom = new DOMDocument();
$dom->loadHtml($html);

var_dump(
  // One element "title"
  $dom->getElementsByTagName('title')->length
);

$xpath = new DOMXpath($dom);
var_dump(
  // "title" as string
  $xpath->evaluate('string(//title)')
);

Output: https://eval.in/145541

int(1)
string(8) "My Title"
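
Applied to the code in the question, this could look roughly like the sketch below. The sample markup is made up for illustration, and the use of libxml_use_internal_errors() in place of @ is my own choice, not something from the question. One assumption worth checking against your real pages: the HTML parser lowercases element and attribute names, but it leaves attribute values alone, so a value such as NAME="Description" still needs strtolower() before it is compared.

```php
<?php
// Mixed-case markup standing in for an external page (invented for this sketch).
$html = <<<'HTML'
<HTML><HEAD><TITLE>My Title</TITLE>
<META NAME="Description" CONTENT="A sample page">
<Meta Name="ROBOTS" Content="noindex">
</HEAD><BODY><H1>Heading</H1></BODY></HTML>
HTML;

$doc = new DOMDocument();
// Collect parser warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Element names were lowercased by the HTML parser, so 'title' matches <TITLE>.
$titleNode = $doc->getElementsByTagName('title')->item(0);
$title = $titleNode ? $titleNode->nodeValue : null;

$found = [];
foreach ($doc->getElementsByTagName('meta') as $meta) {
    // Attribute values keep their original case, so normalise before comparing.
    $name = strtolower($meta->getAttribute('name'));
    if (in_array($name, ['description', 'keywords', 'robots'], true)) {
        $found[$name] = $meta->getAttribute('content');
    }
}

var_dump($title, $found);
```

If the title still comes back empty with code like this, the cause is probably somewhere other than tag case, for example the fetched HTML being empty.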

score 0 · Answer 2 · answered May 02 '14 at 12:58

The answer is no, given what you are using. getElementsByTagName is used for parsing an XML DOM, and XML allows for case-sensitive tag names.

You can go the super-slow route by trying each iteration of Title, tItle, tiTle, etc., but you'll generally just see the three options you mentioned (all-lower, initial-caps, and all-caps), which makes your job a bit easier.
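
A minimal sketch of that three-variant lookup, assuming the document was loaded with loadXML() so the names keep their original case. The helper name firstByTagVariants is hypothetical and the sample markup is invented for the example.

```php
<?php
/**
 * Hypothetical helper: returns the first element that matches any of the
 * given spellings of a tag name, or null if none of them match.
 */
function firstByTagVariants(DOMDocument $doc, array $variants): ?DOMElement
{
    foreach ($variants as $tag) {
        $node = $doc->getElementsByTagName($tag)->item(0);
        if ($node instanceof DOMElement) {
            return $node;
        }
    }
    return null;
}

$doc = new DOMDocument();
// Loaded as XML, so <Title> stays <Title>.
$doc->loadXML('<html><Body><Title>My Title</Title></Body></html>');

$name = 'title';
// The three common spellings: all-lower, initial-caps, all-caps.
$variants = [strtolower($name), ucfirst(strtolower($name)), strtoupper($name)];

$node = firstByTagVariants($doc, $variants);
$text = $node ? $node->nodeValue : null;
var_dump($text);
```

This only covers the three spellings listed; something like tItle would still be missed.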

heptadecagram

XML doesn't just "allow" for case-sensitive tag names - it *requires* them. Tag names and attribute names are *always* case-sensitive in XML. – BoltClock May 02 '14 at 13:06

score 0 · Answer 3 · edited May 23 '17 at 12:32

An XML document can contain two distinct elements named Title and title that are intended to be different. Treating them as the same name is an error that can have serious consequences.

In your case, though, you can make use of XSLT to translate all uppercase characters to lowercase characters as described in this answer.

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:variable name="vUpper" select=
 "'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>

 <xsl:variable name="vLower" select=
 "'abcdefghijklmnopqrstuvwxyz'"/>

 <xsl:template match="node()|@*">
     <xsl:copy>
       <xsl:apply-templates select="node()|@*"/>
     </xsl:copy>
 </xsl:template>

 <xsl:template match="*[name()=local-name()]" priority="2">
  <xsl:element name="{translate(name(), $vUpper, $vLower)}"
   namespace="{namespace-uri()}">
       <xsl:apply-templates select="node()|@*"/>
  </xsl:element>
 </xsl:template>

 <xsl:template match="*" priority="1">
  <xsl:element name=
   "{substring-before(name(), ':')}:{translate(local-name(), $vUpper, $vLower)}"
   namespace="{namespace-uri()}">
       <xsl:apply-templates select="node()|@*"/>
  </xsl:element>
 </xsl:template>

 <xsl:template match="@*[name()=local-name()]" priority="2">
  <xsl:attribute name="{translate(name(), $vUpper, $vLower)}"
   namespace="{namespace-uri()}">
       <xsl:value-of select="."/>
  </xsl:attribute>
 </xsl:template>

 <xsl:template match="@*" priority="1">
  <xsl:attribute name=
   "{substring-before(name(), ':')}:{translate(local-name(), $vUpper, $vLower)}"
   namespace="{namespace-uri()}">
     <xsl:value-of select="."/>
  </xsl:attribute>
 </xsl:template>
</xsl:stylesheet>
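
If you only need to find elements rather than rewrite the whole document, the same translate() trick can probably be used directly in an XPath query from PHP. This is a sketch under assumptions, not part of the linked answer: the sample markup and variable names are invented, and like the stylesheet it only folds the ASCII letters A to Z.

```php
<?php
$doc = new DOMDocument();
// Loaded as XML, so the element names keep their original case.
$doc->loadXML('<html><Body><TITLE>My Title</TITLE></Body></html>');

$upper = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
$lower = 'abcdefghijklmnopqrstuvwxyz';

$xpath = new DOMXPath($doc);
// Lowercase each element's local name inside the query, then compare it.
$query = "//*[translate(local-name(), '$upper', '$lower') = 'title']";

$titles = [];
foreach ($xpath->query($query) as $node) {
    $titles[] = $node->nodeValue;
}
var_dump($titles);
```

Note that this deliberately treats Title and title as the same element, so the caveat at the top of this answer applies to it as well.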
