What is the right way to find JavaScript in markup and determine whether it lies above the title tag?

Question

I need to check whether a page's HTML markup includes Google Analytics in a script block, and also whether that script block appears above the <title> tag.

I've managed to get the source code of the webpage into a variable.

I'm struggling to write the correct regex to pull out the Google Analytics section of code, so that I can tell firstly whether it's present and secondly whether the JavaScript comes before the <title> tag.

Any suggestions?

Use some html parser (HtmlAgilityPack for example) to do this. — Łukasz Wiatrak, Oct 24 '11 at 15:14
Please don't try using regex to parse html. I second using the HtmlAgilityPack. — , Oct 24 '11 at 16:08
Need I say more: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 :) — Rune FS, Oct 24 '11 at 17:57

score 3 · Accepted Answer · answered Oct 24 '11 at 17:54

Avoid using regex to parse HTML; there are far too many pitfalls. Suppose you search for the string "<title" in your document. What if the document uses "<TITLE" instead? OK, case-insensitive matching handles that easily. But what if there is a "<title" string embedded in a comment? What if there is such a string embedded in a script block? And so on.
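To make the comment pitfall concrete, here is a small sketch using only the standard regex classes. The page content is made up for illustration: a commented-out <title> comes first, then a script, then the real <title>.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexPitfall
{
    static void Main()
    {
        // Hypothetical markup: the commented-out <title> comes first,
        // then the script, then the real <title>.
        string html =
            "<html><head>" +
            "<!-- <title>Old title</title> -->" +
            "<script src=\"http://www.google-analytics.com/ga.js\"></script>" +
            "<title>Real title</title>" +
            "</head><body></body></html>";

        // Plain text search: position of the first "<title" and "<script".
        int titleAt = Regex.Match(html, "<title", RegexOptions.IgnoreCase).Index;
        int scriptAt = Regex.Match(html, "<script", RegexOptions.IgnoreCase).Index;

        // The search lands on the commented-out title, so it claims the
        // script is NOT before the title, which is wrong for the real markup.
        Console.WriteLine(scriptAt < titleAt); // prints False
    }
}
```

You could patch the pattern to skip comments, but then the same problem shows up again for script blocks, CDATA sections and so on.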

Any "search" of an HTML document needs to do more than simply text search. It needs to be document-aware. And that's what the HtmlAgilityPack provides. It's a free download.

Start with something like this:

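using System;
using HtmlAgilityPack;
   ....

    HtmlDocument doc = new HtmlDocument();
    // Load reads from a file; if the markup is already in a string,
    // call doc.LoadHtml(yourString) instead.
    doc.Load(fileName);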
    var titles = doc.DocumentNode.SelectNodes("/html/head/title");
    if (titles != null)
    {
        foreach(var title in titles)
        {
            Console.WriteLine("<title> on line: " + title.Line);
        }
        var scripts = doc.DocumentNode.SelectNodes("/html/head/script");
        if (scripts != null)
        {
            foreach(var script in scripts)
            {
                Console.WriteLine("<script> on line: " + script.Line);
                // here, you need to decide if the script is before the title
                // and if it is the "right" script - google analytics. 
                // you have to do that part yourself.
            }
        }
        else
        {
            Console.WriteLine("No script nodes found.");
        }
    }
    else
    {
        Console.WriteLine("No title node found.");
    }
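One possible way to fill in the part left open in the snippet above is sketched below. It is not the only approach, and it rests on a few assumptions: the markup is already in a string (so it uses LoadHtml rather than Load), a script counts as "Google Analytics" if its src attribute or inline body mentions "google-analytics.com", and the class and method names (AnalyticsCheck, Check, IsGoogleAnalytics) are made up for this example. Instead of comparing line numbers, it walks the nodes in document order and notes whether a <title> was seen before the first matching script.

```csharp
using System;
using HtmlAgilityPack;

static class AnalyticsCheck
{
    // Assumption: the marker string below identifies the analytics snippet.
    // Adjust it to match the snippet your pages actually use.
    const string Marker = "google-analytics.com";

    static bool IsGoogleAnalytics(HtmlNode script)
    {
        string src = script.GetAttributeValue("src", "");
        return src.IndexOf(Marker, StringComparison.OrdinalIgnoreCase) >= 0
            || script.InnerHtml.IndexOf(Marker, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // found:       a Google Analytics script exists somewhere in the page.
    // beforeTitle: no <title> element was encountered before that script.
    //              (Also true when the page has no <title> at all.)
    public static void Check(string html, out bool found, out bool beforeTitle)
    {
        found = false;
        beforeTitle = false;

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        bool titleSeen = false;

        // Descendants() yields nodes in document order; comments are
        // separate comment nodes, so a "<title" inside one is not counted.
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            if (string.Equals(node.Name, "title", StringComparison.OrdinalIgnoreCase))
            {
                titleSeen = true;
            }
            else if (string.Equals(node.Name, "script", StringComparison.OrdinalIgnoreCase)
                     && IsGoogleAnalytics(node))
            {
                found = true;
                beforeTitle = !titleSeen;
                return;
            }
        }
    }
}
```

Call it as AnalyticsCheck.Check(pageSource, out found, out beforeTitle), where pageSource is the variable holding the page markup. If you only care about scripts inside <head>, you could restrict the walk to the head node instead of the whole document.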
