I need to check whether an HTML page's markup includes Google Analytics in a script block, and also that the script block appears above the <title> tag.

I've managed to get the source code of the webpage into a variable.

I'm struggling to write a regex that pulls out the Google Analytics section of the code, so that I can tell first whether it is present and second whether the JavaScript comes before the <title> tag.

Any suggestions?

Joel Coehoorn
James Radford

1 Answer

Avoid using regex to parse HTML; there are too many pitfalls. Suppose you search for the string "<title" in your document. What if the document uses "<TITLE" instead? That one is easy: use a case-insensitive match. But what if a "<title" string is embedded in a comment? What if one is embedded in a script block? And so on.
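To make the comment pitfall concrete, here is a minimal sketch using only System.Text.RegularExpressions. The class name NaiveTitleSearch and the sample markup are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveTitleSearch
{
    // Returns the index of the first case-insensitive "<title" match,
    // or -1 when there is none. This is plain text search: it has no
    // idea whether the match sits inside a comment or a script block.
    public static int FirstTitleIndex(string html)
    {
        Match m = Regex.Match(html, "<title", RegexOptions.IgnoreCase);
        return m.Success ? m.Index : -1;
    }
}
```

Given the markup "<!-- <title>old</title> -->\n<head><title>Real</title></head>", the match lands inside the comment, before the <head> element even begins, so any "is the script before the title?" comparison built on that index would be wrong.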

Any search of an HTML document needs to do more than plain text matching. It needs to be document-aware, and that is what the HtmlAgilityPack provides. It's a free download.

Start with something like this:

using System;
using HtmlAgilityPack;
   ....

    HtmlDocument doc = new HtmlDocument();
    // Load(fileName) reads the markup from a file. If the source is already
    // in a string variable, call doc.LoadHtml(html) instead.
    doc.Load(fileName);

    // SelectNodes returns null (not an empty collection) when nothing matches.
    var titles = doc.DocumentNode.SelectNodes("/html/head/title");
    if (titles != null)
    {
        foreach (var title in titles)
        {
            Console.WriteLine("<title> on line: " + title.Line);
        }

        var scripts = doc.DocumentNode.SelectNodes("/html/head/script");
        if (scripts != null)
        {
            foreach (var script in scripts)
            {
                Console.WriteLine("<script> on line: " + script.Line);
                // Here you need to decide whether the script is before the title
                // and whether it is the "right" script (Google Analytics).
                // You have to do that part yourself.
            }
        }
        else
        {
            Console.WriteLine("No script nodes found.");
        }
    }
    else
    {
        Console.WriteLine("No title node found.");
    }
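One possible way to fill in the part left open above is sketched below, under some loud assumptions. The marker strings ("google-analytics.com", "googletagmanager.com") are a guess at what identifies a Google Analytics snippet, so adjust them to the snippet you actually deploy. The names AnalyticsCheck and AnalyticsBeforeTitle are made up for this example. It compares StreamPosition rather than Line, so a script and a title on the same line are still ordered correctly, and it uses LoadHtml because the question has the markup in a string.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class AnalyticsCheck
{
    // Assumption: one of these substrings appears in the src attribute
    // or the inline body of a Google Analytics script block.
    private static readonly string[] Markers =
    {
        "google-analytics.com",
        "googletagmanager.com"
    };

    private static bool IsAnalytics(HtmlNode script)
    {
        string text = script.GetAttributeValue("src", "") + " " + script.InnerHtml;
        return Markers.Any(m => text.IndexOf(m, StringComparison.OrdinalIgnoreCase) >= 0);
    }

    // True only when an analytics <script> starts before the <title>
    // in document order. False if either element is missing.
    public static bool AnalyticsBeforeTitle(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode title = doc.DocumentNode.SelectSingleNode("//head/title");
        if (title == null) return false;

        HtmlNodeCollection scripts = doc.DocumentNode.SelectNodes("//script");
        if (scripts == null) return false;

        return scripts.Any(s => IsAnalytics(s) && s.StreamPosition < title.StreamPosition);
    }
}
```

Calling AnalyticsCheck.AnalyticsBeforeTitle(html) with the page source then answers both questions at once. If you need to tell "missing" apart from "present but after the title", split the final check into two steps.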
Cheeso