
I'm working on a web crawler. At the moment I scrape the whole page and then use regular expressions to remove <meta>, <script>, <style>, and other tags to get the content of the body.

However, I'm trying to optimise the performance and I was wondering if there's a way I could scrape only the <body> of the page?

using System.IO;
using System.Net;
using System.Text.RegularExpressions;

namespace WebScraper
{
    public static class KrioScraper
    {
        public static string scrapeIt(string siteToScrape)
        {
            string HTML = getHTML(siteToScrape);
            string text = stripCode(HTML);
            return text;
        }

        public static string getHTML(string siteToScrape)
        {
            HttpWebRequest objRequest =
                (HttpWebRequest) WebRequest.Create(siteToScrape);
            objRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; " +
                "Windows NT 5.1; .NET CLR 1.0.3705)";
            // Dispose both the response and the reader when done.
            using (HttpWebResponse objResponse =
                (HttpWebResponse) objRequest.GetResponse())
            using (StreamReader sr =
                new StreamReader(objResponse.GetResponseStream()))
            {
                return sr.ReadToEnd();
            }
        }

        public static string stripCode(string the_html)
        {
            // Remove google analytics code and other JS
            the_html = Regex.Replace(the_html, "<script.*?</script>", "",
                RegexOptions.Singleline | RegexOptions.IgnoreCase);
            // Remove inline stylesheets
            the_html = Regex.Replace(the_html, "<style.*?</style>", "",
                RegexOptions.Singleline | RegexOptions.IgnoreCase);
            // Remove HTML tags (IgnoreCase so uppercase tags match too)
            the_html = Regex.Replace(the_html, "</?[a-z][a-z0-9]*[^<>]*>", "",
                RegexOptions.IgnoreCase);
            // Remove HTML comments
            the_html = Regex.Replace(the_html, "<!--.*?-->", "",
                RegexOptions.Singleline);
            // Remove Doctype
            the_html = Regex.Replace(the_html, "<!.*?>", "",
                RegexOptions.Singleline);
            // Remove excessive whitespace
            the_html = Regex.Replace(the_html, "[\t\r\n]", " ");

            return the_html;
        }
    }
}

From Page_Load I call the scrapeIt() method, passing it the string that I get from a textbox on the page.

Johancho

3 Answers


I'd suggest taking advantage of the HTML Agility Pack to do the HTML parsing/manipulation.

You can easily select the body like this:

var webGet = new HtmlWeb();
var document = webGet.Load(url);
var body = document.DocumentNode.SelectSingleNode("//body");
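
If you then want the visible text rather than the markup, here is a minimal end-to-end sketch; the example URL and the use of InnerText are my additions, not part of the original answer:

using System;
using HtmlAgilityPack;

class BodyScraper
{
    static void Main()
    {
        // Hypothetical URL -- substitute the page you're crawling.
        var webGet = new HtmlWeb();
        var document = webGet.Load("http://example.com/");

        // XPath query for the body element; null if the page has none.
        var body = document.DocumentNode.SelectSingleNode("//body");
        if (body != null)
        {
            // InnerText returns the text content with the markup
            // stripped, similar in spirit to the regex approach.
            Console.WriteLine(body.InnerText);
        }
    }
}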
Joel Beckham
  • Hey Joel, thanks for taking the time to help. How would HtmlAgilityPack be of help to me? Don't I have to load the page first and then parse the string? – Johancho Aug 16 '11 at 17:57
  • The agility pack can load and parse the page for you. I've updated my example. Parsing html yourself can be a major pain, especially if it isn't perfectly formed. The agility pack is really good at it. – Joel Beckham Aug 16 '11 at 17:59
  • The agility pack will need to load and parse the page beforehand, which will add extra overhead. While it's a simple and accurate solution, it is NOT fast or efficient. – Louis Ricci Aug 16 '11 at 18:01
  • Good point. You'd just have to test it and see if it's too slow for your needs. – Joel Beckham Aug 16 '11 at 18:06

Still the simplest/fastest (least accurate) method.

int start = response.IndexOf("<body", StringComparison.CurrentCultureIgnoreCase);
int end = response.LastIndexOf("</body>", StringComparison.CurrentCultureIgnoreCase);
return response.Substring(start, end-start + "</body>".Length);

Obviously if there's javascript in the HEAD tag like...

document.write("<body>");

Then you'll end up with a little more than you wanted.
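
To guard against pages where those tags are missing, here is a hedged variant of the same idea (my sketch, not part of the original answer; ordinal comparison is sufficient since tag names are ASCII):

static string ExtractBody(string response)
{
    int start = response.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
    int end = response.LastIndexOf("</body>", StringComparison.OrdinalIgnoreCase);

    // Fall back to the whole response if either tag is missing,
    // rather than letting Substring throw on a negative index.
    if (start < 0 || end < 0 || end <= start)
        return response;

    return response.Substring(start, end - start + "</body>".Length);
}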

Louis Ricci
  • +1 for adding an answer that is simple and fast for a quick job. Not everyone wants to download and deploy frameworks, especially for one-time use. Not sure why this was downvoted... – Matt Cashatt Aug 23 '12 at 14:02

I think that your best option is to use a lightweight HTML parser (something like Majestic 12, which, based on my tests, is roughly 50-100% faster than HTML Agility Pack) and only process the nodes you're interested in (anything between <body> and </body>). Majestic 12 is a little harder to use than HTML Agility Pack, but if you're looking for performance then it will definitely help you!

This will get you the closest to what you're asking for, but you will still have to download the entire page. I don't think there is a way around that. What you will save on is actually generating the DOM nodes for all the other content (aside from the body). You will have to parse them, but you can skip the entire content of a node you're not interested in processing.

Here is a good example of how to use the M12 parser.

I don't have a ready example of how to grab the body, but I do have one of how to grab only the links, and with a little modification it will get there. Here is the rough version:

GrabBody(ParserTools.OpenM12Parser(_response.BodyBytes));

You need to open the M12 parser (the example project that comes with M12 has comments that detail exactly how all of these options affect performance, AND THEY DO!!!):

public static HTMLparser OpenM12Parser(byte[] buffer)
{
    HTMLparser parser = new HTMLparser();
    parser.SetChunkHashMode(false);
    parser.bKeepRawHTML = false;
    parser.bDecodeEntities = true;
    parser.bDecodeMiniEntities = true;

    // Mini entities only need explicit initialization when
    // full entity decoding is turned off.
    if (!parser.bDecodeEntities && parser.bDecodeMiniEntities)
        parser.InitMiniEntities();

    parser.bAutoExtractBetweenTagsOnly = true;
    parser.bAutoKeepScripts = true;
    parser.bAutoMarkClosedTagsWithParamsAsOpen = true;
    parser.CleanUp();
    parser.Init(buffer);
    return parser;
}

Parse the body:

public void GrabBody(HTMLparser parser)
{

    // parser will return us tokens called HTMLchunk -- warning DO NOT destroy it until end of parsing
    // because HTMLparser re-uses this object
    HTMLchunk chunk = null;

    // we parse until the returned chunk is null, indicating we reached the end of parsing
    while ((chunk = parser.ParseNext()) != null)
    {
        switch (chunk.oType)
        {
            // matched open tag, ie <a href="">
            case HTMLchunkType.OpenTag:
                if (chunk.sTag == "body")
                {
                    // Start generating the DOM node (as shown in the previous example link)
                }
                break;

            // matched close tag, ie </a>
            case HTMLchunkType.CloseTag:
                break;

            // matched normal text
            case HTMLchunkType.Text:
                break;

            // matched HTML comment, that's stuff between <!-- and -->
            case HTMLchunkType.Comment:
                break;
        }
    }
}
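
To give a feel for where the extraction slots in, here is a rough sketch of that loop specialized to the body. This is my own sketch, not from the M12 samples; in particular, I'm assuming the chunk's text is exposed through the oHTML field, so double-check against the example project:

public static string GrabBodyText(HTMLparser parser)
{
    var body = new System.Text.StringBuilder();
    bool insideBody = false;

    HTMLchunk chunk = null;
    while ((chunk = parser.ParseNext()) != null)
    {
        switch (chunk.oType)
        {
            case HTMLchunkType.OpenTag:
                if (chunk.sTag == "body")
                    insideBody = true;
                break;

            case HTMLchunkType.CloseTag:
                // Nothing after </body> matters, so stop parsing early.
                if (chunk.sTag == "body")
                    return body.ToString();
                break;

            case HTMLchunkType.Text:
                // Assumption: oHTML carries the chunk's text content.
                if (insideBody)
                    body.Append(chunk.oHTML);
                break;
        }
    }
    return body.ToString();
}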

Generating the DOM nodes is tricky, but the Majestic12ToXml class will help you do that. Like I said, this is by no means equivalent to the 3-liner you saw with HTML Agility Pack, but once you get the tools down you will be able to get exactly what you need for a fraction of the performance cost and probably just as many lines of code.

Kiril
  • +1: Nice. I didn't know about Majestic 12. I'll have to check it out. – Joel Beckham Aug 16 '11 at 18:03
  • @Lirik: I'd like to check it out as well; you say it's more difficult, can you point to anything on how different it is? I can't see any online documentation or samples. – casperOne Aug 16 '11 at 18:07
  • Thanks Lirik. The only thing is I cannot find documentation or the API to use this library. Could you point me to a link? – Johancho Aug 16 '11 at 18:08
  • If you get a chance, could you update your answer with an example of grabbing the body? I'm curious to see how it works. – Joel Beckham Aug 16 '11 at 18:10
  • @Johancho, unfortunately the documentation is kinda sparse, but it does come with an example project which shows you how the basics work. Check out the files on the M12 page: http://www.majestic12.co.uk/projects/html_parser.php It took me about a day to figure out how to use it and it's well worth the time spent (note I had no previous experience in HTML/DOM parsing). – Kiril Aug 16 '11 at 18:23
  • @casperOne, when I say it's more "difficult" I mean that you have to do some of the things by yourself, or at least you have to create some helper classes that will do them for you. Once you have your helper classes done, then everything else becomes a piece of cake. See the examples I've posted for more details. It's got a small learning curve, but it's well worth the time. – Kiril Aug 16 '11 at 18:32
  • @Lirik: Curious, one of the things that I find tremendously useful is using HTML Agility Pack with XPath-like expressions to be able to scrape specific content from my HTML. What are your thoughts on the tradeoff between ease of use (in this specific case, finding specific sets of nodes repeatedly in the same document) and performance? If performance isn't impacting the process negatively, do you still think that Majestic 12 is a good solution? Note, this isn't a slight in any way, I just want to understand the capabilities and the best applications of each better. – casperOne Aug 18 '11 at 13:41
  • @casperOne: It's fairly trivial: you get your [helper class that generates the XML nodes](http://stackoverflow.com/questions/100358/looking-for-c-html-parser/624410#624410), then you can just call the `CreateNavigator()` method on the resulting `XNode` and you will get an `XPathNavigator`, then `XPath` all your heart desires :). Once you have an `XNode` for the root of your document, then I don't see any difference in ease of use or performance, but you did generate the `XNode` about 50-100% faster than you would have with HTML AP. In my opinion that warrants the use of M12 over AP. – Kiril Aug 18 '11 at 14:42