I want to search an HTML document, scraped from a URL, and verify that the page contains specific text. Both the text and the URL are supplied by the user and can vary. I scrape the URL with an HttpWebRequest:

string quote = txtQuote.Text;
string sourceURL = txtURL.Text;
string data = string.Empty;

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(sourceURL);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();

if (response.StatusCode == HttpStatusCode.OK)
{
    Stream receiveStream = response.GetResponseStream();
    StreamReader readStream = null;

    // Use the character set declared by the server when there is one.
    if (string.IsNullOrEmpty(response.CharacterSet))
    {
        readStream = new StreamReader(receiveStream);
    }
    else
    {
        readStream = new StreamReader(receiveStream,
            Encoding.GetEncoding(response.CharacterSet));
    }

    data = readStream.ReadToEnd();

    response.Close();
    readStream.Close();
}

I also have a list of HTML entities and their various possible encodings in my database. I load it into a DataTable so I can convert every encoded form to the standard HTML entity and replace non-breaking spaces with a standard space:

DataTable encodings = new DataTable();
string getEncodings = "select * from htmlentities";
SqlCommand cmdGetEncodings = new SqlCommand(getEncodings, dbcon);
encodings.Load(cmdGetEncodings.ExecuteReader());
dbcon.Close();

foreach (DataRow row in encodings.Rows)
{
    string htmlentity = row[1].ToString();
    string deccode = row[2].ToString();
    string hexcode = row[3].ToString();

    // Map the decimal and hexadecimal forms to the named entity.
    data = data.Replace(deccode, htmlentity);
    data = data.Replace(hexcode, htmlentity);
}

// Replace non-breaking spaces (entity or raw U+00A0) with a regular space.
// This does not depend on the row, so it only needs to run once.
data = data.Replace("&nbsp;", " ").Replace("\u00A0", " ");
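As a possible alternative to maintaining an entity table in the database, the framework's own decoder can be used. The following is a minimal sketch, assuming .NET 4.0 or later so that `System.Net.WebUtility.HtmlDecode` is available; the class and method names (`EntityNormalizer`, `Decode`) are hypothetical and not part of the original code. It is probably safer to apply it to the extracted text than to the raw markup, because decoding `&lt;` inside markup would create new tags.

```csharp
using System;
using System.Net;

// Hypothetical helper, not part of the original post.
public static class EntityNormalizer
{
    // Decodes named, decimal and hexadecimal character references,
    // then maps non-breaking spaces (U+00A0) to ordinary spaces.
    public static string Decode(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return string.Empty;
        }

        string decoded = WebUtility.HtmlDecode(text);
        return decoded.Replace('\u00A0', ' ');
    }
}
```

For example, `EntityNormalizer.Decode("Tom &amp; Jerry&nbsp;&#8220;hi&#x201D;")` should yield `Tom & Jerry “hi”`, with the decimal and hexadecimal forms both decoded.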

I then use HtmlAgilityPack to load the scraped and amended HTML into a new document and retrieve the inner text:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(data);

HtmlNode root = doc.DocumentNode;
string innerText = root.InnerText;

Now I'm wondering: what is the best way to accurately verify whether quote is contained within innerText? One way I've tried is:

if (innerText.IndexOf(quote) != -1)
{
    Label1.Text = "found";
}
else
{
    Label1.Text = "not found";
}

But this isn't accurate: it can't find a quote that spans nodes (e.g. one that covers more than one <p>). An example quote and URL that returns "not found":

“The agile cover point of his youth had been reduced to standing in position and stopping only those balls that came near as dammit straight at him,” is how Charlie Connolly put it in Gilbert, his fine novel about Grace’s life. “In the Australians’ first innings he’d been only too aware of the catcalls of the crowd whenever the ball had sped past him.” At the end of the match, which England drew because of Ranjitsinhji’s 93, Grace told Jackson: “It’s all over, Jacker, I shan’t play again.”
Then there was Don Bradman. The story so famous it hardly needs retelling. “I dearly wanted to do well,” Bradman admitted. He was bowled second ball by Eric Hollies, “a perfect length googly” which just touched the inside edge of his bat and then knocked the off bail. If he had scored only four his average would have been an even hundred.

URL: http://www.theguardian.com/sport/2016/feb/23/test-cricket-farewells-brendon-mccullum

However, if I searched only the first paragraph:

“The agile cover point of his youth had been reduced to standing in position and stopping only those balls that came near as dammit straight at him,” is how Charlie Connolly put it in Gilbert, his fine novel about Grace’s life. “In the Australians’ first innings he’d been only too aware of the catcalls of the crowd whenever the ball had sped past him.” At the end of the match, which England drew because of Ranjitsinhji’s 93, Grace told Jackson: “It’s all over, Jacker, I shan’t play again.”

It would return "found". Is there a way to check for the text even when it spans nodes?

Mr Lister
Dave

1 Answer

If you're only planning to scrape http://www.theguardian.com, there is a simple solution, since The Guardian's HTML is quite neat.

var hdoc = new HtmlDocument();
hdoc.LoadHtml(data); // or hdoc.Load(data) - depending on what you get from your request
var articleNodes = hdoc.DocumentNode.SelectNodes(@"//p"); // the 'p' nodes contain the article text
var quote = "my quote";
var article = string.Empty;

// SelectNodes returns null when nothing matches, so guard before looping.
if (articleNodes != null)
{
    foreach (HtmlNode node in articleNodes)
    {
        article += node.InnerText + " "; // the added space keeps adjacent paragraphs from running together
    }
}

return article.Contains(quote);

Now, if you're planning to make this work for ANY given URL, there's trouble ahead.
Since you don't know the HTML structure of those URLs, the "best" solution (and by best I mean the simplest and most cringeworthy) is the following:

var hdoc = new HtmlDocument();
hdoc.LoadHtml(data); // or hdoc.Load(data) - depending on what you get from your request
var articleNodes = hdoc.DocumentNode;
var quote = "my quote";

// Start with the text of the whole document. Do not loop over InnerText itself:
// that iterates character by character, repeats the nested loops below once per
// character, and can exhaust memory on a real page.
var text = articleNodes.InnerText + " "; // added a whitespace so we don't mess up the text.

foreach (var htmlNode in articleNodes.ChildNodes)
{
    text += htmlNode.InnerText + " ";

    foreach (var childNode in htmlNode.ChildNodes)
    {
        text += childNode.InnerText + " ";

        foreach (var childrensChildren in childNode.ChildNodes)
        {
            text += childrensChildren.InnerText + " ";
        }
    }
}

return text.Contains(quote);

Ultimately, because you don't know the HTML of the URL you're given, the number of nested foreach statements you need could go up or down. And of course there have to be some null checks on the nodes before running any of the foreach statements.
There might be a better solution out there; this is my 2 cents.
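One way to avoid guessing the nesting depth is to let HtmlAgilityPack enumerate every descendant and keep only the text nodes. This is a minimal sketch rather than part of the original answer: it assumes the HtmlAgilityPack package is referenced, the names `PageText` and `Extract` are hypothetical, and skipping `script` and `style` content is an added assumption about what counts as visible text.

```csharp
using System.Linq;
using System.Text;
using HtmlAgilityPack;

// Hypothetical helper, not part of the original answer.
public static class PageText
{
    // Collects the text of every text node in document order,
    // separated by single spaces, whatever the nesting depth.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html ?? string.Empty);

        var textNodes = doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Text
                     && n.ParentNode.Name != "script"   // assumption: skip scripts
                     && n.ParentNode.Name != "style");  // assumption: skip styles

        var sb = new StringBuilder();
        foreach (HtmlNode node in textNodes)
        {
            // DeEntitize turns character references such as &amp; into characters.
            sb.Append(HtmlEntity.DeEntitize(node.InnerText)).Append(' ');
        }

        return sb.ToString();
    }
}
```

A `StringBuilder` is used here instead of repeated `+=` because each text node is visited exactly once, so the result grows linearly with the page size.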

Working example: this returns true. I copied and pasted a portion of the article into the quote variable and checked whether our article string contained it.

string urlAddress = "http://www.theguardian.com/sport/2016/feb/23/test-cricket-farewells-brendon-mccullum";

        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlAddress);
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        string data = string.Empty;
        if (response.StatusCode == HttpStatusCode.OK)
        {
            Stream receiveStream = response.GetResponseStream();
            StreamReader readStream = null;

            if (response.CharacterSet == null)
            {
                readStream = new StreamReader(receiveStream);
            }
            else
            {
                readStream = new StreamReader(receiveStream, Encoding.GetEncoding(response.CharacterSet));
            }

            data = readStream.ReadToEnd();

            response.Close();
            readStream.Close();
        }

        var hdoc = new HtmlDocument();
        hdoc.LoadHtml(data); 
        var articleNodes = hdoc.DocumentNode.SelectNodes(@"//p"); // the 'p' nodes contains the article text
        var quote ="Sinatra couldn’t stand the song. His daughter Tina once said that her father thought it was “self-serving and self-indulgent”. By the end of the ’70s he was in the habit of introducing it by explaining how little he liked it. “I hate this song. I hate this song!” he said before performing it at Atlantic City in 1979. “I got it up to here, this goddamn song.” Of course when Sinatra died, pretty much every single TV and radio news show played him out with My Way, “the most obvious, ";
        var article = string.Empty;
        foreach (HtmlNode node in articleNodes)
        {
            article += node.InnerText + " "; // added a whitespace so we dont mess up the text.
        }

        bool containsQuote = article.Contains(quote); // true if the quote is in the article.
Cicero
  • @Dave all right, your best bet would be the second example. To make sure that you get all the necessary text, make checks on the HtmlDocuments nodes, to check if they are null or just contains an empty string, before using a whole loop to iterate through. So try to have some code that analyzes the Html content. – Cicero Feb 24 '16 at 10:02
  • thanks. I tried initially the first one, because although my URLs may vary I think it's a relatively safe assumption that the html is ordered well and text is inside <p> tags. However, that returned a null exception on the foreach (HtmlNode node in articleNodes), which is odd since my understanding of the code is that it says find each p node from all the p nodes... and there are definitely p nodes. The second example returned an OutOfMemory Exception - I guess the string is too long? – Dave Feb 24 '16 at 12:26
  • apologies, I was decoding the html data earlier. Removed that and the Null Exception goes away. However, it still doesn't find the quote. – Dave Feb 24 '16 at 12:36
  • so my hunch is the "quote" and "article" are encoded differently, and the solution is to String.Replace all letters with the same encoding (i.e. replace all a with =) so that they are encoded identically? – Dave Feb 24 '16 at 12:49
  • @Dave I'm updating the answer, hold on a sec :) The answer should contain a working example, fetching the data and parsing it. – Cicero Feb 24 '16 at 13:01
  • @Dave yes there would be some differences, the article contains a special character like such: “ So the best thing would be to replace all special characters. Take a look at: http://stackoverflow.com/questions/1120198/most-efficient-way-to-remove-special-characters-from-string – Cicero Feb 24 '16 at 13:12
  • thanks. Just to clarify: your code as above works when (as in your example) the quote is within a single node, but if it spans nodes (i.e. contains text from 2 paragraphs) then it returns false. Using the regex @"\s+" to replace multiple spaces with a single space returns the correct result. – Dave Feb 25 '16 at 09:22
  • @Dave No, the string article contains ALL text from ALL the nodes. The += appends the text of each node to the string. To remove extra spaces, special chars and so on - use Regex. – Cicero Feb 25 '16 at 09:36
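The whitespace and special-character normalization discussed in these comments could look roughly like the sketch below. It is an illustration under assumptions, not the code either commenter used: the names `QuoteMatcher`, `Normalize` and `ContainsQuote` are hypothetical, and the choice to map typographic quotes to plain ASCII quotes is one possible way to handle the “ character mentioned above. The same normalization is applied to both the page text and the quote before comparing.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original thread.
public static class QuoteMatcher
{
    // Maps typographic quotes to ASCII quotes and collapses every run of
    // whitespace (spaces, tabs, newlines, non-breaking spaces) to one space.
    public static string Normalize(string s)
    {
        if (string.IsNullOrEmpty(s))
        {
            return string.Empty;
        }

        s = s.Replace('\u2018', '\'').Replace('\u2019', '\'')
             .Replace('\u201C', '"').Replace('\u201D', '"');

        return Regex.Replace(s, @"\s+", " ").Trim();
    }

    // True when the normalized quote occurs in the normalized page text.
    public static bool ContainsQuote(string pageText, string quote)
    {
        string needle = Normalize(quote);
        if (needle.Length == 0)
        {
            return false; // an empty quote is treated as "not found"
        }

        return Normalize(pageText).IndexOf(needle, StringComparison.Ordinal) >= 0;
    }
}
```

With this, a quote pasted as two paragraphs separated by a line break would be compared against page text whose paragraphs are joined by a single space, which is the situation that made the plain `Contains` check fail.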