I want to search an HTML document, scraped from a URL, and verify that the page contains specific text. Both the text and the URL are supplied by the user and can vary. I scrape the URL with an HttpWebRequest:
    // Needs: using System.IO; using System.Net; using System.Text;
    string quote = txtQuote.Text;
    string sourceURL = txtURL.Text;
    string data = null; // declared outside the if block so the later code can use it
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(sourceURL);
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    {
        if (response.StatusCode == HttpStatusCode.OK)
        {
            // Use the charset the server declares, otherwise fall back to UTF-8
            // (the StreamReader default).
            Encoding encoding = string.IsNullOrEmpty(response.CharacterSet)
                ? Encoding.UTF8
                : Encoding.GetEncoding(response.CharacterSet);
            using (Stream receiveStream = response.GetResponseStream())
            using (StreamReader readStream = new StreamReader(receiveStream, encoding))
            {
                data = readStream.ReadToEnd();
            }
        }
    }
I also have a list of HTML entities and their various possible encodings in my database, which I load into a DataTable so I can convert any encodings to the standard HTML entity and replace non-breaking spaces with a standard space:
    DataTable encodings = new DataTable();
    string getEncodings = "select * from htmlentities";
    SqlCommand cmdGetEncodings = new SqlCommand(getEncodings, dbcon);
    encodings.Load(cmdGetEncodings.ExecuteReader());
    dbcon.Close();
    foreach (DataRow row in encodings.Rows)
    {
        string htmlentity = row[1].ToString(); // standard HTML entity
        string deccode = row[2].ToString();    // decimal encoding
        string hexcode = row[3].ToString();    // hex encoding
        data = data.Replace(deccode, htmlentity);
        data = data.Replace(hexcode, htmlentity);
    }
    // Replace non-breaking spaces (the entity and the literal character) with a
    // standard space. This only needs to run once, so it sits outside the loop.
    data = data.Replace("&nbsp;", " ").Replace("\u00A0", " ");
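As a point of comparison with the table-driven replacement, here is a minimal sketch of going the other way: decoding every entity form to its character with the framework's `WebUtility.HtmlDecode`. The `EntityNormalizer` class and `Normalize` method are names made up for illustration, and I'm assuming it would be applied to the extracted text and to the quote rather than to the raw markup, since decoding `&lt;` inside markup could change how the page parses.

```csharp
using System;
using System.Net;

// Hypothetical helper, not part of my current code.
static class EntityNormalizer
{
    // Decodes named, decimal, and hex entities to their characters, then
    // turns non-breaking spaces (U+00A0) into ordinary spaces.
    public static string Normalize(string text)
    {
        return WebUtility.HtmlDecode(text).Replace('\u00A0', ' ');
    }
}
```

If both the page text and the quote went through the same `Normalize` call, `&ldquo;`, `&#8220;` and `&#x201C;` should all end up as the same character before the comparison.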
I then use HtmlAgilityPack to load the scraped and amended HTML into a new document and retrieve the inner text:
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(data);
    HtmlNode root = doc.DocumentNode;
    string innerText = root.InnerText;
Now I’m wondering: what is the best way to accurately verify whether quote is contained within innerText? One way I’ve tried is:
    if (innerText.IndexOf(quote) != -1)
    {
        Label1.Text = "found";
    }
    else
    {
        Label1.Text = "not found";
    }
But this isn’t accurate: it can’t find text that spans nodes (e.g. text in more than one <p> element). An example quote and URL that returns not found:
“The agile cover point of his youth had been reduced to standing in position and stopping only those balls that came near as dammit straight at him,” is how Charlie Connolly put it in Gilbert, his fine novel about Grace’s life. “In the Australians’ first innings he’d been only too aware of the catcalls of the crowd whenever the ball had sped past him.” At the end of the match, which England drew because of Ranjitsinhji’s 93, Grace told Jackson: “It’s all over, Jacker, I shan’t play again.”
Then there was Don Bradman. The story so famous it hardly needs retelling. “I dearly wanted to do well,” Bradman admitted. He was bowled second ball by Eric Hollies, “a perfect length googly” which just touched the inside edge of his bat and then knocked the off bail. If he had scored only four his average would have been an even hundred.
URL: http://www.theguardian.com/sport/2016/feb/23/test-cricket-farewells-brendon-mccullum
However, if I searched only the first paragraph:
“The agile cover point of his youth had been reduced to standing in position and stopping only those balls that came near as dammit straight at him,” is how Charlie Connolly put it in Gilbert, his fine novel about Grace’s life. “In the Australians’ first innings he’d been only too aware of the catcalls of the crowd whenever the ball had sped past him.” At the end of the match, which England drew because of Ranjitsinhji’s 93, Grace told Jackson: “It’s all over, Jacker, I shan’t play again.”
It would return found. Is there a way to check for the text even when it spans nodes?
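One direction I've considered, shown here only as a sketch: compare the two strings with all whitespace removed. This assumes the mismatch comes from the whitespace at node boundaries (for example a newline, or nothing at all, between two paragraphs in innerText versus a line break in the pasted quote), which I haven't confirmed. `QuoteMatcher`, `StripWhitespace` and `ContainsIgnoringWhitespace` are hypothetical names for illustration.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of my current code.
static class QuoteMatcher
{
    // Removes every whitespace character: spaces, tabs, newlines,
    // and non-breaking spaces (char.IsWhiteSpace covers U+00A0).
    public static string StripWhitespace(string text)
    {
        return new string(text.Where(c => !char.IsWhiteSpace(c)).ToArray());
    }

    // True when needle occurs in haystack once whitespace is ignored on both sides.
    public static bool ContainsIgnoringWhitespace(string haystack, string needle)
    {
        string strippedNeedle = StripWhitespace(needle);
        if (strippedNeedle.Length == 0)
        {
            return false; // an empty quote should not count as "found"
        }
        return StripWhitespace(haystack)
            .IndexOf(strippedNeedle, StringComparison.Ordinal) >= 0;
    }
}
```

The check would then become `QuoteMatcher.ContainsIgnoringWhitespace(innerText, quote)`. The trade-off is that word boundaries are ignored too, so "a b" would match "ab", which could produce false positives on short quotes.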