I'm trying to extract a single value from an HTML page.
For example, I have two text boxes (textBox1 and textBox2). I want to enter a Stack Overflow question ID in textBox1 and get that question's "asked" value in textBox2.
If I enter 36 in textBox1, textBox2 should show "9 years, 4 months ago".
WebClient webpage = new WebClient();
String html = webpage.DownloadString("https://stackoverflow.com/questions/" + textBox1.Text);
MatchCollection match = Regex.Matches(html, FILTERHERE, RegexOptions.Singleline);
The problem is that I don't know what pattern to use as the filter (FILTERHERE).
Also, how can I put the result into textBox2?

derloopkat

Leviathan
- Using Regex on HTML is a [bad idea](https://stackoverflow.com/a/1732454/) – Aleks Andreev Dec 16 '17 at 08:45
- Can you do this with your own method? – Leviathan Dec 16 '17 at 08:48
- Consider using XPath or CSS selectors – Aleks Andreev Dec 16 '17 at 08:52
- Can you do this one with XPath for me, please? – Leviathan Dec 16 '17 at 09:16
- @Leviathan, the XPath is `//*[@id='qinfo']//td[./p[@class='label-key' and text()='asked']]/following-sibling::td//b/text()`. In the future, consider using HtmlAgilityPack instead of Regex. – derloopkat Dec 16 '17 at 21:30
2 Answers
With HtmlAgilityPack.
string url = "https://stackoverflow.com/questions/";
var web = new HtmlWeb();
var doc = web.Load(url + textBox1.Text); //the text is "36"
var tag = doc.DocumentNode.SelectSingleNode("//*[@id='qinfo']//td[./p[@class='label-key' and text()='asked']]/following-sibling::td//b");
textBox2.Text = tag.InnerText;
If you don't know XPath, there are browser extensions for Chrome and Firefox that get the XPath of any HTML tag for you (I personally write them by hand to make them less sensitive to changes in the page structure).
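To try the XPath without a network request, here is a minimal sketch that runs it against an inline HTML string. The sample markup is made up to mimic the structure the XPath expects (it is not the real page), and the names `AskedExtractor` and `ExtractAsked` are only illustrative. It also guards against `SelectSingleNode` returning `null` when nothing matches, which the snippet above does not check.

```csharp
using System;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

public static class AskedExtractor
{
    // Same XPath as in the answer above.
    private const string AskedXPath =
        "//*[@id='qinfo']//td[./p[@class='label-key' and text()='asked']]/following-sibling::td//b";

    // Returns the "asked" text, or null when the XPath matches nothing.
    public static string ExtractAsked(string html)
    {
        // Fully qualified, because System.Windows.Forms also defines an HtmlDocument type.
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode tag = doc.DocumentNode.SelectSingleNode(AskedXPath);
        if (tag == null)
            return null;

        // Decode entities such as &amp; and drop surrounding whitespace.
        return HtmlEntity.DeEntitize(tag.InnerText).Trim();
    }

    public static void Main()
    {
        // Hypothetical markup, shaped only to match the XPath above.
        string sample =
            "<table id='qinfo'><tr>" +
            "<td><p class='label-key'>asked</p></td>" +
            "<td><p class='label-key'><b>9 years, 4 months ago</b></p></td>" +
            "</tr></table>";

        Console.WriteLine(ExtractAsked(sample) ?? "not found");
        Console.WriteLine(ExtractAsked("<html><body></body></html>") ?? "not found");
    }
}
```

In the form, the result could then be assigned with something like `textBox2.Text = AskedExtractor.ExtractAsked(html) ?? "not found";`, where `html` is the downloaded page.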

derloopkat
In a Windows Forms application, the `WebBrowser` control can be used; it wraps the mshtml library and exposes a managed HTML DOM. Here is an example of a function that retrieves the `asked` text:
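// Requires: using System.Collections.Generic; using System.Linq; using System.Windows.Forms;
// plus a reference to Microsoft.mshtml. HtmlDocument here is System.Windows.Forms.HtmlDocument.
private static string GetAskedText(HtmlDocument doc)
{
    if (doc == null)
        return "document-null";

    // Unwrap each managed <div> element to its underlying mshtml object.
    IEnumerable<mshtml.HTMLDivElement> divs = doc.GetElementsByTagName("div")
        .OfType<HtmlElement>()
        .Select(e => e.DomElement as mshtml.HTMLDivElement);

    foreach (var div in divs)
    {
        // Only look inside <div class="user-info">.
        if (string.IsNullOrWhiteSpace(div?.className))
            continue;
        if (div.className.Trim().ToLower() != "user-info")
            continue;

        var spans = div.getElementsByTagName("span").OfType<mshtml.HTMLSpanElement>();
        foreach (var span in spans)
        {
            if (string.IsNullOrWhiteSpace(span?.className))
                continue;
            // The first <span class="relativetime"> holds the text we want.
            if (span.className == "relativetime")
            {
                return span.innerText;
            }
        }
    }
    return "not-found";
}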
A complete Windows Forms example can be downloaded from my Dropbox.

Daniel Dušek
- Thanks, but it's a console application. I asked about a Windows Forms application. – Leviathan Dec 16 '17 at 20:35
- Windows Forms? That is just a very small difference. The function `GetAskedText` remains the same. – Daniel Dušek Dec 16 '17 at 20:36
- The Windows Forms version can be downloaded from my Dropbox; see the link in the edited answer. HTH – Daniel Dušek Dec 16 '17 at 21:07