I'm having trouble extracting a specific value from a large string returned from an HttpWebResponse. The response is different each time because the site changes, but I need to extract a single number from the source. Below is a snippet of the response; I need to extract the "9", although this could be a different number each time.
This is only a snippet: the source is 1,300 lines long this time. It might be triple that next time, with the number in a different place. The only constant is that it appears outside all HTML tags.
</div>
<div id="inhoud_content_rechts">
<div id="taalkeuze"><a href="index.php" class="taalkeuze_link_actief">EN</a> | <a href="nl/index.php" class="taalkeuze_link">NL</a> | <a href="fr/index.php" class="taalkeuze_link">FR</a> | <a href="es/index.php" class="taalkeuze_link">ES</a></div>
<div id="print_page"><a href="javascript:window.print();" class="taalkeuze_link">â┼' print this page</a></div> <h1 class="titel">NEWS</h1>
<br />
<h1 class="nieuws_titel">12 | 4</h1>
9
<br /><br />
<a href="news.php" class="content_link">Back to overview â┼'</a>
<br /><br />
</div>
</div>
</div>
I cannot use a regex match on the surrounding markup, as the source changes each time. The only unique identifier I can think of is that the line is outside of the HTML tags, although so are a few other things. I have tried to delete all the HTML tags with:
System.Text.RegularExpressions.Regex regexHTML = new System.Text.RegularExpressions.Regex("<[^>]*>");
text = regexHTML.Replace(text, "");
Although this cuts the text down considerably, some text is still left, for example:
EN | NL | FR | ES
â┼' print this page NEWS
12 | 4
9
Back to overview â┼'
I also tried a couple of other things:
- Converting all HTML to "@", then adding each line to a list and skipping lines which don't contain "@". This was probably the most successful attempt, but the line containing just the number wasn't grab-able: I tried removing all the spaces and using char.IsDigit / char.IsNumber, but they return false.
- Converting the entire string to a char array and cycling through it line by line, looking for a match with char.IsDigit. Same problem as above.
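A minimal sketch of that line-by-line idea, for reference. It is not the exact code from the attempts above: the class and helper names (NumberLineSketch, FindStandaloneNumbers) are hypothetical, and it assumes that lines end in \r or \n and that the number line holds nothing but digits and ordinary whitespace. If the line carries an invisible character that Trim() does not remove, the check would still fail.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class NumberLineSketch
{
    // Hypothetical helper: returns the value of every line that, once the
    // tags are stripped and whitespace is trimmed, is a single integer.
    public static List<int> FindStandaloneNumbers(string html)
    {
        // Same tag-stripping pattern as in the attempt above.
        string text = Regex.Replace(html, "<[^>]*>", "");

        var numbers = new List<int>();
        string[] lines = text.Split(new[] { '\r', '\n' },
                                    StringSplitOptions.RemoveEmptyEntries);

        foreach (string rawLine in lines)
        {
            // Assumption: only ordinary whitespace surrounds the number.
            string line = rawLine.Trim();

            int value;
            // TryParse succeeds only if the whole line is one integer,
            // so "12 | 4" and "Back to overview" are rejected.
            if (int.TryParse(line, out value))
            {
                numbers.Add(value);
            }
        }
        return numbers;
    }
}
```

Returning a list rather than a single value keeps the sketch honest about the case where more than one line happens to hold a bare number.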
Does anybody have any ideas on how I could write something that will extract the number I need? I thought that after deleting all the HTML I could check whether a line contains ONLY a single int, but I had no success with char.IsDigit, char.IsNumber or int.Parse. Here are the edited strings from my previous attempts, in case they're helpful: converting the HTML to "@", and removing all the HTML.
"@" Edit:
@@@@@@@@@@@@â┼' print this page@@@@@@@@@@@@ @@@@@@NEWS@@@@@@
@@@@@@
@@@@@@12 | 4@@@@@@
9
@@@@@@@@@@@@
@@@@@@Back to overview â┼'@@@@@@
@@@@@@@@@@@@
@@@@@@
@@@@@@
Removing all HTML:
EN | NL | FR | ES
â┼' print this page NEWS
12 | 4
9
Back to overview â┼'
TL;DR: Extract a number that always appears outside the HTML tags with no other identifiers; it's on its own line.