I'm having trouble extracting a specific value from a large string returned by an HttpWebResponse. The response differs each time because the site changes, but I need to extract a single number from the source. Here is a snippet of the response; I need to extract the "9", although this could be a different number each time.

These are only snippets; the source is 1,300 lines long this time. It might be triple that next time, with the number in a different place. The only constant is that it appears outside all HTML tags.

                </div>
              <div id="inhoud_content_rechts">
                        <div id="taalkeuze"><a href="index.php" class="taalkeuze_link_actief">EN</a> | <a href="nl/index.php" class="taalkeuze_link">NL</a> | <a href="fr/index.php" class="taalkeuze_link">FR</a> | <a href="es/index.php" class="taalkeuze_link">ES</a></div>

<div id="print_page"><a href="javascript:window.print();" class="taalkeuze_link">â┼' print this page</a></div>                    <h1 class="titel">NEWS</h1>
                    <br />

                    <h1 class="nieuws_titel">12 | 4</h1>
                    9
                    <br /><br />
                    <a href="news.php" class="content_link">Back to overview â┼'</a>
                    <br /><br />
                </div>
            </div>
        </div>

I can't use a regex match to find the number because the source changes each time. The only unique identifier I can think of is that the line is outside of the HTML, although so are a few other things. I have tried deleting all the HTML tags with

System.Text.RegularExpressions.Regex regexHTML = new System.Text.RegularExpressions.Regex("<[^>]*>");
text = regexHTML.Replace(text, "");

Although this cuts down the text considerably, some text is still left, for example:

                        EN | NL | FR | ES

â┼' print this page                    NEWS


                    12 | 4
                    9

                    Back to overview â┼'

I also tried a couple of other things:

  • Converting all HTML to "@", then adding each line to a list and skipping lines which don't contain "@". This was probably the most successful attempt, but I couldn't grab the line containing just the number: I tried removing all the spaces and using isDigit / isNumber, but it returns false.
  • Converting the entire string to a char array and cycling through each line to find isDigit. Same problem as above.

Does anybody have any ideas how I could write something which will extract the number I need? I thought that after deleting all the HTML I could check whether a line contains ONLY a single int, but I had no success with isDigit, isNumber or int.Parse. Here are the edited strings from my previous attempts, in case they're helpful: converting HTML to "@", and removing all HTML.

"@" Edit:

@@@@@@@@@@@@â┼' print this page@@@@@@@@@@@@                    @@@@@@NEWS@@@@@@
                    @@@@@@

                    @@@@@@12 | 4@@@@@@
                    9
                    @@@@@@@@@@@@
                    @@@@@@Back to overview â┼'@@@@@@
                    @@@@@@@@@@@@
                @@@@@@
            @@@@@@

Removing all HTML:

                       EN  |  NL  |  FR  |  ES

  â┼' print this page                       NEWS


                     12 | 4
                    9

                     Back to overview â┼'

TL;DR: Extract a number which always appears outside the HTML with no other identifiers; it's on its own line.

  • 2
    There are specific libraries to deal with HTML text. You should search for "C# HTML Parser Libraries". One of the most famous is [Html Agility Pack](https://html-agility-pack.net/). Also see [Regular expressions to parse Html is a bad idea](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Steve Jul 10 '19 at 12:20
  • 1
    The key to this sort of problem is to work out how *you* identify the text to extract. If it's really any text on its own line, you could use a regex to look for lines that don't include `<` or `>` - or you could use a parser and look for an `h1` with the right class, and return its content. – Robin Bennett Jul 10 '19 at 12:20
  • I'm not dealing with HTML, I'm trying to extract one number which isn't even inside any HTML. I also explained that the regex isn't possible because the source changes frequently and yes it's really on its own... did you both even read the post? Not to be rude but both of what you just said is wrong, irrelevant and already addressed in the OP. I'm not trying to use regex to capture the number, I only used it to remove all HTML tags. The source is over 1,300 lines long this time, it might be triple that next time. There are tons of lines without HTML on them from the top of the document. – Connor Raven Jul 10 '19 at 12:34
  • 1
    If my reply didn't help, it's because the problem isn't clear, not because I didn't read it. Being rude isn't going to help. If there are multiple pieces of text on their own line, how do you know which one you want? – Robin Bennett Jul 10 '19 at 12:46
  • You don't need to answer any more, you're really not helping. I appreciate your effort but I'm tired, I've been working on this project for 26 hours straight and a question like "how do you know which one you want" is just aggravating. I said in the thread which one I want; the reason I want it isn't relevant to the question one bit. I simply needed to extract the 9 and I put the constraints in my question. I even included a method I tried where I already removed all the HTML. You say you read the question, yet your answer is something I already tried and posted in the original post. – Connor Raven Jul 10 '19 at 13:31

1 Answer

What about something like this:

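  // Requires: using System; using System.Linq;
  // Split on both "\r\n" and "\n": Environment.NewLine alone misses "\n"-only input on Windows.
  int? number = html.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries)
    .Select(l =>
    {
      l = l.Trim();
      // Keep only lines that consist of a single digit.
      if (l.Length == 1 && int.TryParse(l, out int num))
        return (int?)num;
      return null;
    }).FirstOrDefault(n => n != null);

  Console.WriteLine(number);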

As I read the question, the number is a single digit, hence the l.Length == 1 check. If it can be any number, you can omit the length check.

This works if the number searched for is on a line of its own.
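
For reference, here is the same line-based idea wrapped in a small helper, without the length check, so multi-digit numbers are accepted too. This is only a sketch: the names NumberFinder and FindStandaloneNumber are made up for illustration, and the sample input is abridged from the question.

```csharp
using System;

public static class NumberFinder
{
    // Hypothetical helper: returns the first line that, once trimmed, parses
    // as an integer, or null when there is no such line.
    public static int? FindStandaloneNumber(string html)
    {
        // Split on both Windows and Unix line endings.
        string[] lines = html.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string line in lines)
        {
            // Lines that still contain tags or other text fail to parse and are skipped.
            if (int.TryParse(line.Trim(), out int value))
                return value;
        }
        return null;
    }

    public static void Main()
    {
        // Sample abridged from the question.
        string sample =
            "<h1 class=\"nieuws_titel\">12 | 4</h1>\n" +
            "                    9\n" +
            "                    <br /><br />\n";

        Console.WriteLine(FindStandaloneNumber(sample)); // prints 9
    }
}
```

Note that it returns the first matching line, so if an earlier line in the page also holds nothing but a number, that one wins.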


An alternative using Regex:

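  // Requires: using System.Text.RegularExpressions;
  // [^>]+ keeps each match inside a single tag; \s* already covers newlines.
  Match match = Regex.Match(html, @"</[^>]+>\s*(?<num>\d+)\s*<[^>]+>");
  if (match.Success)
    Console.WriteLine(match.Groups["num"].Value);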

The pattern finds a number between a closing tag (</xxx>) and an opening tag (<xxx>); any whitespace and/or newlines are allowed in between.

It also works when the number is on the same line as HTML, not only on a line of its own.
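
Since the question mentions that a few other things also sit outside the tags, it may help to list every candidate rather than only the first. The sketch below assumes a variation of the pattern above: it uses [^>]+ for the tags and a lookahead for the following tag, so that tag can also start the next match. The names TagGapNumbers and FindAll are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagGapNumbers
{
    // Hypothetical helper: collects every run of digits that sits between a
    // closing tag and the next tag, with only whitespace around it.
    public static List<int> FindAll(string html)
    {
        var results = new List<int>();
        // (?=...) checks for the following tag without consuming it.
        foreach (Match m in Regex.Matches(html, @"</[^>]+>\s*(?<num>\d+)\s*(?=<[^>]+>)"))
        {
            // TryParse skips digit runs too long to fit in an int.
            if (int.TryParse(m.Groups["num"].Value, out int value))
                results.Add(value);
        }
        return results;
    }

    public static void Main()
    {
        // Sample abridged from the question.
        string sample =
            "<h1 class=\"nieuws_titel\">12 | 4</h1>\n" +
            "                    9\n" +
            "                    <br /><br />\n";

        Console.WriteLine(string.Join(", ", FindAll(sample))); // prints 9
    }
}
```

If this returns more than one number on the real page, some extra rule (position, or the tag that precedes it) is still needed to pick the right one.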