Is there any way to simplify the following code so it looks clearer and more elegant?

The following code returns a collection of values found in a collection of texts, using LINQ and regex:

IEnumerable<double> _results = pages.Select(result => {
    Regex _regex = new Regex("<my regex here>", RegexOptions.None);
    MatchCollection _matches = _regex.Matches(result);
    double _number = 0.0;

    foreach (Match _match in _matches) {
        if (_match.Groups["value"].Value.Contains("("))
            break;
        else
            double.TryParse(_match.Groups["value"].Value, out _number);
    }

    return _number;
});

As you can see, the logic is tricky: it returns the last value found in each text before a condition is met, and that is the desired outcome.

How could the code above be simplified for elegance? Memory and CPU utilization are not a concern.

Miguel Mateo
  • Why not code your regex so it doesn't pick up matches with a bracket in, or uses the presence of the bracket to determine which first prior value to return as a match collection value? This question would benefit from a sample of your raw data with a highlight of what data you want to pluck out of it – Caius Jard Jun 30 '19 at 04:28
  • @CaiusJard believe me, you do not want to see the raw data, it is worse than html, it is basically very cryptic logs generated by servers. But the algorithm is still very valid: the last number found by the regex before a condition is met. – Miguel Mateo Jun 30 '19 at 04:47
  • If i didn't want to see it, i wouldn't have asked ;) – Caius Jard Jun 30 '19 at 05:03
  • If you insist :) ... this is a chunk: reactid="17"/>2,941.7616.84(12.7) – Miguel Mateo Jun 30 '19 at 05:12

3 Answers

If I'm understanding your code properly, I would do it this way. This syntax is valid from C# 7.0 onwards, which allows inline out variable declarations:

Regex _regex = new Regex("<my regex here>", RegexOptions.None);

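// Note: this yields every value without a parenthesis, not only the last one per page.
IEnumerable<double> _results = pages
 .SelectMany(page => _regex.Matches(page).Cast<Match>())
 .Select(match => match.Groups["value"].Value)
 .Where(value => !value.Contains("("))
 .Select(value => double.TryParse(value, out double number) ? number : 0.0);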
Dan D
  • Please note that there could be more than one match before the parenthesis is found; in that case, I need the last one. Your code seems to return all items found before the condition, while I need just the last one. – Miguel Mateo Jun 30 '19 at 04:44
  • I like this idea, I am currently refining it since it does not compile either ... but I see where you are coming from. – Miguel Mateo Jun 30 '19 at 05:14
  • thanks for showing me a path to follow, I have added an answer to this question with the final code based on your suggestion. – Miguel Mateo Jun 30 '19 at 07:48

Notwithstanding bobince's advice about regex and HTML :) here's a regex based solution:

.NET's regex engine can search backwards (RegexOptions.RightToLeft), so we can leverage this and have our regex look for the number between > and < that is nearest to the last bracketed value, using the lazy quantifier .*?:

>(?<v>[,.0-9]+)<.*?\([.0-9]+\)

This is "match and name the number between > < then the shortest amount of any characters, then a number between ( )" - tweak as you need

// Verbatim string (@"...") so that \( and \) are not treated as invalid C# escape sequences
Regex r = new Regex(@">(?<v>[,.0-9]+)<.*?\([.0-9]+\)", RegexOptions.RightToLeft /*other options here*/);
foreach(var p in pages){

  // With RightToLeft, Match(p) starts at the end of p, so a trailing ")" is included
  Match m = r.Match(p);

  MessageBox.Show(m.Groups["v"].Value); //finds 16.84
}
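
As a rough illustration of the right-to-left idea above, here is a minimal console sketch. The input string is hypothetical (the real logs are not shown in the thread), and the class name is made up for the example; only the pattern and RegexOptions.RightToLeft come from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

class RightToLeftDemo
{
    static void Main()
    {
        // Hypothetical input: two numbers wrapped in > <, followed by a bracketed value.
        string page = "<td>2,941.76</td><td>16.84</td><td>(12.7)</td>";

        // Verbatim string so that \( and \) reach the regex engine unchanged.
        var regex = new Regex(@">(?<v>[,.0-9]+)<.*?\([.0-9]+\)", RegexOptions.RightToLeft);

        // With RightToLeft, Match(input) begins scanning at the end of the input
        // and works backwards, so the bracketed value is found first.
        Match m = regex.Match(page);

        Console.WriteLine(m.Success ? m.Groups["v"].Value : "(no match)");
    }
}
```

For this made-up input, group v should capture 16.84, the number closest to the final bracketed value, rather than 2,941.76.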

Caius Jard
  • Why would somebody say that this answer is not useful? I will try it, @Caius, and let you know, using my real regexp. Thanks! – Miguel Mateo Jun 30 '19 at 06:17
  • I followed your suggestion, changed the regexp, and now it returns one value, which further simplifies the LINQ statement. Thanks! – Miguel Mateo Jun 30 '19 at 09:15

Building on @dan-d's answer, this is perhaps the easiest to read and most elegant code:

double[] _results = _pages
    .Select(page => _regex.Matches(page).Cast<Match>().Select(match => match.Groups["value"].Value))
    // LastOrDefault avoids an exception when a page has no value before the condition
    .Select(values => values.TakeWhile(value => !value.Contains("(")).LastOrDefault())
    .Select(number => double.TryParse(number, out double _result) ? _result : 0.0)
    .ToArray();

The first select iterates through all the data pages and returns, for each page, the sequence of values found by the regular expression. The second select finds, for each page, the last value right before the condition is met (that is, before a value containing a parenthesis). The final select parses the results, producing an array of doubles.
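
To make the three steps concrete, here is a self-contained sketch that can be run as a console program. The pattern, the sample pages, and the invariant-culture parsing are all assumptions made for illustration; the real regex and log data are not shown in this thread.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

class LastValueBeforeParenthesis
{
    static void Main()
    {
        // Hypothetical pattern: a number, optionally wrapped in parentheses, named "value".
        var regex = new Regex(@"(?<value>\(?[0-9][0-9,.]*\)?)");

        // Hypothetical pages: one normal case, one starting with a bracketed value, one with no numbers.
        string[] pages = { "a 2,941.76 b 16.84 c (12.7) d 99", "(1.0) 5", "no numbers" };

        double[] results = pages
            // Step 1: all captured values per page.
            .Select(page => regex.Matches(page).Cast<Match>().Select(m => m.Groups["value"].Value))
            // Step 2: last value before the first one containing "(" (null if there is none).
            .Select(values => values.TakeWhile(v => !v.Contains("(")).LastOrDefault())
            // Step 3: parse; TryParse returns false for null, so such pages fall back to 0.0.
            .Select(s => double.TryParse(s, NumberStyles.Float | NumberStyles.AllowThousands,
                                         CultureInfo.InvariantCulture, out double d) ? d : 0.0)
            .ToArray();

        Console.WriteLine(string.Join(", ", results.Select(r => r.ToString(CultureInfo.InvariantCulture))));
    }
}
```

With these sample pages the program should print 16.84, 0, 0. The fallback to 0.0 mirrors the behaviour of the loop in the question, where _number stays at 0.0 when nothing is parsed.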

Finally, after following the suggestion of @caius-jard and improving the regexp, it now returns a single value per page, which further simplifies the LINQ statement to the following:

double[] _results = _pages
    // FirstOrDefault avoids an exception when a page has no match
    .Select(page => _regex.Matches(page).Cast<Match>().Select(match => match.Groups["value"].Value).FirstOrDefault())
    .Select(number => double.TryParse(number, out double _result) ? _result : 0.0)
    .ToArray();
Miguel Mateo