I have web server data collected in a string, which I split into a string array using Regex to get a more readable format.

string[] liness = Regex.Split(html, "\r\n");

The data inside liness now looks like this:

    <html><head><title>137.55.124.65 - /</title></head><body><H1>137.55.124.65 - /</H1><hr>

   Thursday, June 7, 2018  6:27 PM        &lt;dir&gt; <A HREF="/2.5.25557/">2.5.25557</A>
      Thursday, June 14, 2018  5:25 PM        &lt;dir&gt; <A HREF="/2.5.25569/">2.5.25569</A>
     Wednesday, June 20, 2018  8:34 AM        &lt;dir&gt; <A HREF="/2.5.25578/">2.5.25578</A>
     Wednesday, June 20, 2018  5:33 PM        &lt;dir&gt; <A HREF="/2.5.25580/">2.5.25580</A>
       Tuesday, June 26, 2018  8:36 AM        &lt;dir&gt; <A HREF="/2.5.25581/">2.5.25581</A>
        Friday, June 29, 2018  8:36 AM        &lt;dir&gt; <A HREF="/2.5.25582/">2.5.25582</A>
        Tuesday, July 3, 2018  8:35 AM        &lt;dir&gt; <A HREF="/2.5.25584/">2.5.25584</A>
       Thursday, July 5, 2018  8:35 AM        &lt;dir&gt; <A HREF="/2.5.25586/">2.5.25586</A>
        Monday, July 16, 2018  8:33 AM        &lt;dir&gt; <A HREF="/2.5.25587/">2.5.25587</A>
        Tuesday, May 29, 2018  8:30 PM          696 <A HREF="/iisstart.htm">iisstart.htm</A>
        Tuesday, May 29, 2018  8:30 PM        98757 <A HREF="/iisstart.png">iisstart.png</A>
 Wednesday, November 19, 2014  3:41 PM          214 <A HREF="/index.html">index.html</A>

What is a better way to extract only the values that start with 2.*.**** (e.g. 2.5.8827)? Note that each line also contains the same value in HREF="/2.5.25425/">, which is a duplicate. I want to parse all of those values into a list and then, which is the tricky part, get the highest version number (a single value).

e.g. 2.5.1000, 2.5.1001, 2.5.1002, 2.5.1003

The highest version in the example list above is 2.5.1003.

I have tried this using regex:

List<string> versionvalue = new List<string>();
string pattern = "2.";
foreach (String l_html in liness)
{
    string[] substrings = Regex.m(l_html, pattern);
    //versionvalue.Add(substrings[]);
    if ((l_html.Contains("2.")) && (l_html.Contains(currentYear.ToString())))
    {

    }
}

But it looks very messy and did not find any of the values I was looking for. Will Regex.Matches work? All help appreciated!

Anil Gadiyar
  • IMO, parsing HTML is a [dupe](https://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c). And finding the max version from `new List {"2.5.1000", "2.5 1001", "2.5.1002", "2.5.1003" }` is its own question, which is solved pretty quickly if the versions have the same number of parts. Put the version in an object and overload the comparer. – Drag and Drop Jul 17 '18 at 06:43
  • We need to know what you have tried, and a specific error to work with. You will get hit with the "Stack Overflow is not a code writing service" comment. Your input is clear, the expected output somewhat clear. Is 3.5.1001 higher than 2.5.1004? Load the content into an XML document, then select the "HR" node, run through the child nodes and get the inner XML. Depending on your answer to the previous question, make a sorter and pick the top sorted item. That should fix it, no? – Morten Bork Jul 17 '18 at 06:46
  • As Rafalon pointed out, getting your value could be tricky, but look for all A links, then filter them based on Href = inner HTML and you should be good. Add the `/` if needed. – Drag and Drop Jul 17 '18 at 06:46
  • @MortenBork I have tried a LINQ query, but it got sort of messed up and I did not post it because I wanted to keep the question simple and understandable. Yes, version 3.5.1001 is higher than 2.5.1004, but that is a question for some other day. A few concrete ideas are all I am asking for. – Anil Gadiyar Jul 17 '18 at 06:56
  • @DragandDrop good suggestion. – Anil Gadiyar Jul 17 '18 at 06:59
  • Careful: a regex based on a "number and dot" combination can capture the IP or anything else in the HTML. It really depends on the variance of your HTML. A simple `(\d\.){2}\d{4,}` could do the trick. Escape the dot, or it means any character in a regex. – Drag and Drop Jul 17 '18 at 07:19
  • [You shouldn't really use regex to parse html](https://stackoverflow.com/a/1732454/106159) – Matthew Watson Jul 17 '18 at 07:42
  • @MatthewWatson any other suggestion then? – Anil Gadiyar Jul 17 '18 at 08:17
  • An XML parser could work. – Matthew Watson Jul 17 '18 at 08:20
  • @MatthewWatson any examples I can refer to? – Anil Gadiyar Jul 17 '18 at 08:37
  • Actually, XML parsing will only work on a subset of HTML. You might have better luck using [the Html Agility Pack](http://html-agility-pack.net/). Also see [the answers to this question](https://stackoverflow.com/questions/846994/how-to-use-html-agility-pack) for more information. – Matthew Watson Jul 17 '18 at 08:43
  • @MatthewWatson "'HtmlAgilityPack' already has a dependency defined for 'System.Net.Http'": this error pops up when installing via NuGet. – Anil Gadiyar Jul 17 '18 at 09:07
  • Did you install it via Visual Studio? `Project | Manage NuGet Packages` then click *Browse* and search for *Html Agility Pack*, select it and click *Install*. – Matthew Watson Jul 17 '18 at 09:46

1 Answer

The regex pattern you are looking for is `<A HREF="\/(\d\.\d\.\d{5})\/">`, i.e. it captures a single digit, a dot, a single digit, a dot, and five digits, all inside an `<A HREF="">`. Regex 101 for this pattern.
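
The extraction step could be sketched roughly as follows, assuming the whole listing is available as one string, like `html` in the question. The names `VersionExtractor` and `ExtractVersions` are made up for illustration. The pattern is the one above, written without the escaped slashes (which .NET does not need), and `RegexOptions.IgnoreCase` is an added assumption so that a lowercase `<a href=...>` would match too.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class VersionExtractor
{
    // One digit, dot, one digit, dot, five digits, inside <A HREF="/.../">.
    private static readonly Regex VersionLink =
        new Regex(@"<A HREF=""/(\d\.\d\.\d{5})/"">", RegexOptions.IgnoreCase);

    // Returns the captured version strings in the order they appear in the input.
    public static List<string> ExtractVersions(string html)
    {
        return VersionLink.Matches(html)
            .Cast<Match>()                   // MatchCollection is non-generic on older frameworks
            .Select(m => m.Groups[1].Value)  // group 1 is the version without the slashes
            .ToList();
    }
}
```

Calling `VersionExtractor.ExtractVersions(html)` on the listing from the question should return only the directory entries; links such as `/iisstart.htm` and the IP address in the title do not fit the pattern.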

After you have extracted these strings, parse them into a VersionNumber class. This class implements comparison through the IComparable interface, which ensures that VersionNumbers are sorted correctly with OrderBy.

using System;
using System.Linq; // needed for Select/OrderBy in the usage example

public class VersionNumber : IComparable {

    public int Major    { get; set; }
    public int Minor    { get; set; }
    public int Revision { get; set; }

    // Converts string to VersionNumber object
    public static VersionNumber Parse(string s) {
        if (string.IsNullOrWhiteSpace(s)) {
            throw new ArgumentNullException(nameof(s));
        }

        var parts = s.Split('.');
        if (parts.Length != 3) {
            throw new ArgumentException("Input string must be in format 'X.Y.ZZZZZ'.");
        }

        var result = new VersionNumber();
        try {
            result.Major    = int.Parse(parts[0]);
            result.Minor    = int.Parse(parts[1]);
            result.Revision = int.Parse(parts[2]);
        }
        catch (FormatException) {
            throw new ArgumentException("Input string must be in format 'X.Y.ZZZZZ', with X, Y, Z integers.");
        }

        return result;
    }

    // Compares two VersionNumbers
    public int CompareTo(object obj) {
        if (obj == null) return 1;

        VersionNumber otherVersion = obj as VersionNumber;
        if (otherVersion == null) {
            throw new ArgumentException($"Object is not a {nameof(VersionNumber)}.");
        }

        // start comparison with Major Version, then Minor, then Revision
        var result = Major.CompareTo(otherVersion.Major);
        if (result == 0) {
            result = Minor.CompareTo(otherVersion.Minor);
        }
        if (result == 0) {
            result = Revision.CompareTo(otherVersion.Revision);
        }
        return result;
    }

    public override string ToString() {
         return Major + "." + Minor + "." + Revision;
    }
}

See also this .NET Fiddle with example usage:

string[] versionStrings = new [] {"3.5.25569", "2.5.25557", "2.5.25580", "2.5.25569", "2.4.25569"};
// parsing
IEnumerable<VersionNumber> versions = versionStrings.Select(s => VersionNumber.Parse(s));
// sorting
IOrderedEnumerable<VersionNumber> sorted = versions.OrderBy(v => v);
// sorted: 2.4.25569, 2.5.25557, 2.5.25569, 2.5.25580, 3.5.25569
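
If only the single highest version is needed, sorting the whole sequence is not strictly necessary. As an alternative sketch, the built-in `System.Version` class can stand in for the custom `VersionNumber` (this is a swapped-in technique, not what the answer above uses): it parses dotted strings and is comparable, so LINQ's `Max()` can pick the largest value. `HighestVersion.Find` is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the original answer.
public static class HighestVersion
{
    // Parses each string as a System.Version (major.minor.build) and returns the largest one.
    // Version.Parse throws if an entry is not a valid version string.
    public static Version Find(IEnumerable<string> versionStrings)
    {
        return versionStrings
            .Select(s => Version.Parse(s))
            .Max();
    }
}
```

With the `versionStrings` array from the example above, `HighestVersion.Find(versionStrings).ToString()` gives `3.5.25569`.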
Georg Patscheider
  • Well-written code, but I may need some clarification. 1. Do I use regex to get all the version numbers and put them into a list or string array? 2. Parse expects a single string, but I have a string array and need to compare all array elements with each other to get the highest version. – Anil Gadiyar Jul 17 '18 at 08:32
  • 1. Yes, you use the regex on every line, access the capture group and put these captures into a `string[] extractedStrings`. 2. You can convert the captured strings to VersionNumbers using LINQ: `VersionNumber[] versions = extractedStrings.Select(s => VersionNumber.Parse(s)).ToArray();` Then sort them: `var sortedVersions = versions.OrderBy(v => v);` – Georg Patscheider Jul 17 '18 at 09:05
  • What about the compare method? What obj are you passing there? – Anil Gadiyar Jul 17 '18 at 14:35
  • The compare method is used by the [`OrderBy`](https://learn.microsoft.com/en-us/dotnet/api/system.linq.queryable.orderby?view=netframework-4.7.1#System_Linq_Queryable_OrderBy__2_System_Linq_IQueryable___0__System_Linq_Expressions_Expression_System_Func___0___1___). Here is a [C# fiddle](https://dotnetfiddle.net/j6SAGs) of the usage. I also fixed the `Parse` method. – Georg Patscheider Jul 18 '18 at 07:10
  • Always glad to help :) – Georg Patscheider Jul 18 '18 at 08:24