2

I would like to sort efficiently some data. Suppose that I have this string list:

var somedata = new List<string> {"<some data>",
    "  <some data else>",
    "</some data>",
    "<Animal>",
    "  <Animal Name=\"Lion\" Group=\"Feline\" Colour=\"Brown\" />",
    "  <Animal Name=\"Zebra\" Group=\"Equidae\" Colour=\"Black and White\" />",
    "  <Animal Name=\"Tuna\" Group=\"Fish\" Colour=\"Gray\" />",
    "  <Animal Name=\"Horse1\" Group=\"Equidae\" Colour=\"Black\" />",
    "  <Animal Name=\"Horse10\" Group=\"Equidae\" Colour=\"White\" />",
    "  <Animal Name=\"Horse2\" Group=\"Equidae\" Colour=\"White\" />",
    "  <Animal Name=\"Dog20\" Group=\"Canidae\" Colour=\"Black\" />",
    "  <Animal Name=\"Dog2\" Group=\"Canidae\" Colour=\"Black\" />",
    "  <Animal Name=\"Cat\" Group=\"Feline\" Colour=\"Various\" />",
    "  <Animal Name=\"Falcon\" Group=\"Bird 1\" Colour=\"Brown\" />",
    "  <Animal Name=\"Duck\" Group=\"Bird 2\" Colour=\"White\" />",
    "  <Animal Name=\"Eagle\" Group=\"Bird 1\" Colour=\"Brown\" />",
    "  <Animal Name=\"Shark\" Group=\"Fish\" Colour=\"Gray\" />",
    "  <Animal Name=\"Mouse\" Group=\"Rodent\" Colour=\"Brown\" />",
    "</Animal>",
    "<some other data>"
    "  <bla bla bla>"
    "</some other data>"};

Actually, I sort them in this simple way

somedata.Sort();

and then I copied them in a new list iterating the list of the Group= (Feline, Equidae, Fish, Feline, Bird 1, Bird 2....) parameter because I would divide the list per group type. This iteration copy also the "other data" without AnimalName in a second new list that then I merge between using the command templist.Add(line) between the markers "<Animal>" and "</Animal>". (No double strings will be there because I copy all without the data to sort that are copied later).

Ok, we can say the way that I followed works at 99% (I have number problem) but I would like to use directly Sort(). How can I create a custom sorting filter for the parameter <Animal Name= and Group= leaving there any strings without <Animal Name=?

By the way, when I sort in my current way, I see a problem. The data with numbers don't follow a grammar order rule, so I get Horse1, Horse10 and Horse2 instead as I would like to get: Horse1, Horse2 and Horse10.

How can I fix this last problem?

Thanks in advance

The output is something like this:

<some data>
  <some data else>
</some data>
<Animal>,
  <Animal Name="Eagle" Group="Bird 1" Colour="Brown" />
  <Animal Name="Falcon" Group="Bird 1" Colour="Brown" />
  <Animal Name="Duck" Group="Bird 2" Colour="White" />
  <Animal Name="Dog2" Group="Canidae" Colour="Black" />
  <Animal Name="Dog20" Group="Canidae" Colour="Black" />
  <Animal Name="Horse1" Group="Equidae" Colour="Black" />
  <Animal Name="Horse2" Group="Equidae" Colour="White" />
  <Animal Name="Horse10" Group="Equidae" Colour="White" />
  <Animal Name="Zebra" Group="Equidae" Colour="Black and White" />
  <Animal Name="Cat" Group="Feline" Colour="Various" />
  <Animal Name="Lion" Group="Feline" Colour="Brown" />
  <Animal Name="Shark" Group="Fish" Colour="Gray" /> 
  <Animal Name="Tuna" Group="Fish" Colour="Gray"  />
  <Animal Name="Mouse" Group="Rodent" Colour="Brown" />
</Animal> />
<some other data>
  <bla bla bla>
</some other data>
  • 3
    It seems your input data is _nearly_ valid XML. Is it possible to work with the XML directly and use an XML parser rather than individual string lines? – gunr2171 Apr 11 '23 at 14:02
  • 1
    If you cast your strings to valid xml, you can use XElement to convert them to an object and extract attribute values for sorting manipulations. Another way is to use regex. – Maxim Apr 11 '23 at 14:04
  • 2
    @Maxim I'd _never_ recommend using regex to parse XML/HTML, as it has been [famously documented](https://stackoverflow.com/a/1732454/1043380) – gunr2171 Apr 11 '23 at 14:05
  • 1
    Whilst this is a trivial example, your post would be improved by providing the specific output you are requesting – Chris Schaller Apr 11 '23 at 14:06
  • 1
    @gunr2171 Yes, in effect it's an XML file. At the beginning, I should add some data only, and I follow this way that at this time it's tricky for sorting, I will follow this way – orecchione bruno Apr 11 '23 at 14:47
  • @Chris Schaller I added the output that I would like to get – orecchione bruno Apr 11 '23 at 15:06

2 Answers2

3

To sort horse10 after horse2, see natural sort order in c#. You can use the same sorting the windows uses by calling a native function, and wrapping this in a IComparer:

[DllImport("shlwapi.dll", CharSet = CharSet.Unicode)]
private static extern int StrCmpLogicalW(string psz1, string psz2);

However, trying to sort strings like this without actually parsing them seem very fragile. Since the data seem to be kind of XML I would suggest using some type of xml parsing to convert it into a list of objects. You can then use LINQ .OrderBy(...) and .ThenBy(...) if you want to sort the objects according to multiple properties.

JonasH
  • 28,608
  • 2
  • 10
  • 23
  • 1
    You are right, it's an XML file, I will follow your suggestion, to order the XML – orecchione bruno Apr 11 '23 at 14:47
  • 1
    Your answer on importing a built in sort order is from 2009....surely .net has implemented something comparable since then, right? It just sounds wild to need to import an *OS specific* algo in .NET which is now cross platform – Narish Apr 11 '23 at 15:28
  • Full agreement that parsing the XML is the way to go though – Narish Apr 11 '23 at 15:29
  • @Narish As far as I know there is no natural sorting built into any version of .net. But there are libraries available if you prefer that. – JonasH Apr 12 '23 at 06:30
2

From a purely sorting perspective, the problem with this request is that you are asking to sort only a sub-set of the items in the list by multiple dimensions but wish to maintain the general position of that sub-set within the larger set.

We can't achieve this in a deterministic way with Array.Sort() or with a custom IComparable implementation because the external set needs to maintain it's original arbitrary sequence and that sequence information is not contained within the string value itself.

But there are ways we can do this with LINQ.

I would like to sort efficiently

Do you mean you would like the least amount of code, or the least amount of iterations or the least time to execute or the least amount of memory utilisation?

  • Processing large datasets in this way can easily consume a large amount of resources.

Recognising that this list of strings together forms a number of XML elements means that we could potentially deserialize into objects, re-process the data and then re-serialize back into a list of strings. This is most likely the least efficient in terms of code and time, but will produce the most reliable results and the logic will be a lot simpler to understand. The memory consumption should be manageable... as long as you can guarantee the input was valid XML fragments, this one (at the time of writing) IS NOT

Your initial attempt is not a bad one, but it will result there is however a technique that doesn't involve maintaining separate isolated lists of records. We can classify (categorize) the data into grouping sets so we can sort on the indexes of the groups or values within those sets without actually storing them or modifying the original string.

  • The techniques here are commonly used in SQL Gap and Island analysis queries, that is essentially what this type of problem is.

First, lets deal with the numeric sorting. I find the easiest solution is to left pad all numeric values with zeros to a fixed length, the following method accepts a string and the number of digits that any number found in the string will be padded to.

This logic uses the classification technique to identify the numbers from the text, we determine the groups to be characters that are either all digits or or all letters (well not digits) and assign each group an index. Then we can use a LINQ group expression to group on these indexes and left pad the numeric groups.

Use this solution instead of the PInvoke in this answer: https://stackoverflow.com/a/75986862/1690217

/// <summary> Pad numeric portions of a string to allow natural lexicographic sorting to behave like numerical sorting for the numeric components</summary>
/// <remarks> 
/// using the same classification technique because we can. 
/// <para>- It has not been tested for performance.</para>
/// Sorry about the name, couldn't help myself ;)
/// </remarks>
/// <param name="value">The string value to pad the numbers in</param>
/// <param name="size">The number of digits to pad the numeric values out to, by default 3 digits</param>
/// <returns> alphanumeric string that so that "Bird 2" becomes "Bird 02", to allow "Bird 2" to be sorted before "Bird 10"<returns>
public static string LexicographicToAlphaNumeric(string value, int size = 3)
{
    // exit if there is nothing to process.
    if (String.IsNullOrWhiteSpace(value)) return value;
    
    int groupId = 0;
    bool isNumberGroup = false;
    return String.Join("", value.Select(c => {
            var isNumber = Char.IsDigit(c); // NOTE: will not support decimals
            if( isNumber != isNumberGroup)
            {
                groupId ++;
                isNumberGroup = isNumber;
            }
            return new { c, isNumber, groupId };
        })
        .GroupBy(c => c.groupId)
        .Select(g => g.First().isNumber 
            ? new string(g.Select(x => x.c).ToArray()).PadLeft(size, '0') 
            : new string(g.Select(x => x.c).ToArray())
        )
    );
}

You can test this here:

    var tests = new string [] {
        "Bird 10",
        "Bird 2",
        "Bird 1",
        "20Lions",
        "5Lions",
        "red10blue",
        "Red9blue",
        "red100Blue"
    };
    foreach(var x in tests.Select(x => LexicographicToAlphaNumeric(x)).OrderBy(x => x))
        Console.WriteLine(x);

Produces this output:

005Lions
020Lions
Bird 001
Bird 002
Bird 010
Red009blue
red010blue
red100Blue

Back to the original problem...

So the sort order needs to be this:

  1. Follow the original overall sequence for all items that do NOT start with <Animal Name=
  2. Sort by Group= attribute value within the Animal items
  3. Sort by the Name= attribute within the same Group values
    • after adjusting for numeric suffixes

We approach this in the same way, use a LINQ projection (Select) to classify our grouping sets, then use LINQ sort (OrderBy) over the groups to achieve the desired sort hierarchy: https://dotnetfiddle.net/707YIW

    // Helpers for the top level grouping set
    int setId = 0;
    string sortSetPrefix = "<Animal Name=";
    bool isInSortSet = false;
    
    // Regex Helpers for the attribute grouping sets
    var groupRegex = new System.Text.RegularExpressions.Regex(@"Group=\""([^\""]*)\""");
    var nameRegex = new System.Text.RegularExpressions.Regex(@"Name=\""([^\""]*)\""");
    
    var sortedData = somedata.Select((line, index) => 
                    {
                        var lineInSortSet = line.Trim().StartsWith(sortSetPrefix);
                        if (lineInSortSet != isInSortSet) 
                        {
                            setId ++;
                            isInSortSet = lineInSortSet;
                        }

                        // NOTE: Match.Value is NOT the captured value, but the whole value: Group="Equidae"
                        //       This will produce the correct results for less code/effort 
                        var groupName = LexicographicToAlphaNumeric(groupRegex.Match(line).Value);
                        var animalName = LexicographicToAlphaNumeric(nameRegex.Match(line).Value);
                                    
                        return new { line, index, setId, groupName, animalName };
                    })
                 .ToList() // Optional, cache the results to reduce re-entry of the above logic during sorting.
                 .OrderBy(x => x.setId).ThenBy(x => x.groupName).ThenBy(x => x.animalName).ThenBy(x => x.index)
                 .Select(x => x.line) // discard the classifications
                 .ToList(); // Deliberate evaluation of the sort expression here prevent re-evaluation later

For those who look closely, or are new to LINQ, you can see that I have used an overload of Select to expose the index from the underlying enumerator. This is how we can access the overall sequence without deliberately iterating though the whole set first.

This solves the problem that Array.Sort and IComparable can not, if we didn't use IEnumerable.Select in this way then we would have to use a for loop to iterate through the items...

The following output includes the classification data so you can see how it works:

Test Categorized Sorting output:
0,,,000:                                          <some data>
0,,,001:                                            <some data else>
0,,,002:                                          </some data>
0,,,003:                                          <Animal>
1,Group="Bird 001",Name="Eagle",015:                <Animal Name="Eagle" Group="Bird 1" Colour="Brown" />
1,Group="Bird 001",Name="Falcon",013:               <Animal Name="Falcon" Group="Bird 1" Colour="Brown" />
1,Group="Bird 002",Name="Duck",014:                 <Animal Name="Duck" Group="Bird 2" Colour="White" />
1,Group="Canidae",Name="Dog002",011:                <Animal Name="Dog2" Group="Canidae" Colour="Black" />
1,Group="Canidae",Name="Dog020",010:                <Animal Name="Dog20" Group="Canidae" Colour="Black" />
1,Group="Equidae",Name="Horse001",007:              <Animal Name="Horse1" Group="Equidae" Colour="Black" />
1,Group="Equidae",Name="Horse002",009:              <Animal Name="Horse2" Group="Equidae" Colour="White" />
1,Group="Equidae",Name="Horse010",008:              <Animal Name="Horse10" Group="Equidae" Colour="White" />
1,Group="Equidae",Name="Zebra",005:                 <Animal Name="Zebra" Group="Equidae" Colour="Black and White" />
1,Group="Feline",Name="Cat",012:                    <Animal Name="Cat" Group="Feline" Colour="Various" />
1,Group="Feline",Name="Lion",004:                   <Animal Name="Lion" Group="Feline" Colour="Brown" />
1,Group="Fish",Name="Shark",016:                    <Animal Name="Shark" Group="Fish" Colour="Gray" />
1,Group="Fish",Name="Tuna",006:                     <Animal Name="Tuna" Group="Fish" Colour="Gray" />
1,Group="Rodent",Name="Mouse",017:                  <Animal Name="Mouse" Group="Rodent" Colour="Brown" />
2,,,018:                                          </Animal>
2,,,019:                                          <some other data>
2,,,020:                                            <bla bla bla>
2,,,021:                                          </some other data>

This sorting logic is so unique to this particular input that there is little value in trying to package the sort logic into an IComparable implementation, even if you could. The IEnumerable logic can however be abstracted into a custom Enumeration which will look pretty, but it doesn't make the code any easier to maintain. Instead it hides the logic from the user and this can have a negative impact on your team's ability to understand and maintain this logic in the long term: https://dotnetfiddle.net/lxI16F

public static IEnumerable<string> SortAnimalsWithoutDisturbingTheOtherLines(IEnumerable<string> lines)
{
    // Helpers for the top level grouping set
    int setId = 0;
    string sortSetPrefix = "<Animal Name=";
    bool isInSortSet = false;
    
    // Regex Helpers for the attribute grouping sets
    var groupRegex = new System.Text.RegularExpressions.Regex(@"Group=\""([^\""]*)\""");
    var nameRegex = new System.Text.RegularExpressions.Regex(@"Name=\""([^\""]*)\""");
    
    foreach(var line in lines.Select((line, index) => 
                                {
                                    var lineInSortSet = line.Trim().StartsWith(sortSetPrefix);
                                    if (lineInSortSet != isInSortSet) 
                                    {
                                        setId ++;
                                        isInSortSet = lineInSortSet;
                                    }

                                    // NOTE: Match.Value is NOT the captured value, but the whole value: Group="Equidae"
                                    //       This will produce the correct results for less code/effort 
                                    var groupName = LexicographicToAlphaNumeric(groupRegex.Match(line).Value);
                                    var animalName = LexicographicToAlphaNumeric(nameRegex.Match(line).Value);
                                    
                                    return new { line, index, setId, groupName, animalName };
                                })
                             .ToList() // Cache the results to reduce re-entry of the above logic during sorting.
                             .OrderBy(x => x.setId).ThenBy(x => x.groupName).ThenBy(x => x.animalName).ThenBy(x => x.index)
                             .Select(x => x.line) 
                             )
            yield return line;
}

Usage, close to .Sort() :

var sortedData = SortAnimalsWithoutDisturbingTheOtherLines(somedata).ToList(); // ToList() => Deliberate evaluation of the sort expression here prevent re-evaluation 

Notes:

  • It doesn't make sense to expose this logic as an extension of a List<string> because the implementation is too specific. You might do this for some sort of business domain type, but not a list of strings, that would be an anti-pattern.

  • If you need to support numbers larger than 3 digits, then either change the default size implementation of the abhorrently named LexicographicToAlphaNumeric or provide a value to the size parameter

     var animalName = LexicographicToAlphaNumeric(nameRegex.Match(line).Value, 7);
    

    We can't know inside the function what maximum length is appropriate without first parsing ALL of the possible values, which would not be as efficient.

Chris Schaller
  • 13,704
  • 3
  • 43
  • 81
  • thank you for your example. it looks great but I see that if I replace this data, it doesn't work correctly with more digits: `"Bird 10000" "Bird 2000" "Bird 1000"` Anyway, it's clear the example, thank you! – orecchione bruno Apr 12 '23 at 12:23
  • 1
    I can only code to the example parameters you have posted, however I did leave a parameter in there for you to control the length of the number string. I have added some notes and tried to explain why `.Sort()` is not only unachievable, but undesirable... To achieve it would be incredibly inefficient, which conflicts with what was asked. I have also updated the fiddle to show usage of the `size` parameter https://dotnetfiddle.net/Wr7xeq – Chris Schaller Apr 12 '23 at 14:42
  • yes I saw it, I tried your solution and I like it, thank you – orecchione bruno Apr 12 '23 at 15:58