From a purely sorting perspective, the problem with this request is that you are asking to sort only a sub-set of the items in the list by multiple dimensions but wish to maintain the general position of that sub-set within the larger set.
We can't achieve this in a deterministic way with Array.Sort()
or with a custom IComparable
implementation because the external set needs to maintain it's original arbitrary sequence and that sequence information is not contained within the string value itself.
But there are ways we can do this with LINQ.
I would like to sort efficiently
Do you mean you would like the least amount of code, or the least amount of iterations or the least time to execute or the least amount of memory utilisation?
- Processing large datasets in this way can easily consume a large amount of resources.
Recognising that this list of strings together forms a number of XML elements means that we could potentially deserialize into objects, re-process the data and then re-serialize back into a list of strings. This is most likely the least efficient in terms of code and time, but will produce the most reliable results and the logic will be a lot simpler to understand. The memory consumption should be manageable... as long as you can guarantee the input was valid XML fragments, this one (at the time of writing) IS NOT
Your initial attempt is not a bad one, but it will result there is however a technique that doesn't involve maintaining separate isolated lists of records. We can classify (categorize) the data into grouping sets so we can sort on the indexes of the groups or values within those sets without actually storing them or modifying the original string.
- The techniques here are commonly used in SQL Gap and Island analysis queries, that is essentially what this type of problem is.
First, lets deal with the numeric sorting. I find the easiest solution is to left pad all numeric values with zeros to a fixed length, the following method accepts a string and the number of digits that any number found in the string will be padded to.
This logic uses the classification technique to identify the numbers from the text, we determine the groups to be characters that are either all digits or or all letters (well not digits) and assign each group an index. Then we can use a LINQ group expression to group on these indexes and left pad the numeric groups.
Use this solution instead of the PInvoke in this answer: https://stackoverflow.com/a/75986862/1690217
/// <summary> Pad numeric portions of a string to allow natural lexicographic sorting to behave like numerical sorting for the numeric components</summary>
/// <remarks>
/// using the same classification technique because we can.
/// <para>- It has not been tested for performance.</para>
/// Sorry about the name, couldn't help myself ;)
/// </remarks>
/// <param name="value">The string value to pad the numbers in</param>
/// <param name="size">The number of digits to pad the numeric values out to, by default 3 digits</param>
/// <returns> alphanumeric string that so that "Bird 2" becomes "Bird 02", to allow "Bird 2" to be sorted before "Bird 10"<returns>
public static string LexicographicToAlphaNumeric(string value, int size = 3)
{
// exit if there is nothing to process.
if (String.IsNullOrWhiteSpace(value)) return value;
int groupId = 0;
bool isNumberGroup = false;
return String.Join("", value.Select(c => {
var isNumber = Char.IsDigit(c); // NOTE: will not support decimals
if( isNumber != isNumberGroup)
{
groupId ++;
isNumberGroup = isNumber;
}
return new { c, isNumber, groupId };
})
.GroupBy(c => c.groupId)
.Select(g => g.First().isNumber
? new string(g.Select(x => x.c).ToArray()).PadLeft(size, '0')
: new string(g.Select(x => x.c).ToArray())
)
);
}
You can test this here:
var tests = new string [] {
"Bird 10",
"Bird 2",
"Bird 1",
"20Lions",
"5Lions",
"red10blue",
"Red9blue",
"red100Blue"
};
foreach(var x in tests.Select(x => LexicographicToAlphaNumeric(x)).OrderBy(x => x))
Console.WriteLine(x);
Produces this output:
005Lions
020Lions
Bird 001
Bird 002
Bird 010
Red009blue
red010blue
red100Blue
Back to the original problem...
So the sort order needs to be this:
- Follow the original overall sequence for all items that do NOT start with
<Animal Name=
- Sort by
Group=
attribute value within the Animal items
- Sort by the
Name=
attribute within the same Group values
- after adjusting for numeric suffixes
We approach this in the same way, use a LINQ projection (Select) to classify our grouping sets, then use LINQ sort (OrderBy) over the groups to achieve the desired sort hierarchy: https://dotnetfiddle.net/707YIW
// Helpers for the top level grouping set
int setId = 0;
string sortSetPrefix = "<Animal Name=";
bool isInSortSet = false;
// Regex Helpers for the attribute grouping sets
var groupRegex = new System.Text.RegularExpressions.Regex(@"Group=\""([^\""]*)\""");
var nameRegex = new System.Text.RegularExpressions.Regex(@"Name=\""([^\""]*)\""");
var sortedData = somedata.Select((line, index) =>
{
var lineInSortSet = line.Trim().StartsWith(sortSetPrefix);
if (lineInSortSet != isInSortSet)
{
setId ++;
isInSortSet = lineInSortSet;
}
// NOTE: Match.Value is NOT the captured value, but the whole value: Group="Equidae"
// This will produce the correct results for less code/effort
var groupName = LexicographicToAlphaNumeric(groupRegex.Match(line).Value);
var animalName = LexicographicToAlphaNumeric(nameRegex.Match(line).Value);
return new { line, index, setId, groupName, animalName };
})
.ToList() // Optional, cache the results to reduce re-entry of the above logic during sorting.
.OrderBy(x => x.setId).ThenBy(x => x.groupName).ThenBy(x => x.animalName).ThenBy(x => x.index)
.Select(x => x.line) // discard the classifications
.ToList(); // Deliberate evaluation of the sort expression here prevent re-evaluation later
For those who look closely, or are new to LINQ, you can see that I have used an overload of Select to expose the index from the underlying enumerator. This is how we can access the overall sequence without deliberately iterating though the whole set first.
This solves the problem that Array.Sort
and IComparable
can not, if we didn't use IEnumerable.Select
in this way then we would have to use a for
loop to iterate through the items...
The following output includes the classification data so you can see how it works:
Test Categorized Sorting output:
0,,,000: <some data>
0,,,001: <some data else>
0,,,002: </some data>
0,,,003: <Animal>
1,Group="Bird 001",Name="Eagle",015: <Animal Name="Eagle" Group="Bird 1" Colour="Brown" />
1,Group="Bird 001",Name="Falcon",013: <Animal Name="Falcon" Group="Bird 1" Colour="Brown" />
1,Group="Bird 002",Name="Duck",014: <Animal Name="Duck" Group="Bird 2" Colour="White" />
1,Group="Canidae",Name="Dog002",011: <Animal Name="Dog2" Group="Canidae" Colour="Black" />
1,Group="Canidae",Name="Dog020",010: <Animal Name="Dog20" Group="Canidae" Colour="Black" />
1,Group="Equidae",Name="Horse001",007: <Animal Name="Horse1" Group="Equidae" Colour="Black" />
1,Group="Equidae",Name="Horse002",009: <Animal Name="Horse2" Group="Equidae" Colour="White" />
1,Group="Equidae",Name="Horse010",008: <Animal Name="Horse10" Group="Equidae" Colour="White" />
1,Group="Equidae",Name="Zebra",005: <Animal Name="Zebra" Group="Equidae" Colour="Black and White" />
1,Group="Feline",Name="Cat",012: <Animal Name="Cat" Group="Feline" Colour="Various" />
1,Group="Feline",Name="Lion",004: <Animal Name="Lion" Group="Feline" Colour="Brown" />
1,Group="Fish",Name="Shark",016: <Animal Name="Shark" Group="Fish" Colour="Gray" />
1,Group="Fish",Name="Tuna",006: <Animal Name="Tuna" Group="Fish" Colour="Gray" />
1,Group="Rodent",Name="Mouse",017: <Animal Name="Mouse" Group="Rodent" Colour="Brown" />
2,,,018: </Animal>
2,,,019: <some other data>
2,,,020: <bla bla bla>
2,,,021: </some other data>
This sorting logic is so unique to this particular input that there is little value in trying to package the sort logic into an IComparable
implementation, even if you could. The IEnumerable
logic can however be abstracted into a custom Enumeration which will look pretty, but it doesn't make the code any easier to maintain. Instead it hides the logic from the user and this can have a negative impact on your team's ability to understand and maintain this logic in the long term: https://dotnetfiddle.net/lxI16F
public static IEnumerable<string> SortAnimalsWithoutDisturbingTheOtherLines(IEnumerable<string> lines)
{
// Helpers for the top level grouping set
int setId = 0;
string sortSetPrefix = "<Animal Name=";
bool isInSortSet = false;
// Regex Helpers for the attribute grouping sets
var groupRegex = new System.Text.RegularExpressions.Regex(@"Group=\""([^\""]*)\""");
var nameRegex = new System.Text.RegularExpressions.Regex(@"Name=\""([^\""]*)\""");
foreach(var line in lines.Select((line, index) =>
{
var lineInSortSet = line.Trim().StartsWith(sortSetPrefix);
if (lineInSortSet != isInSortSet)
{
setId ++;
isInSortSet = lineInSortSet;
}
// NOTE: Match.Value is NOT the captured value, but the whole value: Group="Equidae"
// This will produce the correct results for less code/effort
var groupName = LexicographicToAlphaNumeric(groupRegex.Match(line).Value);
var animalName = LexicographicToAlphaNumeric(nameRegex.Match(line).Value);
return new { line, index, setId, groupName, animalName };
})
.ToList() // Cache the results to reduce re-entry of the above logic during sorting.
.OrderBy(x => x.setId).ThenBy(x => x.groupName).ThenBy(x => x.animalName).ThenBy(x => x.index)
.Select(x => x.line)
)
yield return line;
}
Usage, close to .Sort()
:
var sortedData = SortAnimalsWithoutDisturbingTheOtherLines(somedata).ToList(); // ToList() => Deliberate evaluation of the sort expression here prevent re-evaluation
Notes:
It doesn't make sense to expose this logic as an extension of a List<string>
because the implementation is too specific. You might do this for some sort of business domain type, but not a list of strings, that would be an anti-pattern.
If you need to support numbers larger than 3 digits, then either change the default size
implementation of the abhorrently named LexicographicToAlphaNumeric
or provide a value to the size
parameter
var animalName = LexicographicToAlphaNumeric(nameRegex.Match(line).Value, 7);
We can't know inside the function what maximum length is appropriate without first parsing ALL of the possible values, which would not be as efficient.