Distinct by part of the string in linq

Question

Given this collection:

var list = new [] {
    "1.one",
    "2. two",
    "no number",
    "2.duplicate",
    "300. three hundred",
    "4-ignore this"};

How can I get the subset of items that start with a number followed by a dot (regex @"^\d+(?=\.)"), keeping only one item per distinct number? That is:

{"1.one", "2. two", "300. three hundred"}

UPDATE:

My attempt was to pass an IEqualityComparer to the Distinct method. I borrowed this GenericCompare class and tried the following code, to no avail:

var pattern = @"^\d+(?=\.)";
var comparer = new GenericCompare<string>(s => Regex.Match(s, pattern).Value);
list.Where(f => Regex.IsMatch(f, pattern)).Distinct(comparer);

Daniel J.G. · Accepted Answer · 2014-08-27T08:05:22.200

If you fancy a LINQ approach, you can add a named capture group to the regex, filter the items that match it, group them by the captured number, and take the first string of each group. I like the readability of this solution, but I wouldn't be surprised if there is a more efficient way of eliminating the duplicates; let's see if somebody else comes up with a different approach.

Something like this:

list.Where(s => regex.IsMatch(s))
    .GroupBy(s => regex.Match(s).Groups["num"].Value)
    .Select(g => g.First())

You can give it a try with this sample:

public class Program
{
    private static readonly Regex regex = new Regex(@"^(?<num>\d+)\.", RegexOptions.Compiled);

    public static void Main()
    {
        var list = new [] {
            "1.one",
            "2. two",
            "no number",
            "2.duplicate",
            "300. three hundred",
            "4-ignore this"
        };

        var distinctWithNumbers = list.Where(s => regex.IsMatch(s))
                                      .GroupBy(s => regex.Match(s).Groups["num"].Value)
                                      .Select(g => g.First());

        distinctWithNumbers.ToList().ForEach(Console.WriteLine);
        Console.ReadKey();
    }       
}

You can try this approach in this fiddle.

As pointed out by @orad in the comments, MoreLinq has a DistinctBy() extension that can eliminate the duplicates directly, instead of grouping and then taking the first item of each group:

var distinctWithNumbers = list.Where(s => regex.IsMatch(s))
                              .DistinctBy(s => regex.Match(s).Groups["num"].Value);

Try it in this fiddle
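For readers on .NET 6 or later, a DistinctBy method with the same shape ships in System.Linq, so the MoreLinq dependency may not be needed. The following is a minimal sketch under that assumption; the class name DistinctByDemo and the method name DistinctByLeadingNumber are made up for illustration and do not come from the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class DistinctByDemo
{
    // Same pattern as in the answer: leading digits captured as "num", followed by a dot.
    private static readonly Regex NumRegex =
        new Regex(@"^(?<num>\d+)\.", RegexOptions.Compiled);

    // Keeps the first item seen for each leading number; items without one are dropped.
    public static List<string> DistinctByLeadingNumber(IEnumerable<string> items)
    {
        return items
            .Where(s => NumRegex.IsMatch(s))
            // Assumes .NET 6+, where Enumerable.DistinctBy is built in.
            .DistinctBy(s => NumRegex.Match(s).Groups["num"].Value)
            .ToList();
    }

    public static void Main()
    {
        var list = new[]
        {
            "1.one", "2. two", "no number",
            "2.duplicate", "300. three hundred", "4-ignore this"
        };

        foreach (var s in DistinctByLeadingNumber(list))
        {
            Console.WriteLine(s);
        }
    }
}
```

If MoreLinq is also referenced in the same file, the two DistinctBy extensions can collide, so it is probably best to pick one of them.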

EDIT

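If you want to use your comparer, you need to implement GetHashCode so that it uses the expression as well. Distinct checks hash codes before it calls Equals, so two strings with different hash codes are never treated as duplicates: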

public int GetHashCode(T obj)
{
    return _expr.Invoke(obj).GetHashCode();
}

Then you can use the comparer with a lambda function that takes a string and gets the number using the regex:

var comparer = new GenericCompare<string>(s => regex.Match(s).Groups["num"].Value);
var distinctWithNumbers = list.Where(s => regex.IsMatch(s)).Distinct(comparer);

I have created another fiddle with this approach.
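Since the GenericCompare class itself is not shown in this thread, here is a hedged sketch of what a key-based comparer with a matching GetHashCode could look like. KeyComparer<T> is a hypothetical stand-in written for this example, not the class from the question; only the _expr field name is borrowed from the snippet above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical stand-in for GenericCompare<T>: compares items by a projected key.
public sealed class KeyComparer<T> : IEqualityComparer<T>
{
    private readonly Func<T, object> _expr;

    public KeyComparer(Func<T, object> expr)
    {
        _expr = expr ?? throw new ArgumentNullException(nameof(expr));
    }

    // Two items are equal when their keys are equal.
    public bool Equals(T x, T y) => object.Equals(_expr(x), _expr(y));

    // The hash code must come from the same key, otherwise Distinct never calls Equals.
    public int GetHashCode(T obj) => _expr(obj)?.GetHashCode() ?? 0;
}

public static class ComparerDemo
{
    private static readonly Regex NumRegex =
        new Regex(@"^(?<num>\d+)\.", RegexOptions.Compiled);

    public static List<string> DistinctByLeadingNumber(IEnumerable<string> items)
    {
        var comparer = new KeyComparer<string>(s => NumRegex.Match(s).Groups["num"].Value);
        return items.Where(s => NumRegex.IsMatch(s)).Distinct(comparer).ToList();
    }

    public static void Main()
    {
        var list = new[]
        {
            "1.one", "2. two", "no number",
            "2.duplicate", "300. three hundred", "4-ignore this"
        };

        DistinctByLeadingNumber(list).ForEach(Console.WriteLine);
    }
}
```

The design point is that Equals and GetHashCode both go through the same key function, so items that compare equal always hash equal.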

Using lookahead regex

You can use any of these approaches with the lookahead regex @"^\d+(?=\.)".

Just replace the lambda expression that gets the "num" group, s => regex.Match(s).Groups["num"].Value, with an expression that gets the whole match, s => regex.Match(s).Value.

Updated fiddle here.

Great, this seems to work. I will mark it as answer if nothing better comes. Also please see my update on using a comparer. Thanks. — orad, Aug 26 '14 at 20:12
Minor note, instead of regex groups you could use regex lookahead like `@"^\d+(?=\.)"` so that only number part is matched. — orad, Aug 26 '14 at 21:08
Simpler solution: use `DistinctBy` from MoreLINQ. See Jon Skeet's answer [here](http://stackoverflow.com/a/1300116/450913). — orad, Aug 26 '14 at 22:33
I came up with a solution that does not run the regex twice on the collection by using a Dictionary. See my [answer](http://stackoverflow.com/a/25516465/450913). — orad, Aug 26 '14 at 23:05
Nice, using the `DistinctBy` extension the code is really nice and simple. I guess you may want to measure the different approaches and then decide whether to go for performance or readability. (Also, you could avoid the parsing in your solution, using the match as the key, an empty string when no match) — Daniel J.G., Aug 27 '14 at 08:11

orad · Answer 2 · 2014-09-11T18:05:07.953

(I could mark this as answer too)

This solution works without duplicate regex runs:

var regex = new Regex(@"^\d+(?=\.)", RegexOptions.Compiled);
list.Select(i => {
    var m = regex.Match(i);
    return new KeyValuePair<int, string>( m.Success ? Int32.Parse(m.Value) : -1, i );
})
.Where(i => i.Key > -1)
.GroupBy(i => i.Key)
.Select(g => g.First().Value);

Run it in this fiddle.

edited Sep 11 '14 at 18:05

answered Aug 26 '14 at 22:59

Like @daniel-j-g said, parsing can be avoided here, using the match as the key, an empty string when no match. – orad Aug 27 '14 at 20:50
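The suggestion in the comment above could be sketched roughly as follows: the matched text itself serves as the key, and an empty key marks items with no match, so the regex still runs only once per item and no Int32.Parse call is needed. The names SinglePassDemo and DistinctByLeadingNumber are hypothetical and only used for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SinglePassDemo
{
    // Lookahead pattern from the question: only the digits are part of the match.
    private static readonly Regex NumRegex =
        new Regex(@"^\d+(?=\.)", RegexOptions.Compiled);

    public static List<string> DistinctByLeadingNumber(IEnumerable<string> items)
    {
        return items
            // One regex run per item; Match.Value is an empty string when there is no match.
            .Select(s => new { Key = NumRegex.Match(s).Value, Item = s })
            .Where(x => x.Key.Length > 0)
            .GroupBy(x => x.Key)
            .Select(g => g.First().Item)
            .ToList();
    }

    public static void Main()
    {
        var list = new[]
        {
            "1.one", "2. two", "no number",
            "2.duplicate", "300. three hundred", "4-ignore this"
        };

        DistinctByLeadingNumber(list).ForEach(Console.WriteLine);
    }
}
```

One trade-off to be aware of: with string keys, "02.x" and "2.x" count as different numbers, whereas the Int32.Parse version treats them as the same. On the other hand, the string version cannot overflow on very long digit runs.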

score 1 · Answer 3 · answered Jan 30 '19 at 15:07

Your solution is good enough.

You can also use LINQ query syntax with the let keyword to avoid running the regex twice, as follows:

var result =
        from kvp in
        (
            from s in source
            let m = regex.Match(s)
            where m.Success
            select new KeyValuePair<int, string>(int.Parse(m.Value), s)
        )
        group kvp by kvp.Key into gr
        select gr.First().Value;

score -1 · Answer 4 · answered Aug 26 '14 at 19:51

Something like this should work:

List<string> c = new List<string>()
{
    "1.one",
    "2. two",
    "no number",
    "2.duplicate",
    "300. three hundred",
    "4-ignore this"
};

c.Where(i =>
{
    var match = Regex.Match(i, @"^\d+(?=\.)");
    return match.Success;
});

Ian P

This will also include `"2.duplicate"`. The key in my question is how to get **Distinct** numbers. – orad Aug 26 '14 at 19:57
