C# – Select distinct objects based on a property

There are three ways to select distinct objects based on a property using Linq:

//Simple movies.GroupBy(t => t.YearOfRelease).Select(grp => grp.First()); //More complicated movies.Distinct(new MovieYearComparer()) //Fast and simple, available in .NET 6 (or from open source before that) movies.DistinctBy(t => t.YearOfRelease);
Code language: C# (cs)

These select one movie per year:

Godzilla is one movie from 2014 Jurassic Park is one movie from 1993 Deadpool is one movie from 2016 Inception is one movie from 2010 Joker is one movie from 2019 Black Panther is one movie from 2018
Code language: plaintext (plaintext)

GroupBy() is the simplest because it doesn’t require any additional code. Distinct() is faster but it’s more complicated. DistinctBy() is the fastest and simplest, but requires the most code (it requires .NET 6 or getting the DistinctBy() source code).

In this article, I’ll go into more details about these different approaches and then compare their performance.

These approaches don’t work in EF Core though. I’ll show how to do this in EF Core in the section below called EF Core – Selecting rows based on a distinct column.

Select by multiple properties

To select distinct objects based on multiple properties, pass in an anonymous type with the properties you want.

For example, this is selecting distinct movies per director per year:

var moviesPerYear = movies.GroupBy(t => new { t.YearOfRelease, t.Director }).Select(grp => grp.First()); foreach(var movie in moviesPerYear) { Console.WriteLine($"{movie.Name} is one movie from {movie.Director} in {movie.YearOfRelease}"); }
Code language: C# (cs)

This outputs the following:

Godzilla is one movie from Gareth Edwards in 2014 The Imitation Game is one movie from Morten Tyldum in 2014 Jurassic Park is one movie from Steven Spielberg in 1993 Deadpool is one movie from Tim Miller in 2016 Arrival is one movie from Denis Villenueve in 2016 Inception is one movie from Christopher Nolan in 2010 Joker is one movie from Todd Phillips in 2019 Black Panther is one movie from Ryan Coogler in 2018
Code language: plaintext (plaintext)

Note: If you’re using DistinctBy() you also pass in an anonymous type for selecting by multiple properties, like this DistinctBy(t => new { t.YearOfRelease, t.Director }).

Using Distinct()

Distinct() is good if you want to just select distinct properties / primitives, instead of selecting objects based on distinct properties. For example:

var distinctYears = movies.Select(t => t.YearOfRelease).Distinct();
Code language: C# (cs)

But in the scenario where you want to select objects based on a distinct property, it’s not so simple. This is because by default, Distinct() checks for distinctness by comparing object references, which is not the correct behavior in this scenario. Instead, you have to implement and pass in an IEqualityComparer.

First, implement an IEqualityComparer that compares the specific properties:

public class MovieYearComparer : IEqualityComparer<Movie> { public bool Equals([AllowNull] Movie x, [AllowNull] Movie y) { return x?.YearOfRelease == y?.YearOfRelease; } public int GetHashCode([DisallowNull] Movie obj) { return obj.YearOfRelease.GetHashCode(); } }
Code language: C# (cs)

Then pass an instance of this comparer into Distinct():

var moviesPerYear = movies.Distinct(new MovieYearComparer()); foreach (var movie in moviesPerYear) { Console.WriteLine($"{movie.Name} is one movie from {movie.YearOfRelease}"); }
Code language: C# (cs)

This outputs the following:

Godzilla is one movie from 2014 Jurassic Park is one movie from 1993 Deadpool is one movie from 2016 Inception is one movie from 2010 Joker is one movie from 2019 Black Panther is one movie from 2018
Code language: plaintext (plaintext)

For every combination of properties you want to use, you have to add a new IEqualityComparer class (or perhaps add a setting that controls which properties to look at). This is one of the main reasons why the GroupBy(property) approach is preferred over the Distinct(comparer) approach. The GroupBy(property) approach is more flexible and direct.

DistinctBy() source code before .NET 6

If you can’t use .NET 6 yet, you can use the DistinctBy() extension method source code:

using System; using System.Collections.Generic; public static class LinqExtensions { public static IEnumerable<TSource> DistinctBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector) { HashSet<TKey> seenKeys = new HashSet<TKey>(); foreach (TSource element in source) { if (seenKeys.Add(keySelector(element))) { yield return element; } } } }
Code language: C# (cs)

This was originally from an answer on StackOverflow (which was then put into the MoreLinq open source project), which then almost certainly inspired the DistinctBy() implementation in .NET 6. Note: The performance is practically the same as the .NET 6 version too.

Here’s an example of using this:

var moviesPerYear = movies.DistinctBy(t => t.YearOfRelease); foreach (var movie in moviesPerYear) { Console.WriteLine($"{movie.Name} is one movie from {movie.YearOfRelease}"); }
Code language: C# (cs)

This outputs one movie per year:

Godzilla is one movie from 2014 Jurassic Park is one movie from 1993 Deadpool is one movie from 2016 Inception is one movie from 2010 Joker is one movie from 2019 Black Panther is one movie from 2018
Code language: plaintext (plaintext)

Performance comparison: Distinct() vs GroupBy() vs DistinctBy()

I used Benchmark.NET to compare the performance of the three approaches. I tested with input of 10k, 100k, and 1 million movie objects.

Compare the speed results in the following bar graph:

Benchmark bar graph comparing the performance of three methods: Distinct, DistinctBy, and GroupBy for input sizes of 10k, 100k, and 1 million.

Here are the full benchmark results (including the memory allocation stats):

| Method | N | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated | |----------- |-------- |------------:|------------:|------------:|----------:|----------:|---------:|----------:| | Distinct | 10000 | 363.8 us | 7.21 us | 16.57 us | 49.8047 | 49.8047 | 49.8047 | 227 KB | | GroupBy | 10000 | 795.1 us | 13.00 us | 13.91 us | 116.2109 | 38.0859 | - | 402 KB | | DistinctBy | 10000 | 234.6 us | 2.08 us | 1.95 us | 23.9258 | - | - | 74 KB | | Distinct | 100000 | 3,045.9 us | 64.98 us | 190.57 us | 222.6563 | 214.8438 | 214.8438 | 2,153 KB | | GroupBy | 100000 | 5,570.4 us | 94.12 us | 172.10 us | 453.1250 | 218.7500 | - | 2,269 KB | | DistinctBy | 100000 | 1,952.4 us | 38.02 us | 55.74 us | 23.4375 | - | - | 74 KB | | Distinct | 1000000 | 31,371.2 us | 616.46 us | 822.96 us | 312.5000 | 312.5000 | 312.5000 | 22,739 KB | | GroupBy | 1000000 | 60,840.5 us | 1,214.97 us | 2,252.02 us | 3222.2222 | 1222.2222 | - | 19,766 KB | | DistinctBy | 1000000 | 18,880.4 us | 372.73 us | 348.66 us | - | - | - | 74 KB |
Code language: plaintext (plaintext)

In conclusion, DistinctBy() is the fastest and most memory-efficient approach.

EF Core – Selecting rows based on a distinct column

These approaches don’t work with EF Core. For example, if you try to use the GroupBy(property) approach, you get an error like this:

System.InvalidOperationException: The LINQ expression ‘GroupByShaperExpression: … could not be translated. Either rewrite the query in a form that can be translated

Note: Distinct() gets all rows and checks for distinctness on the client-side, which is undesirable, so I wouldn’t suggest that either.

This makes sense if you think about how GROUP BY works in SQL. It’s used to aggregate rows. You group by a column, and select aggregate values about the group (usually using the other columns).

Instead of using a GROUP BY, you actually want to use a PARTITION BY in this scenario. You partition rows by one or more columns, and then select the first row in each partition.

Here’s the SQL query that uses PARTITION BY to select one movie row per distinct year:

WITH movieGroups AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY YearOfRelease ORDER BY id) rowNum FROM Movies ) SELECT * FROM movieGroups WHERE rowNum = 1
Code language: SQL (Structured Query Language) (sql)

You can execute this as raw SQL, or use the following code that generates an equivalent query:

var moviesByYear = from yearOfRelease in context.Movies.Select(x => x.YearOfRelease).Distinct() from movie in context.Movies .Where(x => x.YearOfRelease == yearOfRelease) .Take(1) select movie; foreach (var movie in moviesByYear) { Console.WriteLine($"{movie.Name} is one movie by {movie.Director} from {movie.YearOfRelease}"); }
Code language: C# (cs)

Executing this outputs the following:

Jurassic Park is one movie made in 1993 Inception is one movie made in 2010 Godzilla is one movie made in 2014 Deadpool is one movie made in 2016 Black Panther is one movie made in 2018 Joker is one movie made in 2019
Code language: plaintext (plaintext)

The generated query for this isn’t exactly the same as the optimal query, but it’s close enough.

Executing the optimal query directly

It’s a good idea to always check the Linq-generated SQL query. If it’s not what you expected, or not optimized enough, you can always write your own query and execute it directly (as a raw query or in a stored proc).

Here’s how you’d execute the optimal query as a raw SQL query:

using (var context = new StreamingServiceContext(connectionString)) { var moviesByYear = context.Movies.FromSqlRaw( @"WITH movieGroups AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY YearOfRelease ORDER BY id) rowNum FROM Movies ) SELECT * FROM movieGroups WHERE rowNum = 1"); foreach (var movie in moviesByYear) { Console.WriteLine($"{movie.Name} is one movie made in {movie.YearOfRelease}"); } }
Code language: C# (cs)

This produces the same results as the Linq-generated query, but it’s faster. In my opinion, it’s also simpler, which means it’s better for maintainability.

Leave a Comment