C# – Select distinct objects based on a property with Linq

There are three ways to select distinct objects based on a property using Linq methods:

//Simple
movies.GroupBy(t => t.YearOfRelease).Select(grp => grp.First());

//More complicated
movies.Distinct(new MovieYearComparer())

//Fast and simple, available in .NET 6 (or from open source before that)
movies.DistinctBy(t => t.YearOfRelease);
Code language: C# (cs)

These select one movie per year:

Godzilla is one movie from 2014
Jurassic Park is one movie from 1993
Deadpool is one movie from 2016
Inception is one movie from 2010
Joker is one movie from 2019
Black Panther is one movie from 2018Code language: plaintext (plaintext)

The simplest option is using GroupBy() because it doesn’t require any additional code. Distinct() is faster but it’s more complicated. DistinctBy() is the fastest and simplest, but requires the most code (it requires .NET 6 or getting the DistinctBy() source code).

In this article, I’ll go into more details about these different approaches and then compare their performance.

These approaches don’t work in EF Core though. I’ll show how to do this in EF Core in the section below called EF Core – Selecting rows based on a distinct column.

Select by multiple properties

To select distinct objects based on multiple properties, pass in an anonymous type with the properties you want.

For example, this is selecting distinct movies per director per year:

var moviesPerYear = movies.GroupBy(t => new { t.YearOfRelease, t.Director }).Select(grp => grp.First());

foreach(var movie in moviesPerYear)
{
	Console.WriteLine($"{movie.Name} is one movie from {movie.Director} in {movie.YearOfRelease}");
}

Code language: C# (cs)

This outputs the following:

Godzilla is one movie from Gareth Edwards in 2014
The Imitation Game is one movie from Morten Tyldum in 2014
Jurassic Park is one movie from Steven Spielberg in 1993
Deadpool is one movie from Tim Miller in 2016
Arrival is one movie from Denis Villenueve in 2016
Inception is one movie from Christopher Nolan in 2010
Joker is one movie from Todd Phillips in 2019
Black Panther is one movie from Ryan Coogler in 2018Code language: plaintext (plaintext)

Note: If you’re using DistinctBy() you also pass in an anonymous type for selecting by multiple properties, like this DistinctBy(t => new { t.YearOfRelease, t.Director }).

Using Distinct()

Distinct() is good if you want to just select distinct properties / primitives, instead of selecting objects based on distinct properties. For example:

var distinctYears = movies.Select(t => t.YearOfRelease).Distinct();
Code language: C# (cs)

But in the scenario where you want to select objects based on a distinct property, it’s not so simple. This is because by default, Distinct() checks for distinctness by comparing object references, which is not the correct behavior in this scenario. Instead, you have to implement and pass in an IEqualityComparer.

First, implement an IEqualityComparer that compares the specific properties:

public class MovieYearComparer : IEqualityComparer<Movie>
{
	public bool Equals([AllowNull] Movie x, [AllowNull] Movie y)
	{
		return x?.YearOfRelease == y?.YearOfRelease;
	}

	public int GetHashCode([DisallowNull] Movie obj)
	{
		return obj.YearOfRelease.GetHashCode();
	}
}
Code language: C# (cs)

Then pass an instance of this comparer into Distinct():

var moviesPerYear = movies.Distinct(new MovieYearComparer());

foreach (var movie in moviesPerYear)
{
	Console.WriteLine($"{movie.Name} is one movie from {movie.YearOfRelease}");
}
Code language: C# (cs)

This outputs the following:

Godzilla is one movie from 2014
Jurassic Park is one movie from 1993
Deadpool is one movie from 2016
Inception is one movie from 2010
Joker is one movie from 2019
Black Panther is one movie from 2018Code language: plaintext (plaintext)

For every combination of properties you want to use, you have to add a new IEqualityComparer class (or perhaps add a setting that controls which properties to look at). This is one of the main reasons why the GroupBy(property) approach is preferred over the Distinct(comparer) approach. The GroupBy(property) approach is more flexible and direct.

DistinctBy() source code before .NET 6

If you can’t use .NET 6 yet, you can use the DistinctBy() extension method source code:

using System;
using System.Collections.Generic;

public static class LinqExtensions
{
	public static IEnumerable<TSource> DistinctBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
	{
		HashSet<TKey> seenKeys = new HashSet<TKey>();
		foreach (TSource element in source)
		{
			if (seenKeys.Add(keySelector(element)))
			{
				yield return element;
			}
		}
	}
}
Code language: C# (cs)

Note: This is a generator method that yields one item at at item, which is good for memory consumption.

This was originally from an answer on StackOverflow (which was then put into the MoreLinq open source project), which then almost certainly inspired the DistinctBy() implementation in .NET 6. Note: The performance is practically the same as the .NET 6 version too.

Here’s an example of using this:

var moviesPerYear = movies.DistinctBy(t => t.YearOfRelease);

foreach (var movie in moviesPerYear)
{
	Console.WriteLine($"{movie.Name} is one movie from {movie.YearOfRelease}");
}
Code language: C# (cs)

This outputs one movie per year:

Godzilla is one movie from 2014
Jurassic Park is one movie from 1993
Deadpool is one movie from 2016
Inception is one movie from 2010
Joker is one movie from 2019
Black Panther is one movie from 2018Code language: plaintext (plaintext)

Performance comparison: Distinct() vs GroupBy() vs DistinctBy()

I used Benchmark.NET to compare the performance of the three approaches. I tested with input of 10k, 100k, and 1 million movie objects.

Compare the speed results in the following bar graph:

Benchmark bar graph comparing the performance of three methods: Distinct, DistinctBy, and GroupBy for input sizes of 10k, 100k, and 1 million.

Here are the full benchmark results (including the memory allocation stats):

|     Method |       N |        Mean |       Error |      StdDev |     Gen 0 |     Gen 1 |    Gen 2 | Allocated |
|----------- |-------- |------------:|------------:|------------:|----------:|----------:|---------:|----------:|
|   Distinct |   10000 |    363.8 us |     7.21 us |    16.57 us |   49.8047 |   49.8047 |  49.8047 |    227 KB |
|    GroupBy |   10000 |    795.1 us |    13.00 us |    13.91 us |  116.2109 |   38.0859 |        - |    402 KB |
| DistinctBy |   10000 |    234.6 us |     2.08 us |     1.95 us |   23.9258 |         - |        - |     74 KB |

|   Distinct |  100000 |  3,045.9 us |    64.98 us |   190.57 us |  222.6563 |  214.8438 | 214.8438 |  2,153 KB |
|    GroupBy |  100000 |  5,570.4 us |    94.12 us |   172.10 us |  453.1250 |  218.7500 |        - |  2,269 KB |
| DistinctBy |  100000 |  1,952.4 us |    38.02 us |    55.74 us |   23.4375 |         - |        - |     74 KB |

|   Distinct | 1000000 | 31,371.2 us |   616.46 us |   822.96 us |  312.5000 |  312.5000 | 312.5000 | 22,739 KB |
|    GroupBy | 1000000 | 60,840.5 us | 1,214.97 us | 2,252.02 us | 3222.2222 | 1222.2222 |        - | 19,766 KB |
| DistinctBy | 1000000 | 18,880.4 us |   372.73 us |   348.66 us |         - |         - |        - |     74 KB |Code language: plaintext (plaintext)

In conclusion, DistinctBy() is the fastest and most memory-efficient approach.

EF Core – Selecting rows based on a distinct column

These approaches don’t work with EF Core. For example, if you try to use the GroupBy(property) approach, you get an error like this:

System.InvalidOperationException: The LINQ expression ‘GroupByShaperExpression: … could not be translated. Either rewrite the query in a form that can be translated

Note: Distinct() gets all rows and checks for distinctness on the client-side, which is undesirable, so I wouldn’t suggest that either.

This makes sense if you think about how GROUP BY works in SQL. It’s used to aggregate rows. You group by a column, and select aggregate values about the group (usually using the other columns).

Instead of using a GROUP BY, you actually want to use a PARTITION BY in this scenario. You partition rows by one or more columns, and then select the first row in each partition.

Here’s the SQL query that uses PARTITION BY to select one movie row per distinct year:

WITH movieGroups AS
    (
    SELECT  *, ROW_NUMBER() OVER (PARTITION BY YearOfRelease ORDER BY id) rowNum
    FROM Movies
    )
SELECT *
FROM movieGroups
WHERE rowNum = 1
Code language: SQL (Structured Query Language) (sql)

You can execute this as raw SQL, or use the following code that generates an equivalent query:

var moviesByYear =
from yearOfRelease in context.Movies.Select(x => x.YearOfRelease).Distinct()
from movie in context.Movies
	.Where(x => x.YearOfRelease == yearOfRelease)
	.Take(1)
select movie;

foreach (var movie in moviesByYear)
{
	Console.WriteLine($"{movie.Name} is one movie by {movie.Director} from {movie.YearOfRelease}");
}
Code language: C# (cs)

Executing this outputs the following:

Jurassic Park is one movie made in 1993
Inception is one movie made in 2010
Godzilla is one movie made in 2014
Deadpool is one movie made in 2016
Black Panther is one movie made in 2018
Joker is one movie made in 2019Code language: plaintext (plaintext)

The generated query for this isn’t exactly the same as the optimal query, but it’s close enough.

Executing the optimal query directly

It’s a good idea to always check the Linq-generated SQL query. If it’s not what you expected, or not optimized enough, you can always write your own query and execute it directly (as a raw query or in a stored proc).

Here’s how you’d execute the optimal query as a raw SQL query:

using (var context = new StreamingServiceContext(connectionString))
{
	var moviesByYear = context.Movies.FromSqlRaw(
		@"WITH movieGroups AS
			(
			SELECT *, ROW_NUMBER() OVER (PARTITION BY YearOfRelease ORDER BY id) rowNum
			FROM Movies
			)
		SELECT *
		FROM movieGroups
		WHERE rowNum = 1");

	foreach (var movie in moviesByYear)
	{
		Console.WriteLine($"{movie.Name} is one movie made in {movie.YearOfRelease}");
	}
}
Code language: C# (cs)

This produces the same results as the Linq-generated query, but it’s faster. In my opinion, it’s also simpler, which means it’s better for maintainability.

Comments are closed.