The simplest (and most efficient) way to remove duplicates from a list is to iterate over the list, keep track of items you’ve seen with a HashSet, and discard any item you’ve already seen. I’ll show four ways to implement this O(n) algorithm. At the end, I’ll explain a few inefficient approaches to avoid.
Remove duplicates with ToHashSet() and ToList()
The most concise way to remove duplicates from a list is by using the LINQ methods ToHashSet() and ToList(), like this:
using System.Linq;
var list = new List<int>
{
    1, 1, 1, 2, 2, 2, 3
};

var dedupedList = list.ToHashSet().ToList();
Note: The non-LINQ equivalent is new List<int>(new HashSet<int>(list)).
ToHashSet() iterates over the list, adding items to a HashSet. Only unique items are actually kept, which effectively eliminates duplicates. Then ToList() iterates over the HashSet and adds the items to a new list. At the end, you have a new list without the duplicates.
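To make the result concrete, printing dedupedList from the snippet above gives the following (this output line is just for illustration and isn’t part of the technique itself):

Console.WriteLine(string.Join(",", dedupedList));
// Output: 1,2,3 (insertion order in practice, though HashSet<T> doesn't formally guarantee enumeration order)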
Remove duplicates in a loop
This is the most efficient and most flexible approach. You iterate through the list, adding items to a HashSet. If the item hasn’t been seen yet, you add it to a new list.
var list = new List<int>
{
    1, 1, 1, 2, 2, 2, 3
};

var dedupedList = new List<int>(list.Count); // minimizes resizing when adding items
var hashSet = new HashSet<int>(list.Count);

foreach (var item in list)
{
    if (hashSet.Add(item))
        dedupedList.Add(item);
}
HashSet.Add() returns true if it was able to add the item, which tells you that the item hasn’t been seen before, so you add it to the new list.
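If you use this pattern in more than one place, it’s easy to pull into a reusable helper. Here’s a minimal sketch of what that could look like (the extension method name ToDedupedList is my own, not something from the BCL):

using System.Collections.Generic;

public static class DedupeExtensions
{
    // Returns a new list containing the first occurrence of each item, preserving order.
    public static List<T> ToDedupedList<T>(this IEnumerable<T> source)
    {
        var dedupedList = new List<T>();
        var hashSet = new HashSet<T>();

        foreach (var item in source)
        {
            if (hashSet.Add(item)) // Add() returns true only the first time an item is seen
                dedupedList.Add(item);
        }

        return dedupedList;
    }
}

With that in place, deduping becomes a one-liner: var dedupedList = list.ToDedupedList();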
Remove duplicates in-place with List.RemoveAll()
You can remove duplicates in-place by using List.RemoveAll() with a HashSet. This is the best option if you want to modify the original list instead of creating a new list. Here’s how:
var list = new List<int>
{
    1, 1, 1, 2, 2, 2, 3
};

var hashSet = new HashSet<int>();
list.RemoveAll(i => hashSet.Add(i) == false);
Note: Since this is a negated condition (“remove when Add() is false”), it’s a bit of a mental twist. To make it clearer, I prefer the “== false” comparison over “!Add()”.
With List.RemoveAll(), you pass in a predicate that returns true for items that should be removed. HashSet.Add() returns false when an item is a duplicate, so checking for false removes exactly the duplicates.
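For comparison, here’s the same call written with the ! operator instead. It behaves identically; it’s just the form I find easier to misread:

var hashSet = new HashSet<int>();
list.RemoveAll(i => !hashSet.Add(i)); // remove the item when Add() fails, i.e. when it's a duplicate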
Remove duplicate objects based on a property
You can remove duplicate objects based on a property by using the LINQ methods DistinctBy() and ToList(). DistinctBy() is available in .NET 6+. Here’s an example:
using System.Linq;
var list = new List<Person>
{
    new Person() { Name = "Bob", Age = 25 },
    new Person() { Name = "Bob", Age = 40 },
    new Person() { Name = "Linda", Age = 39 }
};

var dedupedList = list.DistinctBy(p => p.Name).ToList();
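DistinctBy() keeps the first item it sees for each key, so in this example the result contains Bob (Age 25) and Linda (Age 39); the second Bob is dropped.

Both this example and the next one assume a simple Person class along these lines (a minimal sketch; any class with Name and Age properties will do):

public class Person
{
    public string Name { get; set; }
    public int Age { get; set; }
}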
An alternative that doesn’t require LINQ (or writing comparer methods) is to loop over the list and keep track of property values with a HashSet. In my benchmarks, this was about 2x faster than the LINQ approach.
var list = new List<Person>
{
    new Person() { Name = "Bob", Age = 25 },
    new Person() { Name = "Bob", Age = 40 },
    new Person() { Name = "Linda", Age = 39 }
};

var dedupedList = new List<Person>(list.Count);
var hashSet = new HashSet<string>(list.Count);

foreach (var item in list)
{
    if (hashSet.Add(item.Name))
        dedupedList.Add(item);
}
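For reference, the “comparer methods” route mentioned above means writing an IEqualityComparer<Person> and passing it to Distinct(). Here’s a minimal sketch of that option, using a comparer class name of my own (PersonNameComparer):

using System.Collections.Generic;
using System.Linq;

public class PersonNameComparer : IEqualityComparer<Person>
{
    // Two Person objects are considered equal if they have the same Name
    public bool Equals(Person x, Person y) => x?.Name == y?.Name;
    public int GetHashCode(Person p) => p?.Name?.GetHashCode() ?? 0;
}

// Usage:
var dedupedList = list.Distinct(new PersonNameComparer()).ToList();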
Inefficient approaches to avoid
The best you can do when removing duplicates is an O(n) algorithm, because you have to look at each element at least once. There are other approaches that are worse in theory and in practice. They only seem OK when run on very small input. Avoid them.
Here are two examples of inefficient approaches to avoid:
- Sort the list so the duplicates end up next to each other, then iterate and remove adjacent duplicates (sketched below). Sorting has an average-case time complexity of Θ(n log n), which dominates the whole approach.
- Iterate through the list, add items to a new list, and use List.Contains() to check whether the item was already added (sketched below). Contains() is a linear scan, so this is a nested loop with a worst-case time complexity of O(n²).
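For reference, here are minimal sketches of what these two approaches look like (your exact implementation may differ, but the shape is the same):

// Sort-dedupe (avoid): sorting dominates at Θ(n log n), and it loses the original order
var sorted = new List<int>(list);
sorted.Sort();

var sortDeduped = new List<int>();
for (int i = 0; i < sorted.Count; i++)
{
    if (i == 0 || sorted[i] != sorted[i - 1])
        sortDeduped.Add(sorted[i]);
}

// Contains-dedupe (avoid): List.Contains() is a linear scan, so this is O(n²) overall
var containsDeduped = new List<int>();
foreach (var item in list)
{
    if (!containsDeduped.Contains(item))
        containsDeduped.Add(item);
}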
These two approaches are worse than the O(n) algorithm in theory, but how much worse are they in practice? To compare performance, I ran all three on lists of 1k, 10k, and 100k items, where half the items are duplicates. Here are the results:
Scenario: Duplicates (Half) - average time in milliseconds

List size    Foreach    Sort-dedupe    Contains-dedupe
1,000        0.03       0.41           0.11
10,000       0.13       3.38           5.72
100,000      1.70       37.03          498.91
Sure enough, the practical results line up with the theoretical Big-O comparison. At 100k items, the Contains() approach is nearly 300x slower than the foreach approach (498.91 ms vs 1.70 ms).