One day I decided to challenge myself by trying to implement multithreaded quicksort. I wanted to see how it would compare to the built-in sort function – Array.Sort().

I came up with two algorithms that were 2-4x faster than Array.Sort():

- Top-down: divide-fork-sort-merge
- Bottom-up: quicksort with fork-on-recursion

After continuing to tinker, in attempts to further optimize, I came across PLINQ.AsParallel().OrderBy(). After reading the MSDN documentation about this, I realized it was the same approach as my divide-fork-sort-merge algorithm. I did a performance test and it was also 2-4x faster than Array.Sort().

In the end, I would use the built-in PLINQ.AsParallel().OrderBy() in production software if the input were relatively large. Otherwise I would use Array.Sort(). In general it’s a good idea to use built-in functionality instead of rolling your own, because it keeps your code clean and simple.

# Primer on quicksort and why I wanted to make it threaded

## What is quicksort?

It’s a divide-and-conquer sorting algorithm that works like this:

```
Pick a pivot
Partition the array around the pivot
left subarray = any element <= pivot
right subarray = any element > pivot
Quicksort(left subarray)
Quicksort(right subarray)
```

Here’s what this looks like:

## Why divide-and-conquer algorithms, like quicksort, benefit from multithreading

Multiple threads help speed things up if:

- The processor has multiple cores, and therefore can run multiple threads concurrently.
- The work can be divided into non-overlapping partitions

Because quicksort divides the array into two non-overlapping subarrays at each step, it meets the second condition, and the work can be parallelized.

# Comparing performance

To compare performance I generated an array with random elements, then copied this array into other arrays for each algorithm I was testing. This was to make sure the algorithms were sorting the exact same sequence of elements. Then I used System.Diagnostics.Stopwatch to measure the elapsed time of each algorithm.

var approach1Array = SortUtility.GenRandomArray<string>(size: 10_000_000); Console.WriteLine("Size " + approach1Array.Length); var approach2Array = new string[approach1Array.Length]; Array.Copy(approach1Array, approach2Array, approach2Array.Length); Stopwatch approach1Stopwatch = new Stopwatch(); approach1Stopwatch.Start(); Array.Sort(approach1Array); approach1Stopwatch.Stop(); Console.WriteLine($"Array.Sort - Is sorted? {SortUtility.IsSorted(approach1Array)}. ElapsedMS={approach1Stopwatch.ElapsedMilliseconds}"); Stopwatch approach2Stopwatch = new Stopwatch(); approach2Stopwatch.Start(); approach2Array = approach2Array.AsParallel().OrderBy(t => t).ToArray(); approach2Stopwatch.Stop(); Console.WriteLine($"PLINQ.Sort - Is sorted? {SortUtility.IsSorted(approach2Array)}. ElapsedMS={approach2Stopwatch.ElapsedMilliseconds}");

Here are the utility functions I used for generating input and verifying sorted order.

public static T[] GenRandomArray<T>(int size = 10000) { var a = new T[size]; Random r = new Random(); for (int i = 0; i < size; i++) { a[i] = (T)Convert.ChangeType(r.Next(Int32.MinValue, Int32.MaxValue), typeof(T)); } return a; } public static bool IsSorted<T>(T[] a) where T : IComparable<T> { if (!a.Any()) return true; var prev = a.First(); for (int i = 1; i < a.Length; i++) { if (a[i].CompareTo(prev) < 0) return false; prev = a[i]; } return true; }

# Bottom-up: quicksort with fork-on-recursion

I made a modification to the quicksort algorithm. After partitioning, it’s quicksorting the left and the right subarrays in their own threads concurrently.

```
Pick a pivot
Partition the array around the pivot
left subarray = any element <= pivot
right subarray = any element > pivot
Fork Quicksort(left subarray)
Fork Quicksort(right subarray)
```

## Diagram

To illustrate this, every time the call tree branches, it is also forking the work.

## Code

public class ThreadedQuickSort<T> where T : IComparable<T> { public async Task QuickSort(T[] arr) { await QuickSort(arr, 0, arr.Length - 1); } private async Task QuickSort(T[] arr, int left, int right) { if (right <= left) return; int lt = left; int gt = right; var pivot = arr[left]; int i = left + 1; while (i <= gt) { int cmp = arr[i].CompareTo(pivot); if (cmp < 0) Swap(arr, lt++, i++); else if (cmp > 0) Swap(arr, i, gt--); else i++; } var t1 = Task.Run(() => QuickSort(arr, left, lt - 1)); var t2 = Task.Run(() => QuickSort(arr, gt + 1, right)); await Task.WhenAll(t1, t2).ConfigureAwait(false); } private void Swap(T[] a, int i, int j) { var swap = a[i]; a[i] = a[j]; a[j] = swap; } }

## Performance

## What doesn’t work

The key problem is the top-level thread needs to know when all child threads have completed. The simplest way I found to do this was by using await/async and Tasks.

I attempted to spawn new threads, and then calling Thread.Join(). With a large enough input, this quickly resulted in OutOfMemoryExceptions.

I tried to use ThreadPool threads. As mentioned above, the top-level thread needs to know about the child threads and when they have completed. This can’t be done with recursion, because there’s a race condition. It can be done using iterative quicksort – using CountdownEvent to signal the top-level waiter – but with this approach you have to partition all the way down to a predetermined limit (let’s say 1024 elements), and then sort those in a new thread. This defeats the purpose of multithreading. The gains in performance come from dividing the work into multiple threads right away.

# Top-down: divide-fork-sort-merge

I randomly thought of this algorithm, wrote it down, then implemented it. Later on I found out that this is approach is the Fork-Join pattern.

```
Divide array into 4 subarrays
For each subarray
Fork Sort(subarray)
4-way merge subarrays
```

## Diagram

## Code

public class ForkJoinSort<T> where T : IComparable<T> { public async Task Sort(T[] a) { var arrs = Divide(a); List<Task> tasks = new List<Task>(); foreach (var arr in arrs) { var tmp = arr; tasks.Add(Task.Run(() => { Array.Sort(tmp); })); } await Task.WhenAll(tasks.ToArray()).ConfigureAwait(false); Merge(a, new List<Arr> { new Arr() { a = arrs[0], ptr = 0 }, new Arr() { a = arrs[1], ptr = 0 }, new Arr() { a = arrs[2], ptr = 0 }, new Arr() { a = arrs[3], ptr = 0 }, }); } private class Arr { public T[] a; public int ptr; } private static void Merge(T[] destArr, List<Arr> arrs) { T minValue; Arr min; for (int i = 0; i < destArr.Length; i++) { var firstArr = arrs.First(); minValue = firstArr.a[firstArr.ptr]; min = firstArr; for (int j = 1; j < arrs.Count; j++) { if (arrs[j].a[arrs[j].ptr].CompareTo(minValue) < 0) { minValue = arrs[j].a[arrs[j].ptr]; min = arrs[j]; } } destArr[i] = minValue; min.ptr++; if (min.ptr >= min.a.Length) { arrs.Remove(min); } } } private List<T[]> Divide(T[] a) { List<T[]> arrs = new List<T[]>(); int divisionSize = a.Length / 4; var a1 = new T[divisionSize]; var a2 = new T[divisionSize]; var a3 = new T[divisionSize]; var a4 = new T[a.Length - (divisionSize * 3)]; Array.Copy(a, 0, a1, 0, a1.Length); Array.Copy(a, divisionSize, a2, 0, a2.Length); Array.Copy(a, divisionSize * 2, a3, 0, a3.Length); Array.Copy(a, divisionSize * 3, a4, 0, a4.Length); return new List<T[]>() { a1, a3, a2, a4 }; } }

## Performance

## What doesn’t work

Divide takes a trivial amount of time, Sort takes 80%, and Merge takes 20% of the time.

It may seem odd that the array is divided into 4 equal parts. The main temptation is to try to partition the array, such that a1 < a2 < a3 < a4, which then makes the merging step trivial. The primary problem is **picking a good pivot is hard. **The same is true for quicksort itself. Why? Because, in order to pick the best pivot, you really need the middle-most element, which would require n^2 comparisons (in other words, you have to sort first in order to pick a good partition).

By random chance, you’ll sometimes end up with the left partition having 95% of the elements, thereby making the multithreading pointless. By random chance, you’ll also end up with the perfect partition sometimes. Therefore, it makes more sense to just evenly partition the arrays.

The other main optimization temptation is to detect “streaks” while merging, and then bulk-copying to the target array. However, this suffers from the same problem as what was mentioned above. In the worst case, the mins will never be pulled from the same array twice in a row. In most cases, the streaks will be small, and not worth the overhead of trying to keep track of “streaks.”

It’s interesting that simplicity is the best approach here due to randomness making “smarter” approaches ineffective.

# PLINQ.AsParallel().OrderBy()

This is the built-in function. It uses the Fork-Join pattern. Here’s how to use it:

arr = arr.AsParallel().OrderBy(t => t).ToArray();

That’s it. Simple.

There are two reasons why I would always choose this over my homemade algorithms:

- It abstract away complexity, making my code very simple
- It
*usually*outperforms my algorithms by a little bit.