Algorithm Explained: Determine if two strings have a substring in common

Problem statement: Given two strings, determine if they have a substring in common.

Example: Do “hello world” and “world” have a substring in common? Yes, they both have the substring “world”.

Approach

What are all the substrings of string?

The string “word” is four characters long. It contains 10 substrings between length 1 and 4. Here are the 10 substrings:

LengthSubstrings
4word
3wor, ord
2wo, or, rd
1w, o, r, d

At first glance, it may seem like we need to loop through all substrings in one string and check if that substring is contained in the other string. But we can do better.

First, the problem statement is only asking if the two strings have at least one substring in common. It’s not asking us to return the shared substrings.

Second, notice that single characters are substrings. All other substrings are composed of these single characters.

Therefore, the problem can be reduced down to checking if two strings have a single character in common.

Attempt 1: loop + string.Contains()

We can loop through string1’s characters and check if string2 contains that character. We can exit immediately upon finding a match:

foreach char in string1: if string2.Contains(char): return true return false
Code language: plaintext (plaintext)

Strings are arrays of characters. String.Contains() loops over all the characters in the array and returns true if the character exists.

In other words, it’s a nested loop. This is inefficient. In the worst case, it’s looping over string2’s characters M times, where M is the length of string1. It’s an O(n^2) algorithm.

For example, let’s say we are given “word” and “blah”. It would loop over all characters in “blah” four times:

Outer loopInner loop
wb, l, a, h
ob, l, a, h
rb, l, a, h
db, l, a, h

Attempt 2: loop + lookup

We can make this more efficient by saving characters from one string in a lookup. Then loop over the other string and use the lookup to check for a match.

hashset = {} foreach char in string1: hashset.Add(char) foreach char in string2: if hashset.Contains(char): return true return false
Code language: plaintext (plaintext)

Doing a lookup on a hashset is an O(1) operation. We are looping over each string exactly once, making this an O(n) algorithm. This is an order of magnitude improvement over the Attempt 1 O(n^2) algorithm in theory. In practice, using a hashset adds overhead cost. On short strings, it will actually be slower than the O(n^2) algorithm. I’ll show a performance comparison at the end of this article using various input sizes.

Test Cases

The following parameterized unit test has 6 test cases, starting with invalid input:

[DataRow("", "", false)] [DataRow(null, null, false)] [DataRow("aaa", "bbb", false)] [DataRow("aaa", "AAA", false)] [DataRow("aaa", "aAA", true)] [DataRow("aAA", "aaa", true)] [TestMethod] public void HaveACommonSubstringTest(string s1, string s2, bool expected) { //arrange and act var actual = Algorithm.HaveACommonSubstring(s1, s2); //assert Assert.AreEqual(expected, actual); }
Code language: C# (cs)

Code

using System.Collections.Generic; using System.Linq; public class Algorithm { public static bool HaveACommonSubstring(string s1, string s2) { if (string.IsNullOrEmpty(s1) || string.IsNullOrEmpty(s2)) return false; var set = new HashSet<char>(s1.Select(c => c)); foreach(var c in s2) { if (set.Contains(c)) return true; } return false; } }
Code language: C# (cs)

Performance comparison between the O(n) and O(n^2) algorithms in practice

This is a worst case scenario performance test. In the worst case scenario, the two strings don’t share a single character, which means the algorithm has to look at a every character in both strings. It’s testing strings with lengths from 26 to 260,000.

public void PerformanceTest() { int size = 1; StringBuilder sbS1 = new StringBuilder(); for(char a = 'a'; a <= 'z'; a++) { sbS1.Append(new string(a, size)); } StringBuilder sbS2 = new StringBuilder(); for (char a = 'A'; a <= 'Z'; a++) { sbS2.Append(new string(a, size)); } var s1 = sbS1.ToString(); var s2 = sbS2.ToString(); Stopwatch sw = new Stopwatch(); sw.Start(); Algorithm.LoopAndLookup(s1, s2); sw.Stop(); Console.WriteLine($"O(n) elapsed={sw.ElapsedMilliseconds}"); sw.Reset(); sw.Start(); Algorithm.LoopAndContains(s1, s2); sw.Stop(); Console.WriteLine($"O(n^2) elapsed={sw.ElapsedMilliseconds}"); sw.Reset(); }
Code language: C# (cs)

Here are the results:

String lengthO(n) algorithm total MSO(n^2) algorithm total MS
2640
26040
2,60040
13,00059
26,000637
260,000174,210

The overhead of using the hashset in the O(n) algorithm adds about 4 milliseconds. This is a constant.

The breakpoint where the O(n) starts to become faster than the O(n^2) algorithm is somewhere around a length of 13,000. After that point, the O(n^2) starts to become significantly slower.

This is a good reminder that Big-O analysis doesn’t give you the full picture when comparing algorithms. Big-O analysis is all about comparing the growth rates of algorithms. In theory, O(n) algorithms should always grow slower than O(n^2) algorithms. In practice, there may be a big constant that Big-O analysis ignores, and it may require large input for the theoretically faster algorithm to actually be faster.

The key is to know the potential input size you’re dealing with. If you know you’re dealing with small input, keep the code as simple as possible and don’t bother optimizing.

Leave a Comment