HackerRank – Two Strings solution

In this article, I’ll explain how to solve the Two Strings algorithm problem on HackerRank.

Problem statement: Given two strings, determine if they have a substring in common. The strings can have up to 100k characters.

Example: Given “hello world” and “world”, do they have a substring in common? Yes, they many substrings in common. One of them is “world”.

Approach

What are all the substrings of a string? The string “word” is four characters long. It contains 10 substrings between length 1 and 4. Here are the 10 substrings:

LengthSubstrings
4word
3wor, ord
2wo, or, rd
1w, o, r, d

At first glance, it may seem like we need to loop through all substrings in one string and check if that substring is contained in the other string. But we can do better.

  1. It’s only asking if two strings have any common substring. It’s not asking us to return the shared substrings.
  2. Single characters are substrings.

Therefore, we can reduce this problem down to checking if the two strings share at least one common character.

Algorithm 1: loop + string.Contains()

We can loop through string1’s characters and check if string2 contains that character. We can exit immediately upon finding a match. The following pseudocode shows this algorithm:

foreach char in string1:
   if string2.Contains(char):
      return true

return false
Code language: plaintext (plaintext)

Strings are arrays of characters. String.Contains() checks if a string has a character by looping over all the characters in the character array.

In other words, it’s a nested loop. This seems inefficient. In the worst case, it’s looping over string2’s characters M times, where M is the length of string1. It’s an O(n^2) algorithm.

For example, let’s say we are given “word” and “blah”. It would loop over all characters in “blah” four times:

Outer loopInner loop
wb, l, a, h
ob, l, a, h
rb, l, a, h
db, l, a, h

Algorithm 2: Loop + Lookup

We can make this more efficient by saving characters from one string in a lookup. Then loop over the other string and use the lookup to check for a match. The following pseudocode shows this algorithm:

hashset = {}
foreach char in string1:
    hashset.Add(char)

foreach char in string2:
    if hashset.Contains(char):
       return true

return false
Code language: plaintext (plaintext)

Doing a lookup on a hashset is an O(1) operation. We are looping over each string exactly once, making this an O(n) algorithm. This is an order of magnitude improvement over the O(n^2) algorithm in theory. In practice, using a hashset adds overhead cost.

Based on performance results, this O(n) algorithm is faster than the O(n^2) algorithm when the strings have over 10k characters. Because the input strings can potentially have up to 100k characters, we’ll use the O(n) algorithm to avoid performance problems.

Code

Here’s an implementation of the O(n) loop + lookup algorithm (written in C#):

using System.Collections.Generic;
using System.Linq;

public class Algorithm
{
	public static bool HaveACommonSubstring(string s1, string s2)
	{
		if (string.IsNullOrEmpty(s1) || string.IsNullOrEmpty(s2))
			return false;

		var set = new HashSet<char>(s1.Select(c => c));

		foreach(var c in s2)
		{
			if (set.Contains(c))
				return true;
		}

		return false;
	}
}
Code language: C# (cs)

Performance comparison between the O(n) and O(n^2) algorithms in practice

This is a worst case scenario performance test. In the worst case scenario, the two strings don’t share a single character, which means the algorithm has to look at a every character in both strings. It’s testing strings with lengths from 26 to 260,000.

public void PerformanceTest()
{
	int size = 1;

	StringBuilder sbS1 = new StringBuilder();
	for(char a = 'a'; a <= 'z'; a++)
	{
		sbS1.Append(new string(a, size));
	}

	StringBuilder sbS2 = new StringBuilder();
	for (char a = 'A'; a <= 'Z'; a++)
	{
		sbS2.Append(new string(a, size));
	}

	var s1 = sbS1.ToString();
	var s2 = sbS2.ToString();

	Stopwatch sw = new Stopwatch();
	sw.Start();
	Algorithm.LoopAndLookup(s1, s2);
	sw.Stop();
	Console.WriteLine($"O(n) elapsed={sw.ElapsedMilliseconds}");
	sw.Reset();

	sw.Start();
	Algorithm.LoopAndContains(s1, s2);
	sw.Stop();
	Console.WriteLine($"O(n^2) elapsed={sw.ElapsedMilliseconds}");
	sw.Reset();

}
Code language: C# (cs)

Here are the results:

String lengthO(n) algorithm total MSO(n^2) algorithm total MS
2640
26040
2,60040
13,00059
26,000637
260,000174,210

The overhead of using the hashset in the O(n) algorithm adds about 4 milliseconds. This is a constant.

The breakpoint where the O(n) starts to become faster than the O(n^2) algorithm is somewhere before 13k characters (probably around 10k). After that point, the O(n^2) starts to become significantly slower.

This is a good reminder that Big-O analysis doesn’t give you the full picture when comparing algorithms. Big-O analysis is all about comparing the growth rates of algorithms. In theory, O(n) algorithms should always grow slower than O(n^2) algorithms. In practice, there may be a big constant that Big-O analysis ignores, and it may require large input for the theoretically faster algorithm to actually be faster.

The key is to know the potential input size you’re dealing with. If you know you’re dealing with small input, keep the code as simple as possible and don’t bother optimizing.