List<T>.Contains() performance
Published: 9/16/2012 10:35:55 PM
I was troubleshooting a performance issue with a co-worker today when we discovered that a particular call to List<int>.Contains() was causing a huge performance hit. At that moment it struck me that I never used to consider List<T> or Dictionary<TKey, TValue> to be bad when it comes to performance -- I just assumed they were OK. Based on today's findings, however, I decided to put that assumption to the test.
It turns out that Contains() performs a linear search when it tries to find out if a value exists in the list. Considering this, it seems quite obvious that the performance depends on where in the list the value is located. If we search for a value that happens to be first in the list, the method returns faster than if the value were last. To test this theory I wrote a simple console application that creates a list of 3,000 integers and then makes 600,000 calls to Contains(). In the first scenario I search for the first item, in the second scenario for the last item, and in the third scenario for a random number. The source code is as follows:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

namespace ListTContains
{
    class Program
    {
        static void Main(string[] args)
        {
            List<int> list = Enumerable.Range(1, 3000).ToList();
            Random random = new Random();
            Stopwatch sw = new Stopwatch();

            // Scenario 1: search for the first item in the list.
            sw.Start();
            for (int i = 0; i < 600000; i++)
                list.Contains(1);
            sw.Stop();
            Console.WriteLine("List<T>.Contains(1): {0}", sw.ElapsedMilliseconds);

            // Scenario 2: search for the last item in the list.
            sw.Reset();
            sw.Start();
            for (int i = 0; i < 600000; i++)
                list.Contains(3000);
            sw.Stop();
            Console.WriteLine("List<T>.Contains(3000): {0}", sw.ElapsedMilliseconds);

            // Scenario 3: search for a random number.
            sw.Reset();
            sw.Start();
            for (int i = 0; i < 600000; i++)
                list.Contains(random.Next(3000));
            sw.Stop();
            Console.WriteLine("List<T>.Contains(random): {0}", sw.ElapsedMilliseconds);
        }
    }
}
The output from the program:
List<T>.Contains(1): 16
List<T>.Contains(3000): 9107
List<T>.Contains(random): 4494
The output is pretty much what I expected: finding the first item takes almost no time at all, and the more items Contains() has to compare before it finds the right one, the longer the call takes.
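To see why the position matters so much, it helps to picture what Contains() is doing. The sketch below is not the actual framework source, just a simplified equivalent of the kind of linear search List<T>.Contains() performs, using the default equality comparer; the real implementation differs in the details but does the same amount of work.

using System.Collections.Generic;

static class LinearSearchSketch
{
    // Conceptual stand-in for List<T>.Contains(): compare the items one by
    // one from the front until a match is found. A value near the front is
    // found after a few comparisons (the Contains(1) case); a value near the
    // end forces a scan of almost the whole list (the Contains(3000) case).
    public static bool ContainsLinear<T>(List<T> items, T value)
    {
        var comparer = EqualityComparer<T>.Default;
        for (int i = 0; i < items.Count; i++)
        {
            if (comparer.Equals(items[i], value))
                return true;
        }
        return false;
    }
}

In other words, the cost of each call grows with the number of elements that have to be compared, which is exactly what the timings show.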
So what is the solution? The HashSet<T> class was introduced in .NET Framework 3.5 and is described as providing "high-performance set operations". That sounded promising, so I modified my application to use a HashSet<T> instead of a List<T> and took it for a run.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

namespace ListTContains
{
    class Program
    {
        static void Main(string[] args)
        {
            HashSet<int> list = new HashSet<int>(Enumerable.Range(1, 3000));
            Random random = new Random();
            Stopwatch sw = new Stopwatch();

            // Scenario 1: search for the first value added to the set.
            sw.Start();
            for (int i = 0; i < 600000; i++)
                list.Contains(1);
            sw.Stop();
            Console.WriteLine("HashSet<T>.Contains(1): {0}", sw.ElapsedMilliseconds);

            // Scenario 2: search for the last value added to the set.
            sw.Reset();
            sw.Start();
            for (int i = 0; i < 600000; i++)
                list.Contains(3000);
            sw.Stop();
            Console.WriteLine("HashSet<T>.Contains(3000): {0}", sw.ElapsedMilliseconds);

            // Scenario 3: search for a random number.
            sw.Reset();
            sw.Start();
            for (int i = 0; i < 600000; i++)
                list.Contains(random.Next(3000));
            sw.Stop();
            Console.WriteLine("HashSet<T>.Contains(random): {0}", sw.ElapsedMilliseconds);
        }
    }
}
And the output:
HashSet<T>.Contains(1): 11
HashSet<T>.Contains(3000): 11
HashSet<T>.Contains(random): 24
That's quite a neat performance improvement for minimal effort! The random searches now take 24 ms instead of 4494 ms. List<T>.Contains() just made it onto my list of things to look at when troubleshooting performance.
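In existing code the change is often just as small as it looks above. As a rough sketch (the class and names here are made up for illustration, not taken from the code we were troubleshooting), a lookup-heavy piece of code can build a HashSet<T> once from whatever sequence it already has and do all of its membership tests against that:

using System.Collections.Generic;
using System.Linq;

class AllowedIdFilter
{
    // Hypothetical example: membership tests go against a HashSet<int>
    // built once from the source sequence. HashSet<T>.Contains() is a
    // hash lookup, so its cost does not grow with the number of elements
    // the way List<T>.Contains() does.
    private readonly HashSet<int> allowedIds;

    public AllowedIdFilter(IEnumerable<int> allowedIds)
    {
        // The HashSet<T>(IEnumerable<T>) constructor copies the values once up front.
        this.allowedIds = new HashSet<int>(allowedIds);
    }

    public IEnumerable<int> KeepAllowed(IEnumerable<int> candidates)
    {
        // Each Contains() call here is a hash lookup instead of a linear scan.
        return candidates.Where(id => allowedIds.Contains(id));
    }
}

The trade-offs are the extra memory for the set and the one-time cost of building it, which pays for itself as soon as more than a handful of Contains() calls are made.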