Hamming vs Levenshtein: Types of Edit Distance in Genomes

Hamming and Levenshtein Distances are different types of Edit Distances between 2 genome sequences. These are approximate matching algorithms that tell us how different a subsequence is from another reference subsequence.

Hamming Distance

Let’s take 2 Genome Sequences —

`1 - GATTCATC2 - GCTACATC`

The Hamming Distance between the 2 sequences is the number of substitutions required to convert sequence 1 to 2 or vice versa. In the above sequences, the Hamming Distance between the two sequences is 2. The Substitutions are highlighted below.

`1 - GATTCATC2 - GCTACATC`

Levenshtein Distance

The Levenshtein Distance between the 2 sequences is the number of insertions, deletions, or substitutions required to convert sequence 1 to sequence 2. In the above sequences, the Levenshtein Distance is also 2. However, there can be some scenarios where the Levenshtein Distance would be less than the Hamming Distance. An example is given below —

`1 - GTTCATAC2 - GCTACATC`

In the above example, the Hamming Distance between the 2 subsequences would be 5, which is quite high compared to the length of the sequences. However, the Levenshtein is just 3, as highlighted below —

`1 - G-TTCATAC2 - GCTACAT-C`

Levenshtein Distance can also be calculated between 2 sequences of unequal length, which is not true of Hamming Distance. Let’s take another example —

`1 - GTTCATC2 - GCTTCATC`

The Levenshtein Distance between the two sequences is just 1.

`1 - G-TTCATC2 - GCTTCATC`

Usage

During genome sequencing, a genome is cut into multiple small subsequences, and then those are sequenced in parallel. To make sense of the entire genome, they are then compared to the reference Human Genome, to make sense of what the actual genome looks like.

The genome being sequenced will never be exactly the same as the reference genome. So we need an approximate matching algorithm that tells us whether a particular position is the right position for a subsequence, with minimal mismatches.

For example, the genome being sequenced might have a subsequence `CCAGCTTAAT` , whereas at the reference Human Genome might have a subsequence `CGAGCTTAAT` at a particular position. We can then take an educated guess that this is the correct position for this snippet of the genome being sequenced.

Moreover, there can be sequencing errors. So even if there is an exact match between the genome and the reference genome, an approximate matching algorithm is needed to account for errors. In this case again an approximate matching algorithm helps.

Edit Distance acts as a building block in a set of algorithms that help sequence a genome.

--

--

Educator, Founder @ Interleap