Beginner Bioinformatics in Python — Part 4
This is a series of blog posts about my experience taking the beginner-level bioinformatics course on Coursera, the problems solved, and the Python code written along the way. The purpose of this blog is to document my experiences for my future self, not to optimize for readability.
If you’ve not read the previous parts, I would recommend starting there.
The initiation of replication is mediated by DnaA, a protein that binds to a short segment within the ori known as a DnaA box. You can think of the DnaA box as a message within the DNA sequence telling DnaA to bind at that location. The more DnaA boxes there are, the easier it is for DnaA to bind there. If you remember the frequent words problem we talked about in an earlier post, its goal was to figure out where the DnaA box lies.
However, in real-life DNA, mutations keep happening. Therefore, instead of finding only exact copies of a DnaA box in the DNA, you may also find approximate matches. In addition, you may find approximate matches of its reverse complement.
For example, if the DnaA box is “AGCCT”, there might be exact copies, approximate matches such as “CGCCT”, the reverse complement “AGGCT”, or approximate matches of the reverse complement such as “AGTCT”.
An approximate match might differ from the pattern by 1 character or more. The number of differing characters is called the Hamming Distance.
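To make the example above concrete, here is a minimal sketch that derives the reverse complement of “AGCCT” and counts how many characters each approximate match differs by. The names COMPLEMENT, reverse_complement and count_differences are my own placeholders for illustration, not something from the course.

```python
# Hypothetical helpers, named here only for illustration.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(pattern):
    # Walk the pattern backwards and swap each base for its complement.
    return "".join(COMPLEMENT[base] for base in reversed(pattern))

def count_differences(p, q):
    # Number of positions at which two equal-length strings differ.
    return sum(a != b for a, b in zip(p, q))

box = "AGCCT"
rc = reverse_complement(box)
print(rc)                                # AGGCT
print(count_differences(box, "CGCCT"))   # 1
print(count_differences(rc, "AGTCT"))    # 1
```

Both approximate matches in the example are exactly 1 character away from the string they approximate.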
Problem 1: Hamming Distance
Given 2 strings of equal length, the Hamming Distance between the 2 strings is the number of mismatched characters between the 2 strings. A mismatch occurs at position ‘n’ if the nth character of string 1 and the nth character of string 2 are different.
This brings us to the next problem. Given 2 strings of equal length, find the Hamming Distance between the 2 strings.
Here is the code
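from functools import reduce  # reduce is not a builtin in Python 3
def HammingDistance(p, q):
    # Map each position to 1 on a mismatch and 0 on a match, then add them up.
    return reduce(lambda a, b: a + b, map(lambda x, y: int(x != y), p, q), 0)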
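For comparison, here is a shorter sketch of the same idea using zip and sum. The name hamming_distance and the explicit length check are my own additions; the check is there because the definition only applies to strings of equal length, and zip would otherwise silently ignore the extra characters.

```python
def hamming_distance(p, q):
    # The definition assumes equal-length strings, so fail loudly otherwise.
    if len(p) != len(q):
        raise ValueError("strings must have equal length")
    # True counts as 1 and False as 0 when summed.
    return sum(a != b for a, b in zip(p, q))

print(hamming_distance("AGCCT", "CGCCT"))              # 1
print(hamming_distance("GGGCCGTTGGT", "GGACCGTTGAC"))  # 3
```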
The Hamming Distance is important because it can help us find if there are approximate matches of a pattern in a DNA string, based on our observation that the DnaA box may appear with slight variations.
Problem 2: Approximate Pattern Count
The last problem of Week 2 was finding the approximate pattern count. This is the same as the Pattern Count we solved, but with a maximum Hamming Distance threshold.
Given input strings Text and Pattern as well as an integer d, we extend the definition of PatternCount to the function ApproximatePatternCount(Pattern, Text, d). This function computes the number of occurrences of Pattern in Text with at most d mismatches.
def pattern_count_with_mismatch(text, pattern, mismatch_threshold):
    # Count the windows of text that are within mismatch_threshold mismatches of pattern.
    k = len(pattern)
    return sum(1 for i in range(len(text) - k + 1) if HammingDistance(text[i:i + k], pattern) <= mismatch_threshold)
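To see which windows are actually being counted, here is a small self-contained sketch that returns the starting positions of the approximate matches instead of just their number. The function name approximate_match_positions and the input strings are my own choices for illustration.

```python
def approximate_match_positions(text, pattern, d):
    # Slide a window of len(pattern) across text and keep every start index
    # whose window has at most d mismatches against pattern.
    k = len(pattern)
    positions = []
    for i in range(len(text) - k + 1):
        window = text[i:i + k]
        mismatches = sum(a != b for a, b in zip(window, pattern))
        if mismatches <= d:
            positions.append(i)
    return positions

positions = approximate_match_positions("TTTAGAGCCTTCAGAGG", "GAGG", 2)
print(positions)       # [2, 4, 11, 13]
print(len(positions))  # 4
```

The length of the returned list is the approximate pattern count, so with at most 2 mismatches “GAGG” occurs 4 times in this text.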
This brings us to the end of week 2, and of the problem of finding the ori and the DnaA box. In week 3, the course moves on to more complex patterns in DNA and the conclusions we can draw from them.