Beginner Bioinformatics in Python — Part 6

Abhinav
6 min readSep 20, 2022

--

This is a series of blog posts about my experience undergoing the beginner level bioinformatics course on Coursera, the problems solved, and the Python Codes generated.This purpose of this blog is for me to document my experiences for my future self, and not to optimize for readability.

If you’ve not read the previous parts, I would recommend starting there.

Link to Part 1
Link to Part 2
Link to Part 3
Link to Part 4
Link to Part 5

Note: This part starts tackling algorithms that will seem a little disjointed to you. Honestly, this chapter of the course is not well created, and the theory is a little too difficult for me to explain well in the blog. However, once you go through all of it, towards the end it starts making sense.

The Motif Finding Problem

We’ve figured out that if we’re given a list of Motifs, we can find the consensus string. But finding the motifs is no easy task. Therefore, we apply various techniques to find the motif. We will now focus on a few algorithms that try to find motifs and to calculate their accuracy.

Motif Probability

In the previous post, we learnt how to find the Profile Matrix given a list of Motifs. For example:

Motifs: 
AGA, TCA, TGA
The Profile Matrix:
{'A':[1/3,0,1], 'T':[2/3,0,0], 'C':[0,1/3,0], 'G':[0,2/3,0]}

But what does the Profile Matrix represent? The Profile Matrix represents the probability of a particular nucleotide occurring at a particular position in a Motif.

Therefore, if we use the Profile Matrix above as a Probability Matrix to generate the motifs, one third of the motifs will have A in the first position, and none of them will have T in the second position, because its probability is zero.

For example, ‘A’ has the probability .33 of occurring at the first position, G as .67 probability of occurring at the second position, and A has the probability 1 of occurring at the third position. Therefore, the probability of this profile matrix generating the string “AGA” is 0.33*0.67 = 0.22.

Similarly, it’s probability of generating TCA is 0.44, and the probability of generating AAA is 0. So if you generate 10,000 random motifs from this matrix, you would get around 44 that are ‘TCA’, and none that are ‘AAA’.

Another example given in the course —

Assignment: Given a k-mer and a profile matrix as input, calculate the probability of the Profile Matrix generating that k-mer.

Sample Input: 
AGA
{
'A':[0.33,0,1],
'T':[0.67,0,0],
'C':[0,0.33,0],
'G':[0,0.67,0]
}
Sample Output:
0.22

Let’s write an algorithm for this problem. We don’t want to pay attention to things like rounding errors, so our unit test will have to do an approximate match. Here’s the functional version of the solution. The course needed the method to be horribly named Pr , so we’ve another method that contains the logic, and is just called by Pr, so we know what’s going on.

Here’s the screenshot —

And the code —

def probability_of_generation(motif, profile_matrix):
return reduce(lambda x, y: x * y, [profile_matrix[motif[i]][i] for i in range(len(motif))], 1)

Most Probable Motif

We have a long DNA string, in which we have to find the best Motif of size k for a given profile matrix, ie. with the highest probability of generation. For example, the DNA string could be of length 9, and the k-mer is of size 3. How would we do that?

To do that, we would need to go through the entire string, find all substrings of length k, and apply our probability function of the previous section. The one that has the best probability of matching would be our best k-mer motif.

For Example —

Sample Input: 
AGATGACCA
3
{
'A':[0.33,0,1],
'T':[0.67,0,0],
'C':[0,0.33,0],
'G':[0,0.67,0]
}
Sample Output:
TGA

Explanation: In the above input string, the substrings of length 3 are “AGA”, “GAT”, “ATG”, “TGA”, “GAC”, “ACC”, “CCA”.

Out of these, the one with the highest probability match with the profile matrix is TGA.

We can solve this problem using the probability function created in the previous section —

def ProfileMostProbableKmer(text, k, profile):
kmers = [text[iterator:iterator + k] for iterator in range(len(text) - k + 1)]
probabilities = [probability_of_generation(kmer, profile) for kmer in kmers]
return kmers[probabilities.index(max(probabilities))]

This code could possibly be made even more concise by applying a function that can find the best k-mer based on maximum probability, but I find it hard to think of something right now. So we’ll go with this function for now.

Greedy Motif Search

Let’s go back to what we were discussing in the beginning of this whole chapter in the previous blog post. We had a bunch of DNAs, and certain proteins would bind to them. These proteins would bind to specific sequences that we called Motifs.

But if we’re just given a string of DNAs, and the length of the motif, but not the profile matrix or the information on which k-mers the proteins will bind to, is there a way for us to predict the Motifs?

We can predict the motifs by finding k-mers in the DNA that are the most similar to each other. In multiple strings of random characters, finding strings that are similar to each other is not a coincidence. For example, take the 3 strings — “ACGT”, “CCGT”, “AGTA”. The substrings ‘CGT’, ‘CGT’, ‘AGT’ are too similar to each other to be a coincidence. Therefore, it is highly likely that they are the Motifs to which the proteins will bind to.

We can convert this into a greedy algorithm —

  1. Iterate over all k-mers of the first DNA string. For each iteration, take that k-mer as our motif. Construct a profile matrix with just this substring, setting the probabilities to 1 for each character of this substring
  2. Move on to the second DNA string, and find the closest match to the first string. Based on this match, create another profile matrix.
  3. Move to the third and repeat.
  4. Repeat this whole process for all k-mers of the first DNA. Give a closeness score to each motif combination. The one with the best score wins. For scoring the motifs, we can use the score algorithm developed in the previous post.
  5. In the beginning, we can initialise our best motifs to the zeroeth substring of each string, so we have something to compare against.

It is difficult to explain this part without making the entire blog post about Greedy Motif Search. So I’ll link the blog post of an earlier student, which is far superior to the explanation given by the course itself. You can read the blog here.

Here is an example of the input and output. Here 3 is the size of the Motif, and 5 is the number of DNA strings —

Sample Input:3 5
GGCGTTCAGGCA
AAGAATCAGTCA
CAAGGAGTTCGC
CACGTCAATCAC
CAATAATATTCG
Sample Output:CAG
CAG
CAA
CAA
CAA

This was yet another case where the iterative and the functional solutions are almost equal in length. If you’re not used to reading functional solutions, or have a little less knowledge of Python, you might find the iterative one easier to understand.

I still prefer the functional solution, purely because of its declarativeness. At every step, we’re specifying what we want, rather than adjusting variables in loops.

Iterative Solution:

Functional Solution:

The code —

def GreedyMotifSearch(Dna, k, t):
motif_combinations = [best_motifs_for_given_iteration(Dna, k, i) for i in range(len(Dna[0]) - k + 1)]
motif_scores = [Score(motifs) for motifs in motif_combinations]
return motif_combinations[motif_scores.index(min(motif_scores))]

def best_motifs_for_given_iteration(dna, substring_length, index):
substring = dna[0][index: index + substring_length]
profile_matrix = Profile([substring])
return recursive_compute_best_motifs(dna, substring_length, [substring], profile_matrix, 1)

def recursive_compute_best_motifs(dna, substring_length, previous_motifs, profile_matrix, row_index):
if row_index == len(dna):
return previous_motifs
motif_for_row_index = ProfileMostProbableKmer(dna[row_index], substring_length, profile_matrix)
current_motifs = previous_motifs + [motif_for_row_index]
return recursive_compute_best_motifs(dna, substring_length, current_motifs, Profile(current_motifs), row_index + 1)

What Next?

This blog post got quite heavy. I understand that the explanations I offered were inadequate in a lot of places.

Leave aside performance, Greedy Motif Search has another fatal flaw. Think of what it could be? (Hint: When a k-mer perfectly matches the profile matrix, but at one position, has a character that has a probability zero).

We will tackle this in the next chapter.

--

--