Beginner Bioinformatics in Python — Part 7

This is a series of blog posts about my experience undergoing the beginner level bioinformatics course on Coursera, the problems solved, and the Python Codes generated.This purpose of this blog is for me to document my experiences for my future self, and not to optimize for readability.

If you’ve not read the previous parts, I would recommend starting there.

Link to Part 1
Link to Part 2
Link to Part 3
Link to Part 4
Link to Part 5
Link to Part 6

This post will track Week 4 of the coursera course. It is an extension of the algorithms written in Week 3. However, there will be many theoretical topics where I will only give a cursory explanation, rather than the full theory. This is because I don’t have enough technical knowledge or depth of understanding to explain the biology to anyone else. I don’t quite grasp many things myself. So we’ll focus more on documenting the code in this blog, less on biology.

Laplace’s Rule of Succession

Consider the following Profile Matrix —

{
‘A’:[0.33,0,1,0],
‘T’:[0.67,0,0,0],
‘C’:[0,0.33,0,1],
‘G’:[0,0.67,0,0]
}

The Profile Most Probable Kmer for it isTGAC . Therefore, if you take the Motif TGTC , that should be considered a fairly good match right? However, if we apply our probability function of the previous chapter to this Motif, its probability of generation will come to 0 (Zero)!

That definitely sounds incorrect, because TGTC differs from the highest probability Motif only by one character. It is so because at the second last position, the probability of generation of any character except A is zero.

This means we should change our algorithm to give some leeway for scenarios where everything is matching the profile matrix, but 1 or 2 characters are different.

That is done using Laplace’s rule of succession. When we generate our count matrix, we add the number 1 to each entry in the matrix. Then, to calculate the probability, we divide each entry by n+4 instead of n, where n is the number of strings.

Example —

Let’s start with the count matrix of the example we gave above:

The Original Count Matrix — 
{
‘A’:[1,0,3,0],
‘T’:[2,0,0,0],
‘C’:[0,1,0,3],
‘G’:[0,2,0,0]
}

We transform this by adding 1 to each entry as per Laplace’s Rule of Succession:

Transformed Count Matrix - 
{
‘A’:[2,1,4,1],
‘T’:[3,1,1,1],
‘C’:[1,2,1,4],
‘G’:[1,3,1,1]
}

The new Profile Matrix after the dampening added by the Laplace’s Rule —

The new Profile Matrix - {
‘A’:[2/7,1/7,4/7,1/7],
‘T’:[3/7,1/7,1/7,1/7],
‘C’:[1/7,2/7,1/7,4/7],
‘G’:[1/7,3/7,1/7,1/7]
}

This will give strings that are slightly different from the original string atleast some positive probability of generation, not zero.

Assignment: Count with Pseudocounts

This requires us to do the first transformation mentioned above. This is not a difficult challenge, and can be achieved by just modifying our count method a little bit —

def CountWithPseudocounts(Motifs):
motifs_count = Count(Motifs)
return {key: add_pseudocount_toarray(value) for key, value in motifs_count.items()}


def add_pseudocount_toarray(motif_count_array):
return list(map(lambda x: x + 1, motif_count_array))

Assignment 2: Profile with Pseudocounts

This is step 2 of the transformation I explained above — the part about generating the profile matrix according to Laplace’s Rule.

def ProfileWithPseudocounts(Motifs):
motifs_pseudocounts = CountWithPseudocounts(Motifs)
divisor = len(Motifs) + 4
return {key: list(map(lambda x: x / divisor, value)) for key, value in motifs_pseudocounts.items()}

Greedy Motif Search with Pseudocounts

The Greedy Motif Search that we’ve discussed in the previous post has the same problem that we solved with Laplace’s Rule of Succession.

Now that we’ve created a Profile matrix with Pseudocounts, let’s apply this new Profile matrix to our Greedy Motif Search and generate better results.

Example:

Sample Input:
3 5
GGCGTTCAGGCA
AAGAATCAGTCA
CAAGGAGTTCGC
CACGTCAATCAC
CAATAATATTCG
Sample Output:
TTC
ATC
TTC
ATC
TTC

This does not require any effort from our side. We can simply reuse GreedyMotifSearch, but use the Profile with Pseudocounts instead of the older Profile method.

def GreedyMotifSearchWithPseudocounts(Dna, k, t):
motif_combinations = [best_motifs_for_given_iteration_with_pseudocounts(Dna, k, i) for i in range(len(Dna[0]) - k + 1)]
motif_scores = [Score(motifs) for motifs in motif_combinations]
return motif_combinations[motif_scores.index(min(motif_scores))]

def best_motifs_for_given_iteration_with_pseudocounts(dna, substring_length, index):
substring = dna[0][index: index + substring_length]
profile_matrix = ProfileWithPseudocounts([substring])
return recursive_compute_best_motifs_with_pseudocounts(dna, substring_length, [substring], profile_matrix, 1)

def recursive_compute_best_motifs_with_pseudocounts(dna, substring_length, previous_motifs, profile_matrix, row_index):
if row_index == len(dna):
return previous_motifs
motif_for_row_index = ProfileMostProbableKmer(dna[row_index], substring_length, profile_matrix)
current_motifs = previous_motifs + [motif_for_row_index]
return recursive_compute_best_motifs(dna, substring_length, current_motifs, ProfileWithPseudocounts(current_motifs), row_index + 1)

Duplication

If you’ve been following this blog series, you’ll realise that we’ve generated a whole set of duplicate methods — Count, Profile, Greedy search, and all the functions that were internally used by these functions.

There are 2 places where duplication can be removed. One is the calculation of Profile Matrix itself, and the second is the methods such as Greedy Motif search which use this Profile Matrix. We’ll only remove the duplication at the second place for now.

This can be achieved by injecting the function that calculates the Profile Matrix, instead of directly calling it. With this approach, we allow any profile function (even different ones in future) to be used for our Greedy Motif Search.

We simply create another function that expects the profile matrix generation function as a parameter, and passes it down to all child methods. Thus GreedyMotifSearch and GreedyMotifSearchWithPseudocounts just need to call this common method.

def greedy_search_with_custom_profile_function(Dna, k, t, profile_function):
motif_combinations = [best_motifs_for_given_iteration(Dna, k, i, profile_function) for i in range(len(Dna[0]) - k + 1)]
motif_scores = [Score(motifs) for motifs in motif_combinations]
return motif_combinations[motif_scores.index(min(motif_scores))]
def GreedyMotifSearchWithPseudocounts(Dna, k, t):
return greedy_search_with_custom_profile_function(Dna, k, t, ProfileWithPseudocounts)

What Next

This is the last week of the course, and we only have one more set of algorithms to cover. In the next post, we’ll learn about Randomized Motif Search. That allows us to generate hundreds (or potentially thousands) of runs of the algorithm, allowing us to get even better results for our Motif Search. The randomization part helps us in coming up with different possibilities, and the repeated iterations help us minimize the possibility of failure. Let’s explore all of it in the next part.

Read the the next part here.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store