Beginner Bioinformatics in Python — Part 8

Abhinav
4 min read · Oct 9, 2022


This is a series of blog posts about my experience undergoing the beginner level bioinformatics course on Coursera, the problems solved, and the Python code generated. The purpose of this blog is for me to document my experiences for my future self, not to optimize for readability.

If you’ve not read the previous parts, I would recommend starting there.

Link to Part 1
Link to Part 2
Link to Part 3
Link to Part 4
Link to Part 5
Link to Part 6
Link to Part 7

This post will continue to cover Week 4 (final week) of the course. This is the third-last blog post of the series. The topic we’ll cover in detail is Randomized Motif Search. The next post will cover Gibbs Sampling, and the last post will be the conclusion and closing thoughts.

Randomized Motif Search

Adding pseudo-counts to the Greedy Motif Search made it more accurate, and we're now getting reasonable results. However, the algorithm we've developed is not exact, and unless we check every possible combination of k-mers, we cannot be sure that the result it gives us is the best one. In fact, given enough time, we could design edge cases where the Greedy Motif Search (even with pseudo-counts) gives us unsatisfactory results.

To tackle this problem, we'll try a type of randomized algorithm called a Monte Carlo algorithm. These algorithms use quick calculations to give us the right answer with a small probability of failure. If we repeat the algorithm over multiple independent runs, we can minimize the chance of failure: roughly speaking, if a single run fails with probability p, the chance that all of N independent runs fail is p^N.

Motifs for Profile Matrix and DNA Strings

Given a Profile Matrix and a DNA string, we could find the best suited Motif in that DNA string by simply iterating over all k-mers.

Similarly, if we’re given a Profile Matrix and a list of DNA strings, we can find the best suited Motif in each of the DNA strings by applying this logic to each of the DNA strings. This is just about individually finding the best suited Motif in each string, not finding a common Motif.

For example, given a Profile Matrix and a list of DNA strings as input, we get one Motif per DNA string as output. These motifs are chosen independently of each other; tying them together is something we'll tackle later.

To solve this problem, we can use the ProfileMostProbableKmer function that we’d written earlier —

def Motifs(Profile, Dna):
    # For each DNA string, pick its Profile-most probable k-mer (k = number of columns in the Profile Matrix)
    return list(map(lambda text: ProfileMostProbableKmer(text, len(Profile[next(iter(Profile))]), Profile), Dna))
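
To see this in action, here's a small usage sketch with a hypothetical Profile Matrix and DNA strings (the input is made up for illustration, not taken from the course). It reuses the Motifs function above, and includes a minimal re-sketch of ProfileMostProbableKmer only so the snippet runs on its own; the real implementation lives in the earlier parts.

# A minimal re-sketch of ProfileMostProbableKmer from the earlier parts,
# included only so this example runs on its own
def ProfileMostProbableKmer(text, k, profile):
    best_kmer, best_probability = text[0:k], -1.0
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]
        probability = 1.0
        for j, nucleotide in enumerate(kmer):
            probability *= profile[nucleotide][j]
        if probability > best_probability:
            best_kmer, best_probability = kmer, probability
    return best_kmer


# A hypothetical Profile Matrix (each column sums to 1) and DNA strings
profile = {
    "A": [0.8, 0.0, 0.0, 0.2],
    "C": [0.0, 0.6, 0.2, 0.0],
    "G": [0.2, 0.2, 0.8, 0.0],
    "T": [0.0, 0.2, 0.0, 0.8],
}
dna = ["TTACCTTAAC", "GATGTCTGTC", "ACGGCGTTAG", "CCCTAACGAG", "CGTCAGAGGT"]

print(Motifs(profile, dna))  # ['ACCT', 'ATGT', 'GCGT', 'ACGA', 'AGGT']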

Understanding the basics of Randomized Motif Search

Given a list of DNA strings, let's pick a randomly chosen k-mer from each of them, and call these our initialMotifs.

Given these Motifs, we can generate the Profile Matrix — Profile(initialMotifs)

From this Profile Matrix, we can generate a new collection of k-mers —

Motifs(Profile(initialMotifs), DNA)

Our hope is that this newly generated set of Motifs using Motifs(Profile(initialMotifs), DNA) would have a better score than initialMotifs .

We can then use these newly calculated Motifs to again generate the Profile Matrix —

Profile(Motifs(Profile(initialMotifs), DNA))

and use that to get an even better set of Motifs —

Motifs(Profile(Motifs(Profile(initialMotifs), DNA)), DNA)

We continue doing this iteration as long as our score is improving.
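
Written out as a loop, the idea looks roughly like this (just a sketch with an illustrative function name, assuming the Profile, Motifs, and Score functions from this series) —

def iterate_until_converged(initial_motifs, Dna):
    motifs = initial_motifs
    while True:
        # Re-derive the motifs from the profile of the current motifs
        new_motifs = Motifs(Profile(motifs), Dna)
        # Stop once the score is no longer improving (a lower score is better)
        if Score(new_motifs) >= Score(motifs):
            return motifs
        motifs = new_motifs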

Implementing the Randomized Motif Search

We start by implementing a function called RandomMotifs(Dna, k, t), where Dna is the list of DNA strings, k is the length of the motif, and t is the number of DNA strings.

Then we implement another function called RandomizedMotifSearch(Dna, k, t) that starts with the initial random Motifs and applies the iterative process described above. For Randomized Motif Search, we'll use many functions that we've written up to this point, including Motifs (written just above), ProfileWithPseudocounts, and Score, which will internally use many other functions that we've implemented before.

def RandomizedMotifSearch(Dna, k, t):
    # Start from a random set of motifs and refine them until the score stops improving
    random_motifs = RandomMotifs(Dna, k, t)
    return converge_to_optimum_motifs(random_motifs, Dna)


def converge_to_optimum_motifs(current_motifs, dna):
    # Build a profile (with pseudocounts) from the current motifs and re-derive the motifs from it
    newly_computed_motifs = Motifs(ProfileWithPseudocounts(current_motifs), dna)
    # Stop once the score is no longer improving (a lower score is better)
    if Score(newly_computed_motifs) >= Score(current_motifs):
        return current_motifs
    return converge_to_optimum_motifs(newly_computed_motifs, dna)

import random


def RandomMotifs(Dna, k, t):
    # Pick one random k-mer from each of the t DNA strings
    return list(map(lambda text: select_random_motif(text, k), Dna))


def select_random_motif(dna_text, motif_length):
    # Choose a random starting position and slice out a k-mer of the given length
    position = random.randint(0, len(dna_text) - motif_length)
    return dna_text[position: position + motif_length]
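
Here's a quick usage sketch with made-up input (it assumes ProfileWithPseudocounts and Score from the earlier parts are defined alongside the functions above). Since the starting motifs are random, the output will vary from run to run —

dna = [
    "TTACCTTAACCATA",
    "GATGTCTGTCACGT",
    "ACGGCGTTAGACCT",
    "CCCTAACGAGTGAA",
    "CGTCAGAGGTCCAT",
]

best_motifs = RandomizedMotifSearch(dna, 4, len(dna))
print(best_motifs, Score(best_motifs))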

Best Randomized Motifs over Multiple Runs

We can run the Randomized Motif Search over N independent runs (where N is an input). Then we can pick the set of Motifs with the best score out of this lot as our output. This helps minimize the chances of missing the best set of Motifs.

This program is fairly simple —

def best_randomised_motifs(dna_array, motif_length, runs):
    # Run RandomizedMotifSearch once, then recursively compare against the remaining runs
    best_of_current_run = RandomizedMotifSearch(dna_array, motif_length, len(dna_array))
    if runs == 1:
        return best_of_current_run
    best_of_other_runs = best_randomised_motifs(dna_array, motif_length, runs - 1)
    # Keep whichever set of motifs has the lower (better) score
    return best_of_current_run if Score(best_of_current_run) < Score(best_of_other_runs) else best_of_other_runs
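
As a usage sketch (reusing the dna list from the previous example), we can run 100 independent searches and keep the best-scoring result —

best_motifs = best_randomised_motifs(dna, 4, 100)
print(best_motifs, Score(best_motifs))

One thing to keep in mind: since this is written recursively, a very large number of runs can hit Python's default recursion limit; rewriting it as a simple loop would avoid that.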

What Next?

We’re coming to the end of the course. We’ll try to wrap up in 2 more blog posts, one on Gibbs sampling, and the other summarizing what we learnt about Programming and Biology by taking this course.

You can read Part 9 here.
