This is a series of blog posts about my experience taking the beginner-level bioinformatics course on Coursera, the problems solved, and the Python code written along the way. The purpose of this blog is to document my experiences for my future self, not to optimize for readability.
If you’ve not read the previous parts, I would recommend starting there.
Link to Part 1
Link to Part 2
Link to Part 3
Link to Part 4
Link to Part 5
Link to Part 6
Link to Part 7
Link to Part 8
This is the second-to-last post of this blog series, and the last one discussing an algorithm. In this post, we will focus on learning about Gibbs Sampling. The final post of the series will focus on a summary and lessons learnt.
Overall, the Bioinformatics course has done a great job of explaining the basics of bioinformatics. However, in this section (Gibbs Sampling), it does a poor job of explaining why the algorithm works. So in this post, we'll learn the algorithm behind Gibbs Sampling, and I'll try to explain why it works. However, since my own understanding of why it works is itself quite weak, you may not be satisfied with the explanation. Someone suggested this post for understanding Gibbs Sampling in non-technical terms, though I was not able to make sense of it.
Gibbs Sampling
Let’s recap how Randomized Motif Search worked. We chose a set of motifs randomly, created a profile matrix, and then chose the motifs in all DNA strands that were closest to the profile matrix. The problem with this approach is that it converges to an answer very quickly. The final set of motifs depends very strongly on the first profile matrix that we construct, so if that first profile matrix is far from the best possible answer, the final answer might be totally off the mark.
We still want our initial motif selection to be random, but we don't want to converge to an answer too quickly. We want to let the profile matrix evolve with each step, so that it can converge to a better result.
Therefore, after we’ve constructed the Profile Matrix at a particular step, we choose a random index i. We then replace the k-mer only at this position with another k-mer according to a weighted die throw, and repeat the process.
Instead of going into the details right away, we’ll first solve a few problems and create functions that will eventually be used in finding the best motifs using Gibbs Sampling.
Normalize
Given a set of k-mers in a string and their probabilities according to the profile matrix, normalize the probabilities so that their sum is 1.
Example —
Input:
{
"GCGT": 0.1,
"CGTT": 0.15,
"GTTA": 0.25
}
Output:
{
"GCGT": 0.2,
"CGTT": 0.3,
"GTTA": 0.5
}
The solution is fairly simple: divide each entry by the sum of the probabilities. So if the sum of the probabilities is 5, each probability becomes 1/5th of its original value.
The code —
def Normalize(Probabilities):
    sum_of_probabilities = sum(Probabilities.values())
    return {k: v / sum_of_probabilities for k, v in Probabilities.items()}
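As a quick sanity check (a tiny snippet, assuming the Normalize function above is in scope), running it on the example input reproduces the example output:
probabilities = {"GCGT": 0.1, "CGTT": 0.15, "GTTA": 0.25}
print(Normalize(probabilities))
# {'GCGT': 0.2, 'CGTT': 0.3, 'GTTA': 0.5} (up to floating-point rounding)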
Weighted Die
The reason for creating the function above is that we want to create a weighted die out of all the k-mers and their normalized probabilities.
Note that this is different from what we did earlier. In previous chapters, given a profile matrix and a k-mer, we calculated the k-mer's probability. This time, we're given a set of k-mers and their probabilities, and we will create a weighted die that chooses a random k-mer in proportion to its probability.
Example —
Input:
{
"GCGT": 0.2,
"CGTT": 0.3,
"GTTA": 0.5
}
Output:
One of the 3 k-mers above, chosen with the probability listed against it. For example, out of every 100 tries, GTTA should be chosen roughly 50 times, and CGTT roughly 30 times.
The solution to this problem is to pick a range (say between 0 and 1), and assign each of the k-mers a portion of the range in proportion to its probability. In the example above, GCGT would get the range 0 to 0.2, CGTT 0.2 to 0.5, and GTTA 0.5 to 1. Then we generate a random number between 0 and 1, and pick the k-mer whose range contains this number.
This leads to the following code —
import random
from functools import reduce

def WeightedDie(Probabilities):
    key_to_range = key_vs_range(Probabilities)
    random_fraction = random.uniform(0, 1)
    return next(key for key, value in key_to_range.items() if value['lower'] < random_fraction <= value['upper'])

def key_vs_range(Probabilities):
    key_list = Probabilities.keys()
    value_list = map(lambda key: Probabilities[key], key_list)
    upper_bound_list = reduce(lambda result, element: result + [result[-1] + element], value_list, [0])[1:]
    lower_bound_list = [0] + upper_bound_list[:-1]
    lower_and_upper_bounds = list(
        map(lambda lower, upper: {'lower': lower, 'upper': upper}, lower_bound_list, upper_bound_list))
    return dict(zip(key_list, lower_and_upper_bounds))
The most interesting thing about this problem was not the code, which I was able to write after putting in some effort, but figuring out how to write the test for it. Ultimately, I decided to call the function 10,000 times with the same input, and check that the generated results followed the probabilities within roughly a 5% margin. When writing the tests, I did not put in the effort to go functional. Here is the test I wrote —
def test_weighted_die_throw(self):
    output_count = {'x': 0, 'y': 0, 'abc': 0}
    for i in range(10000):
        current_output = WeightedDie({'x': 0.33, 'y': 0.5, 'abc': .17})
        output_count[current_output] = output_count[current_output] + 1
    self.assertTrue(4400 < output_count['y'] < 5500)
    self.assertTrue(2800 < output_count['x'] < 3800)
    self.assertTrue(1200 < output_count['abc'] < 2200)
Weighted Random K-Mer from Text
Now that we have our normalization and weighted die sorted out, here is the next problem that we need to solve —
Given a text, a profile matrix, and length k, calculate the probabilities of all the k-mers in the text, normalize them, and select one randomly by throwing a weighted die.
Example —
Input:
AAACCCAAACCC
{'A': [0.5, 0.1], 'C': [0.3, 0.2], 'G': [0.2, 0.4], 'T': [0.0, 0.3]}
2
Probabilities:
{'AA': 0.05, 'AC': 0.1, 'CC': 0.06, 'CA': 0.03}
Weighted Probabilities:
{'AA': 0.208, 'AC': 0.417, 'CC': 0.250, 'CA': 0.125}
Output:
Any of the k-mers above according to their weighted probabilities
The code —
def ProfileGeneratedString(Text, profile, k):
    # Pr (from an earlier part of this series) gives the probability of a k-mer according to a profile matrix
    text_length = len(Text)
    probabilities = {Text[index:index+k]: Pr(Text[index:index+k], profile) for index in range(text_length - k + 1)}
    weighted_probabilities = Normalize(probabilities)
    return WeightedDie(weighted_probabilities)
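As a usage example (assuming the Pr function from the earlier parts of this series is in scope, alongside the functions above), running it on the example input returns one of the four 2-mers at random:
profile = {'A': [0.5, 0.1], 'C': [0.3, 0.2], 'G': [0.2, 0.4], 'T': [0.0, 0.3]}
print(ProfileGeneratedString("AAACCCAAACCC", profile, 2))
# prints 'AA', 'AC', 'CC' or 'CA'; 'AC' should come up roughly 42% of the time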
Gibbs Sampling Algorithm
Now that we have our utility functions Normalize, WeightedDie, and ProfileGeneratedString, we'll try to understand the algorithm for Gibbs Sampling, and how these functions will be used.
Here is how Gibbs Sampling works —
- Start with randomly selected k-mers (one in each DNA string) just like we did in RandomizedMotifSearch.
- Randomly pick an index i between 0 and t-1, where t is the number of DNA strings. We will replace the motif at this index.
- Temporarily remove the motif at index i.
- Create the profile matrix (with pseudocounts) from the remaining k-mers.
- Create a weighted die from the k-mers of Dna[i] and the Profile Matrix.
- Choose the new k-mer at index i by rolling the weighted die (these last two steps are exactly what ProfileGeneratedString does).
- Repeat the process N times, where N is a sufficiently large integer that we input (for example 100).
The course tries to explain it using the example below —
Initial Random Motifs
ttACCTtaac
gaTGTctgtc
ccgGCGTtag
cactaACGAg
cgtcagAGGT
The selected motifs are marked in bold. The actual correct answer (which we want to reach) is marked in CAPITALS. In the random motifs we picked above, we've picked 3 motifs correctly, and only 2 incorrectly - at index 1 and 3 (0-based indices).
Select a random index
i = 3
Remove the entry at index i
ttACCTtaac        ttACCTtaac
gaTGTctgtc        gaTGTctgtc
ccgGCGTtag -----> ccgGCGTtag
cactaACGAg        ----------
cgtcagAGGT        cgtcagAGGT
Calculate the Profile Matrix for the remaining Motifs with pseudocounts
Motifs:
ACCT
GTct
GCGT
AGGT
Profile Matrix:
A: 3/8, 1/8, 1/8, 1/8
C: 1/8, 3/8, 3/8, 1/8
G: 3/8, 2/8, 3/8, 1/8
T: 1/8, 2/8, 1/8, 5/8
Compute Probabilities for all k-mers:
{
cact: 15/8⁴,
acta: 9/8⁴,
ctaA: 2/8⁴,
taAC: 1/8⁴,
aACG: 9/8⁴,
ACGA: 27/8⁴,
CGAg: 2/8⁴
}
Normalize and choose one using a weighted die:
Let's say the die chose the most probable one - ACGA
Update the k-mers:
ttACCTtaac
gaTGTctgtc
ccgGCGTtag
cactaACGAg
cgtcagAGGT
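To connect this example back to the utility functions, here is a small check of the "normalize and roll the die" step, using the Normalize function above on the probabilities computed in the example (the fractions sum to 65/8⁴, so ACGA gets 27/65 of the total weight):
probabilities = {'cact': 15/8**4, 'acta': 9/8**4, 'ctaA': 2/8**4, 'taAC': 1/8**4,
                 'aACG': 9/8**4, 'ACGA': 27/8**4, 'CGAg': 2/8**4}
normalized = Normalize(probabilities)
print(round(normalized['ACGA'], 3))  # 0.415 -- the weighted die picks ACGA about 42% of the time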
Here is the code:
def generate_new_motifs(dna_strings, kmer_length, current_motifs, index):
    temp_motifs = current_motifs[:index] + current_motifs[index+1:]
    profile_matrix = ProfileWithPseudocounts(temp_motifs)
    new_motif = ProfileGeneratedString(dna_strings[index], profile_matrix, kmer_length)
    return current_motifs[:index] + [new_motif] + current_motifs[index+1:]

def GibbsSampler(Dna, k, t, N):
    # RandomMotifs and ProfileWithPseudocounts were written in the earlier parts of this series
    random_indices = [random.randint(0, t-1) for i in range(N)]
    initial_value = RandomMotifs(Dna, k, t)
    output = list(reduce(
        lambda current_motifs, index: generate_new_motifs(Dna, k, current_motifs, index),
        random_indices,
        initial_value))
    return output
Tests: Writing the test for GibbsSampler was not straightforward, as there was no fixed approach I could go for. Ultimately I chose something that the course uses a little later to verify the function — take a realistic dataset, run the Gibbs Sampler multiple times, and then assert that the score of the best result is below a certain threshold.
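For reference, here is a rough sketch of what such a test could look like. It is not the exact test I wrote: the dataset below is just the toy example from this post, the number of restarts and the threshold are placeholders, and it assumes the Score function from the earlier parts of this series is in scope:
def test_gibbs_sampler_finds_low_scoring_motifs(self):
    dna = ["ttACCTtaac", "gaTGTctgtc", "ccgGCGTtag", "cactaACGAg", "cgtcagAGGT"]
    k, t, N = 4, 5, 100
    # run the sampler with several random restarts and keep the best-scoring motifs
    best_motifs = GibbsSampler(dna, k, t, N)
    for _ in range(20):
        motifs = GibbsSampler(dna, k, t, N)
        if Score(motifs) < Score(best_motifs):
            best_motifs = motifs
    # placeholder threshold -- the real value depends on the dataset used
    self.assertTrue(Score(best_motifs) <= 8)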
Closing Notes
This finishes the course for us. As you can see, the programming involved in bioinformatics is very different from what we do as enterprise full-stack developers. In the next blog, I’ll write about my thoughts on the programming style followed, and lessons learnt.
You can read Part 10 here.