Beginner Bioinformatics in Python — Part 5

7 min readAug 27, 2022

This is a series of blog posts about my experience undergoing the beginner level bioinformatics course on Coursera, the problems solved, and the Python Codes generated.This purpose of this blog is for me to document my experiences for my future self, and not to optimize for readability.

If you’ve not read the previous parts, I would recommend starting there.

Link to Part 1
Link to Part 2
Link to Part 3
Link to Part 4

DNA Makes RNA Makes Protein

The process of transcription of DNA to RNA simply involves replacing the letter T with U. So TATACGAAA transcribes into the RNA string UAUACGAAA.

To translate the RNA into Protein, break the RNA into 3-mers called codons. Then use the diagram below to find the corresponding protein. So UAU-ACG-AAA translates to the protein YTK.

Regulatory Proteins

Some of these proteins in plants act as regulatory proteins — that is — they are able to turn certain genes on and off. For example, there are certain proteins in plants whose transcription is affected by sunlight. There are others whose transcription is affected by these proteins whose transcriptions are affected by sunlight.

For example, LHY and CC1 repress TOC1. So as LHY and CC1 increase, TOC1 production starts going down. However TOC1 promotes LHY and CC1 expression. So when TOC1 reduces, it starts the reduction of LHY and CC1, which will in turn start increasing TOC1 production. Additionally, sunlight activates the transcription of LHY and CC1, so this whole cycle also becomes dependent on sunlight. Therefore, LHY and CC1 peak during the day, and TOC1 peaks during the night.

These three genes in plants are able to encode proteins that bind to other genes, turning them on or off. Thus, they are able to regulate circadian rhythm in plants and animals.

Regulatory Motifs

The regulatory proteins described in the last section are also called Transcription Factors. They are able to turn other genes on and off by binding to a short DNA interval called Regulatory Motifs or Transcription Factor Binding Sites, that lie in the gene’s upstream region.

For example, CC1 described in the last section binds to “AAAAAATCT”, and thus regulates the circadian rythm in many of the plant genes.

Motif Finding involves finding such regions regulatory proteins bind to. These regions might not be exact — CC1 could also bind to “AAGAACTCT”.

This leads us to the computational problem we’re trying to solve: given a set of strings (genes) , find a set of characters of length k (motifs) that occurs in all the given strings (of length x), but with some minor variations. Find the best possible k-mer that could represent a motif.

Instead of attempting to explain how the final problem is solved, we’ll look at each small part of the problem that the course introduces, and then finally understand how all the parts come together.

Motif Count

Let’s move to the first problem of Week 3. Don’t think too much about why we’re solving this. We’ll solve this, and similar other problems. Eventually we’ll converge to an algorithm for motif finding. PS: The course does explain why we’re solving these problems along with the algorithm we’re looking for, but the explanations are too detailed (and hard for me to completely grasp!) for me to explain here.

Given a list of string Motifs, find the count of each DNA character at each position for all the strings combined, and represent it as a dictionary.

Given are examples of inputs and outputs.

Input: 
AAGCC
TTTCTOutput: 
{'A': [1,1,0,0,0], 'C': [0,0,0,2,1], 'G': [0,0,1,0,0], 'T': [0,0,0,1,0]}

Another example:

Input: 
AACGTA
CCCGTT
CACCTT
GGATTA
TTCCGGOutput: 
{'A': [1, 2, 1, 0, 0, 2], 'C': [2, 1, 4, 2, 0, 0], 'G': [1, 1, 0, 2, 1, 1], 'T': [1, 1, 0, 1, 4, 2]}

The code looks like this:

def Count(Motifs):
    return {symbol: count_array(symbol, Motifs) for symbol in "ACGT"}

def count_array(symbol, motifs):
   individual_count_arrays = [symbol_count_array_for_single_motif(symbol, motif) for motif in motifs]
   return [sum(elements) for elements in zip(*individual_count_arrays)]

def symbol_count_array_for_single_motif(symbol, motif):
    return [1 if current_symbol == symbol else 0 for current_symbol in motif]

If you want to look at the screenshot —

The most important learning from solving this problem was not about biology. It was to think in terms of data transformations. The course almost gave out the solution, but it was composed of nested iterative loops. To take that solution and convert it into a data transformation problem was the real challenge.

Motif Profile Matrix

We can use the above matrix to find the probability of any character occurring at a given position for the given motifs. For example, the beginning character has 2/3 probability of being T, and 1/3 probability of being A. Similarly, the middle character has 1 (100%) probability of being A. This way, given multiple motifs, we can figure out which characters have the highest probability at a particular position, and use that to predict out our actual Regulatory Motif to which the protein should bind to.

For this assignment, take the output of the previous step, and express the data in each column in the form of probabilities. To find this, we will need the count matrix, and divide each element by the number of motif strings. For example:

Input: 
AGA
TCA
TCAThe Count Matrix:
{'A':[1,0,1], 'T':[2,0,0], 'C':[0,2,0], 'G':[0,1,0]}The Profile Matrix:
{'A':[1/3,0,1], 'T':[2/3,0,0], 'C':[0,2/3,0], 'G':[0,1/3,0]}

Here’s the code for the assignment:

def Profile(Motifs):
    count = Count(Motifs)
    number_of_motifs = len(Motifs)
    return {symbol: motif_profile_for_symbol(count[symbol], number_of_motifs) for symbol in "AGCT"}


def motif_profile_for_symbol(count_array_for_symbol, number_of_motifs):
    return list(map(lambda x: x / number_of_motifs, count_array_for_symbol))

Consensus String

Let’s think back to what we were trying to achieve. Given a set of motifs the the regulatory proteins binding to, we can find the ideal transcription factor binding site by finding the most probable neucleotide at each position. That is because all the motifs are going to be similar, but deviate from the ideal binding site by a few characters. So if we can find the most commonly occurring nucleotide at each position, we can figure out our Regulatory Motif. This is also called the consensus string.

For example, let’s consider the Profile Matrix we generated in the previous example —

Input: 
AGA
TCA
TCAProfile Matrix:
{'A':[1/3,0,1], 'T':[2/3,0,0], 'C':[0,2/3,0], 'G':[0,1/3,0]}

We can see that the nucleotides T, C, and A have the highest probability of occurrence at positions 0, 1, 2 respectively. Therefore, our consensus string would be TCA.

Below is another image from the course explaining the process of finding the consensus.

The course suggests an ugly looking iterative program to find the consensus string, that includes 3 levels of indentation.

But we don’t write programs like that. We want our code to be as concise as possible.

To write this code in a functional manner, we have to choose one of the 2 approaches — Look at the values of profile at each index, choose the maximum, or second — Find the consensus string for any 2 symbols, then combine it to find the consensus string for the third etc.

So here’s what I came up with:

def Consensus(Motifs):
    count_matrix = Count(Motifs)
    symbols = "ACGT"
    string_length = len(Motifs[0])
    consensus_list = [consensus_symbol_at_index(count_matrix, i, symbols) for i in range(string_length)]
    return ''.join(consensus_list)

def consensus_symbol_at_index(count_matrix, index, symbols):
    count_to_symbols_tuple = {count_matrix[symbol][index]: symbol for symbol in symbols}
    max_count = max(count_to_symbols_tuple.keys())
    return count_to_symbols_tuple[max_count]

While it looks similar in length to the code above, notice that the core of the logic is just 2–3 lines. The remaining is broken into variable declarations to make things easier to understand.

But if we really want to go for concise code, we can make it a lot shorter by inlining the variables.

Consensus String Score

The consensus string score tells us how much the consensus string is deviating from the motifs that we have. If the motifs are similar to each other, the consensus string will have a very low score, indicating a higher quality solution. If the motifs are very different from each other, the resultant consensus string will have a high score, indicating low confidence. The consensus string score is denoted by Score(Motifs) in the diagram above in the Consensus section.

To write the algorithm for consensus string score, we used the Hamming Distance we had written a few chapters back, find the hamming distance between the consensus string and each motif, and then sum the hamming distances.

def HammingDistance(p, q):
    return sum(list(map(lambda x, y: int(x != y), list(p), list(q))))

def Score(Motifs):
    return sum(map(lambda motif: HammingDistance(motif, Consensus(Motifs)), Motifs))

Closing

Let’s close this blog with a summary of where we’ve reached in the course. At this point, we’ve covered 2/3rds of Week 3 of the course, with one more week to go. Week 4 is the most difficult, and the quality of explanations, diagrams, etc everything becomes worse.

However, the good thing is that the complexity of algorithms increases significantly. Therefore, the remaining parts of the blog will cover more complex algorithms, made more difficult by our insistence that we write pythonic, declarative, and functional code.

You can read Part 6 here.

Beginner Bioinformatics in Python — Part 5

Written by Abhinav