Beginner Bioinformatics in Python — Part 3

6 min readApr 28, 2022

This is a series of blog posts about my experience undergoing the beginner level bioinformatics course on Coursera, the problems solved, and the Python Codes generated.This purpose of this blog is for me to document my experiences for my future self, and not to optimize for readability.

All the code can be found here.

If you’ve not read the previous parts, I would recommend starting there.

Link to Part 1.
Link to Part 2.

In the week 2 of the course, the biological concepts became more complex. Infact, the course itself taught a simplified version of the biology involved in replication of the DNA.

I cannot summarise the biology in such a short blog, and neither do I fully understand it. So I’ll focus on the programming part of it, with a brief intro to what we’re solving for.

Assymetric Replication of the DNA, Forward and Reverse Half Strands

Think of the Genome like a ring. On one end is the replication origin (ori), and on the other is the replication terminus (ter). The genome has its complement attached to it, so A is complemented by T, C by G.

When the 2 rings (the genome and its complement) separate and start replicating, the unidirectionality of replication results in asymmetric replication between a strand and its complement, and the result is mutation. I will not go into details here, but you can read about the words Okazaki fragments, deamination, and asymmetric replication of genome.

The result of this whole process is that the forward half strand (one half of the ring from ori to ter) will have the least number of C nucleotides, and the reverse half strand (the other half of the ring from ori to ter) will have the most number of C nucleotides. Once we know the positions of the forward and reverse half strand, we can infer the position of ori from that information, since the strands meet at the ori and ter.

Problem 1: Find Ori and Ter From Statistics of the C Nucleotide in Forward and Reverse Half Strands

Given a string representing a genome and a single character symbol, create a symbol array. The i-th index of symbol array represents the number of occurrences of the given symbol in half of the string starting at that index. Also, assume that the array is circular (the last index is followed by the first index).

For example, here is the symbol array for the genome “CACGCC”, and the symbol “C”. The characters in bold are given just to symbolise that after the character at index 5, we have to assume that the array is circular, and is followed by the symbol at index 0. For example, the element at index 4 of the array has a value 3, because the characters at indices 4,5,0 of the genome are all ‘C’.

Symbol C A C G C C C A C
Index  0 1 2 3 4 5
Array  2 1 2 2 3 2

The imperative solution is simple —

def SymbolArray(Genome, symbol):
    array = {}
    n = len(Genome)
    ExtendedGenome = Genome + Genome[0:n//2]
    for i in range(n):
        array[i] = PatternCount(ExtendedGenome[i:i+(n//2)], symbol)
    return array

But we want a more functional looking solution, without mutating every element of the array. So we conclude on a solution that is not particularly beautiful, but does the job —

def SymbolArray(genome, symbol):
    return {i: half_string_symbol_count(genome, symbol, i) for i in range(0, len(genome))}def half_string_symbol_count(text, symbol, index):
    length = len(text)
    extended_text = text + text[0:length // 2]
    return PatternCount(extended_text[index:index+(length // 2)], symbol)

PS: There are places where method names are in snake_case, and some which have method names in Pascal_Case. All internal methods named by me are in snake_case, and the external ones given by the course are in PascalCase. Wherever the method name given by the course is not understandable, I create my own method, and call it from the method provided by the course to pass the assignment.

While this solution does the job, it is O(n²). Later, in the course, we create an O(n) solution by using the count of the previous index, and changing it by checking both ends of the half-array.

Here’s the O(n) solution with imperative code —

Out of all the problems I had attempted till this step, this was the toughest to write in a functional manner. The end solution is also not good looking.

Here’s the screenshot —

Here’s the actual code —

def FasterSymbolArray(genome, symbol):
    initial_value = half_string_symbol_count(genome, symbol, 0)
    genome_length = len(genome)
    extended_genome = genome + genome[0:genome_length // 2]
    indices = list(range(1, genome_length))
    constructed_list = list(accumulate(indices,
               lambda result, index: update_half_array_symbol_count(extended_genome, index,
                                                                    genome_length, result, symbol), initial=initial_value))
    return dict(enumerate(constructed_list))

def update_half_array_symbol_count(extended_genome, index, original_genome_length, previous_value,
                                   symbol):
    return previous_value \
           - int(extended_genome[index - 1] == symbol) \
           + int(extended_genome[index + original_genome_length // 2 - 1] == symbol)

Skew Array and Skew Diagram

On the forward half strand, there are more Gs than Cs, whereas on the reverse half strand there are more Cs than Gs. To find ori we can keep traversing the DNA, and keep a count of the total difference in the counts of G and C till that point. If the difference is increasing, we’re on the forward half strand, and if the difference starts decreasing, we’re on the reverse half strand. The point at which the difference between G and C count stops decreasing and starts increasing is the ori, and the other end is the ter.

Here is the example from the course —

Input: “CATGGGCATCGGCCATACGCC”

Array:

If we plot it on the graph, it is called a Skew Diagram. Here is a skew diagram of a genome —

The point at which the skew is minimum is our ori.

Problem 2: Create a Skew Array and Find Minimum Skew

Given a genome, create the skew array.

From the skew array, create another array of all the positions at which the skew diagram attains a minimum.

Skew Array:

Input: CATGGGCATCGGCCATACGCC
Output: 0 -1 -1 -1 0 1 2 1 1 1 0 1 2 1 0 0 0 0 -1 0 -1 -2

Minimum Skew:

Input: TAAAGACTGCCGAGAGGCCAACACGAGTGCTAGAACGAGGGGCGTAAACGCGGGTCCGAT
Output: 11 24

Here is the code —

def SkewArray(Genome):
    nucleotide_to_value = {"A":0, "T":0, "G":1, "C":-1}
    array = list(accumulate(Genome, lambda result, nucleotide: result + nucleotide_to_value[nucleotide], initial = 0))
    return array

def MinimumSkew(Genome):
    skew_array = SkewArray(Genome)
    minimum_skew = min(skew_array)
    return [index for index in range(len(skew_array)) if skew_array[index] == minimum_skew]

Or the screenshot —

What Next?

After all this effort, we can finally close the chapter of finding the ori. But there are lots of other problems to solve. For example, we’ve still not found the DnaA box. Week 2 of the course was only half done at this point, and those problems will follow.

You can read Part 4 here.

Beginner Bioinformatics in Python — Part 3

Written by Abhinav

No responses yet