Beginner Bioinformatics in Python — Part 2

Around 4 years ago I got some insight into the role data science of genomes could play in our future, and signed up for a coursera course called Bioinformatics for Beginners. 4 years later, I have finally started taking it seriously. In this blog, I try to detail out the problems that are solved in the course, and the type of code I wrote.

All the code can be found here.

If you’ve not read Part 1, I would recommending starting there.

In this part of the blog, I cover the programs developed as a part of week 1 of the course. Most of them were very simple, and the real complexity started only in the last couple of weeks. However, it definitely opened my eyes to biology in a new way. I realised that our body has billions of data points in the form of genes, and there is more unknown and more analysis that needs to be done in our genetic makeup than there is in many industries put together.

Here is a sampling of the theory and the kind of problems we solved.

Finding Replication Origin or Ori

The first part of the course focused on finding the replication origin, or ori. The ori is the place where the DNA replication starts. This is important to know for biomedical reasons. This information is also important while performing gene therapy, which is something that is regularly used in agriculture, and has also been used to cure certain auto-immune diseases in people.

When gene therapy happens, viral vectors (viruses carrying the copies of good gene) are injected in the cell. These genes go inside the cell and start replicating, thus providing the crucial protein missing in the patient. While injecting the viral vector, it is important to know where the ori is so that the manipulation does not affect the functionality of the ori.

Repetition of Patterns

For certain biological processes such as replication, specific nucleotide strings occur frequently within small regions of the genome. This because certain proteins can only bind to DNA if a certain string of nucleotides is present, and if there are more copies of the string, the chances of binding increase, and the chances of mutation disrupting the binding process decrease.

Problem 1: Pattern Count

We started off with a very simple warmup exercise. The first problem we had to solve in the course was pattern count — write a program that counts the number of occurrences of a certain pattern in a given text. For example, in “CACCGTCACAC”, the pattern “CAC” occurs 3 times.

This is to solve the problem of finding frequently occurring nucleotide combinations in the DNA, and thus discern if they have any role in the replication process.

This is a level 1 programming contest style problem, and the writing imperative solution was just a few minutes job. However I wanted to write a more functional style with immutable variables.

I wanted to use concise and functional syntax to solve the problem, so I came up with the following two implementations, both of which I’m not completely satisfied with. But I was okay with it because conciseness and functional style was my priority over anything else. I preferred the latter because it was more pythonic in style.

The first one —

def count_occurence(text, pattern):
return len(list(filter(lambda i: text[i:i + len(pattern)] == pattern, range(len(text) - len(pattern) + 1))))

And this one, which is more pythonic in style —

def count_occurence(text, pattern):
return sum(1 for i in range(len(text) - len(pattern) + 1) if text[i:i+len(pattern)] == pattern)

Problem 2: Most Frequent Words of Length K

A substring of length k in a given text is called a k-mer of the text.

Given a text and a number k, in this problem, we had to filter out the most frequently occurring words (or kmers or substrings) of a given length k. Here are sample input and output.

Input
Text: ACGTTGCATGTCGCATGATGCATGAGAGCT
Length: 4
Output
CATG GCAT

The course broke this into 2 assignments, creating a substring frequency map, and then using that map to find the most commonly occuring words.

This is again an easy problem to solve using imperative programming, as you can see below.

def FrequencyMap(Text, k):
freq = {}
n = len(Text)
for i in range(n-k+1):
Pattern = Text[i:i+k]
freq[Pattern] = 0
for i in range(n-k+1):
Pattern = Text[i:i+k]
freq[Pattern]+=1
return freq
def FrequentWords(Text, k):
words = []
freq = FrequencyMap(Text, k)
m = max(freq.values())
for key in freq:
if(freq[key]==m):
words.append(key)
return words

The code keeps getting lengthier and lengthier with imperative programming. Could we do something more functional in style?

A good quality functional programming solution turned out to be a bit of a challenge in this problem, perhaps because of my lack of experience. I was tempted by multiple different approaches. Then I thought back to the 3 criteria I had initially come up with:

- No Mutation
- Minimalistic Code
- Uses interesting features from Python

Here is the solution:


def combine_dictionary_counts(a, b):
return dict(list(a.items()) + list(b.items()) + [(k, a[k] + b[k]) for k in set(b) & set(a)])

def frequency_map_for_kmers(text, length):
return reduce(combine_dictionary_counts, [{text[i:i + length]: 1} for i in range(len(text) - length + 1)])
def most_frequent_substrings(text, length):
frequency_map = frequency_map_for_kmers(text, length)
max_frequency = max(frequency for kmer, frequency in frequency_map.items())
return [kmer for kmer, frequency in frequency_map.items() if frequency == max_frequency]

Since medium is not a great platform to share code, it is not apparent how concise the method to find the most frequent substring becomes. Here is a screenshot of the code to show how small it has become —

This code is not the easiest to understand, and I am okay with it. My goal in this whole experiment is not to optimise for readability. Conciseness over readability.

Problem 3: Reverse Complement

The nucleotides A and T are complements of each other, and so are C and G. Therefore, a nucleotide A would be paired with T, and G would be paired with C.

If we have a strand, say AG, then what would be its complement? The answer is a little surprising. It is not TC as you would expect. It is infact CT. This is because the strand and its paired complement run in the opposite direction. To find a complement of a strand, we need to find the complement of each letters, and then reverse the direction.

AG -> TC -> CT
GACGTAT -> CTGCATA -> ATACGTC

Here’s the program for it —

def ReverseComplement(Pattern):
return reverse(dna_complement(Pattern))

def dna_complement(pattern):
return ''.join(map(character_dna_complement, pattern))

def reverse(s):
return s if len(s) == 0 else reverse(s[1:]) + s[0]

def character_dna_complement(character):
return {"A": "T", "T": "A", "G": "C", "C": "G"}[character]

This is just a small step towards solving puzzles that govern our DNA. In the remaining weeks of the course, the problems get more complex, and the puzzles about our DNA more specific.

Realisations so far

  1. Biology is a lot more fun when it requires the person learning to think about it, form opinions, and solve problems.
  2. It is nice to see DS and A type problems occur in biology on such a regular basis. Proves that there are hard programming problems to solve everywhere, you just have to look for them.
  3. Seeing such concise code in Python was a pleasant change from Java, where it is difficult to reduce the amount of code beyond a certain point.
  4. Test First Approach helps, even if one is not doing TDD. This is true even for these tiny problems that we solved. Tests allowed me to play around with different paradigms and solutions.

What Next?

Fortunately, the problems get harder in Week 2. It is more challenging than Week 1, and Week 3 is even better.

To find the Ori, it is not sufficient to find the most frequently occurring k-mers or the reverse complement. There are further patterns to how nucleotides occur in the ori. The next blog will take us through more complex data transformations to solve these problems. On my side, I’ll post more examples of what I learnt trying to implement functional python, and how unit tests looked like for non-deterministic methods that dealt with optimisations.

You can read Part 3 here.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store