Python Code for the Beginner Bioinformatics Course — Part 1
I find the topic of genomes, and the programming involved in it, fascinating. Coursera offers a reasonably good beginner-level course that introduces this topic, called Bioinformatics for Beginners. I originally enrolled in this course 4 years ago, but started working on it seriously only this year. This series of blog posts covers the Python programs I wrote and the programming paradigms I used as part of the course. It includes minimal theory on the biological side, just enough for the reader to connect the dots.
Background
DNA consists of a series of nucleotides: A, T, C, and G. A and T are complements of each other, and C and G are complements of each other. So a strand of DNA made up of letters such as “AGCAATGC” would be attached to another strand built of its complements, “TCGTTACG”.
When replication happens, these 2 strands separate, and each one builds its own complementary strand. Thus we now get 2 pairs of strands, each a copy of the original pair.
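The pairing rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not code from the course; the names `COMPLEMENT` and `complement` are made up for this example.

```python
# Pairing rule from the text: A <-> T and C <-> G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the strand that pairs with `strand`, position by position."""
    return "".join(COMPLEMENT[nucleotide] for nucleotide in strand)


print(complement("AGCAATGC"))  # TCGTTACG
```

Applying `complement` twice gives back the original strand, which mirrors what happens when the two separated strands each rebuild their partner.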
Since DNA is itself a large collection of data, understanding it becomes an interesting problem to solve, one that draws on both mathematics and programming.
The course starts with very simple programs and creates rudimentary solutions to problems. With time, the problems get a lot more complex.
Course Programming Paradigms
Language: The course required the programming problems to be solved in Python. It used certain conventions that differ from PEP 8, such as function names starting with capitals. The course prioritised bioinformatics conventions over Python conventions.
Paradigm: The course uses procedural programming. This makes sense in the beginning because it keeps things simple. As the data transformations become complex, though, procedural programs become convoluted, with loads of nesting and conditions. The result looks very similar to the coding style of graduates fresh out of college.
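To make the naming and paradigm points concrete, here is a small sketch. Both functions are hypothetical examples written for this post's argument, not copied from the course material. The first uses a capitalised name and a procedural loop, in the spirit of the conventions described above. The second uses a PEP 8 name and a single expression that produces the same result.

```python
def PatternCount(text, pattern):
    # Procedural style: a counter that is updated inside a loop.
    count = 0
    for i in range(len(text) - len(pattern) + 1):
        if text[i:i + len(pattern)] == pattern:
            count = count + 1
    return count


def pattern_count(text: str, pattern: str) -> int:
    # Declarative style: sum over every window that equals the pattern.
    # Each comparison yields True or False, which sum() counts as 1 or 0.
    return sum(
        text[i:i + len(pattern)] == pattern
        for i in range(len(text) - len(pattern) + 1)
    )


print(PatternCount("GCGCG", "GCG"))   # 2
print(pattern_count("GCGCG", "GCG"))  # 2
```

For a function this small the difference is mostly cosmetic. It starts to matter once several such transformations are chained together.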
Programming Paradigms Used by Me
I wanted to use this chance to experiment with paradigms, and also learn a bit of Python. After a while, I settled on the following rules, which would force me to leave my comfortable object-oriented style of programming and try something new:
No Classes or Types: I wanted to move away from my standard way of working, and see how data transformations work without types.
Minimalistic Code: I prioritised writing less code over almost anything else, including readability. Since there was so much data transformation happening, having concise code made it easy to understand the overall picture.
Immutability: Most of the work I’ve done in my career has used imperative programming. Even where I’ve used functional programming, the complexities often lay in other aspects of the software, not in the functional programming part. This course was a chance to work with FP in a domain dominated by data analysis and transformation.
Tests: I wrote tests for nearly all the functions that I wrote during the course. Some of these tests felt more like integration tests than the tiny, concise unit tests that I write in my work life. Some of the functions were based on optimisation models that included random numbers and were non-deterministic. I don’t know how actual data scientists write tests for these, but I ran these functions thousands of times in the tests and added some probability-based checks.
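As an illustration of the last rule, a probability-based check for a non-deterministic function could look roughly like this. It is a minimal sketch: `random_nucleotide` is a hypothetical stand-in for a randomised function, and the number of runs and the tolerance are assumptions chosen for the example, not values from my actual tests.

```python
import random
from collections import Counter


def random_nucleotide(rng: random.Random) -> str:
    """Hypothetical non-deterministic function: pick one nucleotide uniformly."""
    return rng.choice("ACGT")


def test_random_nucleotide_is_roughly_uniform() -> None:
    rng = random.Random()  # unseeded on purpose, so every run differs
    runs = 10_000
    counts = Counter(random_nucleotide(rng) for _ in range(runs))
    # Each nucleotide should appear about 25% of the time. The tolerance is
    # wide (20% to 30%) so that a correct implementation almost never fails.
    for nucleotide in "ACGT":
        assert 0.20 < counts[nucleotide] / runs < 0.30


test_random_nucleotide_is_roughly_uniform()
```

The trade-off is between the number of runs and the width of the tolerance: more runs allow a tighter check, at the cost of a slower test.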
What Did I Learn?
Overall, not only did the course teach me something about bioinformatics, it also forced me to realise a few important things about clean code.
The biggest was that immutability, functional programming, and declarative syntax can help reduce code complexity and bloat by leaps and bounds, and are perhaps far more important to teach than all the other clean coding practices that I teach.
The other was that procedural, imperative programming is horrible, ugly, and difficult to maintain, and that a significant number of clean coding practices are really about making the code more declarative in nature, even if it still uses loops and conditions.
The last is that dealing with algorithms and data is fun! In my line of work, the complexity often lies in connecting various systems and domains to each other, not in the algorithms. So this was a pleasant change from the usual routine.
What Will These Blogs Cover?
In the following parts of this series, I will discuss the problems that the course asked us to solve, including a brief overview of the biology behind them. The posts focus more on the programming aspect of the problems than on the biology.
You can read Part 2 here.