Beginner Bioinformatics in Python — Part 1
I find genomes, and the programming involved in studying them, fascinating. Coursera offers a reasonably good beginner-level course on the topic called Bioinformatics for Beginners. I originally enrolled in it 4 years ago, but only started working on it seriously this year. This series of blog posts covers the Python programs I wrote and the programming paradigms I used as part of the course. It includes just enough of the biology for the reader to connect the dots.
DNA consists of a series of nucleotides: A, T, C and G. A and T are complements of each other, as are C and G. So a strand of DNA made up of letters such as “AGCAATGC” would be attached to another strand built of its complements: “TCGTTACG”.
When replication happens, these 2 strands separate, and each one builds its own complement strand. Thus we now get 2 copies of the original pair.
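The pairing rule above can be sketched in a few lines of Python. This is only an illustration: the names Complement and Replicate are hypothetical (capitalised to mimic the course's naming style), and the sketch ignores strand direction, which this post does not go into.

```python
# Pairing rule from the text: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def Complement(strand):
    """Return the strand built from the complement of each nucleotide."""
    return "".join(COMPLEMENT[nucleotide] for nucleotide in strand)

def Replicate(pair):
    """Separate a (strand, complement) pair and let each strand rebuild its partner."""
    first, second = pair
    return (first, Complement(first)), (Complement(second), second)

original = ("AGCAATGC", Complement("AGCAATGC"))
print(original)  # ('AGCAATGC', 'TCGTTACG')

copy_one, copy_two = Replicate(original)
print(copy_one == original and copy_two == original)  # True
```

Both copies come out equal to the original pair, which is the whole point of replication.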
Since DNA is itself a large collection of data, understanding it becomes an interesting problem to solve, one that draws on both mathematics and programming.
The course starts with very simple programs that create rudimentary solutions to problems. With time, the problems get a lot more complex.
Course Programming Paradigms
Language: The course required the programming problems to be solved in Python. It used certain conventions which were different from PEP8, such as method names starting with capitals. The course prioritised bioinformatics conventions over Python conventions.
Paradigm: The course uses procedural programming. This makes sense in the beginning because it keeps things simple. As the data transformations become more complex, though, procedural programs become convoluted, with loads of nesting and conditions. It seems very similar to the coding style used by graduates fresh out of college.
Programming Paradigms Used by Me
I wanted to use this chance to experiment with paradigms, and also learn a bit of Python. After a while, I came up with the following rules, which would force me to leave my comfortable object-oriented style of programming and try something new:
No Classes or Types: I wanted to move away from my standard way of working, and see how data transformations work without types.
Minimalistic Code: I prioritised writing less code over almost anything else, including readability. Since there was so much data transformation happening, having concise code made it easy to understand the overall picture.
Immutability: Most of the work I’ve done in my career has used imperative programming. Even where I’ve used functional programming, the complexities often lay in other aspects of the software, not in the functional programming part. This was a chance to practise functional programming (FP) in a domain dominated by data analysis and transformation.
Tests: I wrote tests for nearly all the functions that I wrote during the course. Some of these tests felt more like integration tests than the tiny, concise unit tests that I write in my work life. Some of the functions were based on optimisation models that included random numbers and were non-deterministic. I don’t know how actual data scientists write tests for these, but I ran these functions thousands of times in the tests and added some probability-based checks.
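To give a feel for what a probability-based check can look like, here is a minimal sketch. It is not code from the course: WeightedPick and ObservedFrequencies are hypothetical helpers, and the number of runs and the tolerance are arbitrary choices for illustration. It sticks to the rules above: no classes, no mutation of inputs.

```python
import random
from collections import Counter

def WeightedPick(probabilities):
    """Pick one key at random, weighted by its probability (hypothetical helper)."""
    keys = list(probabilities)
    weights = [probabilities[key] for key in keys]
    return random.choices(keys, weights=weights, k=1)[0]

def ObservedFrequencies(probabilities, runs):
    """Run WeightedPick many times and return the fraction of picks per key."""
    counts = Counter(WeightedPick(probabilities) for _ in range(runs))
    return {key: counts[key] / runs for key in probabilities}

def test_weighted_pick_matches_probabilities():
    probabilities = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
    observed = ObservedFrequencies(probabilities, runs=20_000)
    # The tolerance is an assumption: wide enough that a correct
    # implementation should practically never fail, narrow enough
    # to catch a badly skewed one.
    assert all(abs(observed[key] - probabilities[key]) < 0.03
               for key in probabilities)
```

The test never asserts on a single random result. It only asserts that, over many runs, the observed frequencies land close to the expected ones.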
What Did I Learn?
Overall, not only did the course teach me something about bioinformatics, it forced me to realise a few important things about clean code.
The biggest was that immutability, functional programming, and declarative syntax can help reduce code complexity and bloat by leaps and bounds. While these practices are not in violation of clean coding practices, they are different, and need to be learnt independently.
The other was that procedural, imperative programming is easy to write once, but it increases both the number of lines of code and the code complexity. This is visible even when you’re building apps that are domain intensive, but in programs like these, where you’re doing hundreds of data transformations, the difference is far more pronounced.
In the last post of this series, I talk about these lessons in more detail, and how you could adopt these practices in your daily work.
One final lesson I learnt was how interesting the study of genetics is. There are billions of data points on how genes influence characteristics, and how they get activated and deactivated. The applications are endless, and over the next couple of decades, we’re going to learn a lot more about what this field can be applied to.
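As a small illustration of the difference, here is the same task written both ways: counting how often a pattern occurs in a piece of text, overlaps included. The function names are hypothetical and the code is a sketch, not the course's reference solution.

```python
def PatternCountImperative(text, pattern):
    """Procedural version: a mutable counter, an explicit loop and a condition."""
    count = 0
    for i in range(len(text) - len(pattern) + 1):
        if text[i:i + len(pattern)] == pattern:
            count = count + 1
    return count

def PatternCount(text, pattern):
    """Declarative version: sum the matches over every window of the text."""
    return sum(text[i:i + len(pattern)] == pattern
               for i in range(len(text) - len(pattern) + 1))

print(PatternCountImperative("GCGCG", "GCG"))  # 2
print(PatternCount("GCGCG", "GCG"))            # 2
```

On a function this small the saving is only a few lines, but the second version has no state to track, and that starts to matter once transformations are chained together.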
What will these blogs cover?
In the following parts of this series, I will discuss the problems that the course asked us to solve, including a brief overview of the biology behind them. The posts focus more on the programming aspect of the problems than the biology aspect.
You can read Part 2 here.