Beginner Bioinformatics in Python — Part 9

Input: 
{
"GCGT": 0.1,
"CGTT": 0.15,
"GTTA": 0.25
}
Output:
{
"GCGT": 0.2,
"CGTT": 0.3,
"GTTA": 0.5
}
def Normalize(Probabilities):
sum_of_probabilities = sum(Probabilities.values())
return {k: v/sum_of_probabilities for k, v in Probabilities.items()}
Input:
{
"GCGT": 0.2,
"CGTT": 0.3,
"GTTA": 0.5
}
Output:
One out of the 3 k-mers mentioned above, but with the probability mentioned against them. For example, out of every 100 tries, GTTA should be chosen roughly 50 times, and CGTT roughly 30 times.
def WeightedDie(Probabilities):
key_to_range = key_vs_range(Probabilities)
random_fraction = random.uniform(0, 1)
return next(key for key, value in key_to_range.items() if value['lower'] < random_fraction <= value['upper'])

def key_vs_range(Probabilities):
key_list = Probabilities.keys()
value_list = map(lambda key: Probabilities[key], key_list)
upper_bound_list = reduce(lambda result, element: result + [result[-1] + element], value_list, [0])[1:]
lower_bound_list = [0] + upper_bound_list[:-1]
lower_and_upper_bounds = list(
map(lambda lower, upper: {'lower': lower, 'upper': upper}, lower_bound_list, upper_bound_list))
return dict(zip(key_list, lower_and_upper_bounds))
def test_weighted_die_throw(self):
output_count = {'x': 0, 'y': 0, 'abc': 0}
for i in range(10000):
current_output = WeightedDie({'x': 0.33, 'y': 0.5, 'abc': .17})
output_count[current_output] = output_count[current_output] + 1
self.assertTrue(4400 < output_count['y'] < 5500)
self.assertTrue(2800 < output_count['x'] < 3800)
self.assertTrue(1200 < output_count['abc'] < 2200)
Input: 
AAACCCAAACCC
{'A': [0.5, 0.1], 'C': [0.3, 0.2], 'G': [0.2, 0.4], 'T': [0.0, 0.3]}
2
Probabilities:
{'AA': 0.05, 'AC': 0.1, 'CC': 0.06, 'CA': 0.03}
Weighted Probabilities:
{'AA': 0.208, 'AC': 0.417, 'CC': 0.250, 'CA': 0.125}
Output:
Any of the k-mers above according to their weighted probabilities
def ProfileGeneratedString(Text, profile, k):
text_length = len(Text)
probabilities = {Text[index:index+k]: Pr(Text[index:index+k], profile) for index in range(text_length - k + 1)}
weighted_probabilities = Normalize(probabilities)
return WeightedDie(weighted_probabilities)
  1. Start with randomly selected k-mers (one in each DNA string) just like we did in RandomizedMotifSearch.
  2. Randomly pick an index i between 0 and n-1. We will replace the string at this index.
  3. Temporarily remove the string at index i.
  4. After temporarily removing the string at index i, create the profile matrix with the remaining k-mers.
  5. Create a weighted die from Dna[i] and the Profile Matrix.
  6. Choose the new k-mer at index i by rolling the weighted die.
  7. Repeat the process N times, where N is a sufficiently large integer that we input (for example 100).
Initial Random Motifs 
ttACCTtaac
gaTGTctgtc
ccgGCGTtag
cactaACGAg
cgtcagAGGT
The selected motifs are marked in bold. The actual correct answer (to which we want to reach) is marked in CAPITALS. In the random motifs we picked above, we've picked 3 motifs correctly, and only 2 incorrectly - at index 1 and 3 (0 based indices). Select a random indexi = 3Remove the entry at index ittACCTtaac ttACCTtaac
gaTGTctgtc gaTGTctgtc
ccgGCGTtag -----> ccgGCGTtag
cactaACGAg ----------
cgtcagAGGT cgtcagAGGT
Calculate the Profile Matrix for the remaining Motifs with pseudocountMotifs:
ACCT
GTct
GCGT
AGGT
Profile Matrix:
A: 3/8, 1/8, 1/8, 1/8
C: 1/8, 3/8, 3/8, 1/8
G: 3/8, 2/8, 3/8, 1/8
T: 1/8, 2/8, 1/8, 5/8
Compute Probabilities for all k-mers: {
cact: 15/8⁴,
acta: 9/8⁴,
ctaA: 2/8⁴,
taAC: 1/8⁴,
aACG: 9/8⁴,
ACGA: 27/8⁴,
CGAg: 2/8⁴
}
Normalize and choose one using a weighted die:
Let's say the die chose the most probable one - ACGA
Update the K-mers:
ttACCTtaac
gaTGTctgtc
ccgGCGTtag
cactaACGAg
cgtcagAGGT
def generate_new_motifs(dna_strings, kmer_length, current_motifs, index):
temp_motifs = current_motifs[:index] + current_motifs[index+1:]
profile_matrix = ProfileWithPseudocounts(temp_motifs)
new_motif = ProfileGeneratedString(dna_strings[index], profile_matrix, kmer_length)
return current_motifs[:index] + [new_motif] + current_motifs[index+1:]


def GibbsSampler(Dna, k, t, N):
random_indices = [random.randint(0, t-1) for i in range(N)]
initial_value = RandomMotifs(Dna, k, t)
output = list(reduce(
lambda current_motifs, index : generate_new_motifs(Dna, k, current_motifs, index),
random_indices,
initial_value))
return output

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store