Here you can get back to the main page of the course.

Fundamental Algorithms for Bioinformatics (module 2): Exam preparation (spring term 2016/2017)

Note for the exam: Please arrive a bit ahead of time and make sure you have identification (with photo) with you. You are not allowed to use any material or notes, and can only use the paper provided in the exam.

The following table contains an overview the contents of the course, subdivided by areas. This is followed by typical example questions, split between more theoretical and more practical questions. Disclaimer: No guarantee of completeness! Of course you need to know all that we did in class!

Study list for students of the Masters in Medical Bioinformatics

Note! For the study list for students of the Masters in Molecular and Medical Biotechnology, see here.

  Pairwise sequence alignment
Formal Topics: alphabet, strings, prefixes, suffixes, substrings, subsequences; matrix, factorial, binomial coefficient, logarithms, sums; number of substrings, subsequences, of strings of length n; reverse complement of DNA strings; genetic code.

Theoretical: How many substrings / subsequences does a string of length n have? Give examples of typical alphabets.

Practical: Compute the reverse complement of a given string, compute the resulting protein, given the DNA or RNA sequence (genetic code supplied).

Global and local alignment Topics: Definition of alignment, number of alignments, scoring functions, optimal alignments, similarity; exhaustive search algorithm; Needleman-Wunsch DP-algorithm for global alignment; Smith-Waterman algorithm for local alignment; space-saving variant of N-W algo

Theoretical: What is the difference between global and local alignment, when are they applied? Give the definition of the cell D(i,j) for global and of the cell L(i,j) for local alignment (as opposed to how they are computed). Give the recursion for the computation. Give the running times and space requirements of these algorithms. Why are these preferable to the exhaustive search algorithm? Recursive formula for number of alignments. Explain space-saving variant of N-W algorithm.

Practical: Compute the score of a given alignment, for given scoring function; compute sim(s,t) using the DP algorithm of Needleman-Wunsch; compute an optimal alignment/all optimal alignments using the DP-table. Connection between the table and alignments: write down the alignment represented by a path in the table and vice versa; find all optimal local alignments using the DP-algorithm of Smith-Waterman.

Specialized pairwise alignment algorithms Topics: Semiglobal alignment (different variants). Algorithm for affine gap functions. Optimal alignment in linear space. The forward-backward technique.

Theoretical: Which version would you choose for a given problem (e.g. for computing overlaps, deciding whether t is substring of s, ...). Running times of algorithms. Be able to explain the algorithms. Explain the recursion of the affine gap penalty algorithm, why do we need 3 matrices, what is the definition of the cells, what is the final result?

Practical: Apply the algorithms to concrete examples (small examples).

String distance measures

Topics: Definition of a metric. Edit distance (unit cost and general cost), Hamming distance, LCS-distance, q-gram distance.

Theoretical: What is a metric? Define unit cost edit distance. How do we have to choose the scoring function in order to have a parallel between edit distance and alignment? Explain and analyse the DP algorithm for edit distance. Define LCS distance and Hamming distance. Define the q-gram distance. How is it computed? Which of these is a metric? Prove it.

Practical: Compute the edit distance between two strings, using the DP-algorithm. Give the series of edit operations corresponding to an alignment and vice versa. Compute the Hamming distance and the LCS distance for two strings. Compute the q-gram distance between two strings. Give two distinct strings with q-gram distance 0.

  Multiple sequence alignment
Definitions, DP-algorithm

Topics: Definition of MSA, projections (=induced alignments), SP-score. Theorem about optimal MSA and optimal projections. Its use for upper and lower bounds.

Theoretical: What are induced alignments (=projections)? What is the SP-score? Explain and analyse the DP algorithm. Is it feasible? Complexity status of MSA with SP-score.

Practical: Score a multiple alignment with the SP-score. Give the induced alignments of an MSA.

Algorithms for MSA

Topics: Carillo-Lipman search space reduction. Star alignment. Progressive alignment.

Theoretical: Which algorithms do you know? Are they heuristics or approximation algorithms, are they exact? What is the difference between a heuristic and a running time heuristic? And an approximation algorithm?

Practical: Apply the star alignment algorithm to a small example. Apply the progressive alignment algorithm to a small example (say which alignments we would do in what order, given a phylogenetic tree). Align two alignments (very small example, or just a few entries). Compute the Carillo-Lipman bounds for a given example.

  Sequence analysis in practice
Scoring matrices

Topics: Motivation for different match and mismatch scores. PAM scoring matrices. Computation of PAM scoring matrices. BLOSUM matrices (sketched only).

Theoretical: Explain how the PAM scoring matrices are computed. Explain the biological motivation. What is the underlying idea? What data were used? What does the number k mean in PAMk? What is the difference between the probability matrix and the scoring matrix? What do the entries represent? Interpret their values. Why do we use a "log-odds" matrix? What are the main differences between PAM and BLOSUM matrices?

Practical: Use a given PAMk or BLOSUM-k matrix (supplied) to score an alignment.

BLAST

Topics: BLAST.

Theoretical: What is a heuristic? Explain the underlying ideas of BLAST. What is the advantage over the DP-algorithms? What is the primary application? Why are heuristics used and not the DP-algorithms?

Practical: Given a small example (a query and a db sequence), explain how BLAST works on the example: list some high-scoring words, show where there is a hit, show how to extend it.

  RNA folding
Problem, complexity, and algorithms

Topics: RNA folding problem (=RNA secondary structure prediction). Nussinov algorithm. Folding structures, crossings, crossing-freeness. BPS-model (base pair stacking). Pseudoknots.

Questions: Explain the problem, give the different folding types. What are crossings? Explain the Nussinov algorithm. Complexity of the algorithm. When is it applicable, when not? What are pseudoknots? Give the definition of BPS. (What is the complexity of the BPS-scoring model with pseudoknots?)

  Sequence assembly algorithms
Sequencing with overlap graphs

Topics: Overlap graphs. SCS. Greedy algorithm.

Theoretical: Define the SCS problem. Why do we assume substring-freeness? Why does this not change the problem? Complexity status of SCS problem. What is the difference between maximizing compression and minimizing length? Are they equivalent? (yes w.r.t. to optimal solution, no w.r.t. to approximation; explain) What is the approximation ratio of the Greedy algorithm? Explain and analyse the greedy algorithm.

Practical: Given a set of strings (=fragments, reads), construct the overlap graph and apply the greedy algorithm.

Sequencing with de Bruijn graphs

Topics: De Bruijn graphs: definition (full de Bruijn graphs, and de Bruijn subgraphs). De Bruijn graph of a string or of a set of strings. Walks in de Bruijn graphs. Sequencing by Euler tours/paths in de Bruijn graphs.

Theoretical: What does an Euler tour / path in the de Bruijn graph of a string correspond to? How do we compute an Euler tour / path? Complexity of algorithm.

Practical: Given a set of reads, apply the algorithm for reconstructing the target sequence. Given the alphabet and k, draw / complete the de Bruijn graph of order k.