Here you can get back to the main page of the course.

Algorithms for Computational Biology: Exam preparation (spring term 2018/2019)

Study list for students of the Molecular and Medical Biotechnology

(For the study list for students of the Masters in Medical Bioinformatics, see here.)

Note for the exam: Please arrive a bit ahead of time and make sure you have identification (with photo) with you. You are not allowed to use any material or notes, and can only use the paper provided in the exam.

The following table contains an overview the contents of the course, subdivided by areas. This is followed by typical example questions, split between more theoretical and more practical questions.

Note: I expect you to know what algorithms we studied but not necessarily the details, and only in the most fundamental cases the analysis. Of course, you are free to study all other topics that we did in class, as well. However, these topics are not considered a necessary part of your course. (These are marked with "extra".) If you know all other topics, and know some of these as well, that is of course a positive point for the exam.

  Pairwise sequence alignment
Formal Topics: alphabet, strings, prefixes, suffixes, substrings, subsequences; matrix, factorial, binomial coefficient, logarithms, sums; number of substrings, subsequences, of strings of length n; reverse complement of DNA strings; genetic code.

Theoretical: How many substrings / subsequences does a string of length n have? Give examples of typical alphabets.

Practical: Compute the reverse complement of a given string, compute the resulting protein, given the DNA or RNA sequence (genetic code supplied).

Global and local alignment Topics: Definition of alignment, number of alignments, scoring functions, optimal alignments, similarity; exhaustive search algorithm; Needleman-Wunsch DP-algorithm for global alignment; Smith-Waterman algorithm for local alignment; space-saving variant of N-W algo

Theoretical: What is the difference between global and local alignment, when are they applied? Give the definition of the cell D(i,j) for global and of the cell L(i,j) for local alignment (as opposed to how they are computed). Give the recursion for the computation. Give the running times and space requirements of these algorithms. Why are these preferable to the exhaustive search algorithm? Recursive formula for number of alignments. Explain space-saving variant of N-W algorithm.

Practical: Compute the score of a given alignment, for given scoring function; compute sim(s,t) using the DP algorithm of Needleman-Wunsch; compute an optimal alignment/all optimal alignments using the DP-table. Connection between the table and alignments: write down the alignment represented by a path in the table and vice versa; find all optimal local alignments using the DP-algorithm of Smith-Waterman.

Specialized pairwise alignment algorithms Topics: Semiglobal alignment (different variants). Algorithm for affine gap functions. Optimal alignment in linear space.

Theoretical: Problem specification, in particular of the affine gap penalty variant: what is meant by "gap open" and "gap extend". Running times of algorithms (in particular affine gap). Which version of the basic DP-algorithm would you choose for a given problem (e.g. for computing overlaps, deciding whether t is substring of s, ...).

Practical: Score given alignment using different variants (e.g. affine gap penalty, semiglobal).

extra: further details about the algorithms studied.

  String distance measures
String distance measures

Topics: Definition of a metric. Edit distance, Hamming distance, LCS-distance, q-gram distance.

Theoretical: Define unit cost edit distance. How do we have to choose the scoring function in order to have a parallel between edit distance and alignment? Explain and analyse the DP algorithm for edit distance. Define the q-gram distance.

Practical: Compute the edit distance between two strings, using the DP-algorithm. Give the series of edit operations corresponding to an alignment and vice versa. Compute the q-gram distance between two strings.

extra: further details about the algorithms studied.

  Sequence analysis in practice
Scoring matrices

Topics: Motivation for different match and mismatch scores. PAM scoring matrices. Computation of PAM scoring matrices. BLOSUM matrices (sketched only).

Theoretical: Explain how the PAM scoring matrices are computed. Explain the biological motivation. What is the underlying idea? What data were used? What does the number k mean in PAMk? What is the difference between the probability matrix and the scoring matrix? What do the entries represent? Interpret their values. Why do we use a "log-odds" matrix? What are the main differences between PAM and BLOSUM matrices?

Practical: Use a given PAMk or BLOSUM-k matrix (supplied) to score an alignment.

Dotplots Topics: Dotplots.

Compute a dotplot for a pair of strings.

BLAST

Topics: BLAST.

Theoretical: What is a heuristic? Explain the underlying ideas of BLAST. What is the advantage over the DP-algorithms? What is the primary application? Why are heuristics used and not the DP-algorithms?

Practical: Given a small example (a query and some db sequences), explain how BLAST works on the example: list some high-scoring words, show where there is a hit, show how to extend it.

  De Bruijn graphs and sequence assembly
Sequencing with de Bruijn graphs

Topics: De Bruijn graphs: definition (de Bruijn subgraphs). De Bruijn graph of a string or of a set of strings. Walks in de Bruijn graphs. Sequencing by Euler tours/paths in de Bruijn graphs.

Theoretical: What does an Euler tour / path in the de Bruijn graph of a string correspond to? How do we compute an Euler tour / path?

Practical: Given a set of reads, apply the algorithm for reconstructing the target sequence.

extra: De Bruijn sequences, full de Bruijn graphs.

  Phylogenetics
General Topics: Trees, phylogenetic trees, number of trees.

Theoretical: What is a phylogenetic tree? Kinds of phylogenetic trees. How many ways are there to root a tree? How many phylogenetic trees are there on a given set of taxa? What are the two types of input data?

Practical: Check if a given graph is a tree. Identify in a rooted tree leaves, root, and for a given node, its parent, siblings, children, ancestors, descendants. Check whether two drawings depict the same phylogenetic tree (very small examples). Root unrooted trees. Identify whether a phylogenetic tree is rooted/unrooted, binary/multifurcating, whether the branch lengths matter.

Distance based data Topics: Metric, ultrametric, UPGMA.

Theoretical: Explain the aim of distance based phylogenetic reconstruction. What is given (input), what are we looking for (output)? Explain the running time of UPGMA. Def. of ultrametric. What does "molecular clock" mean? Define metric.

Practical: Check whether a given distance matrix is ultrametric. Apply the algorithm UPGMA to a distance matrix.

extra: Additive metric.

Character based data Topics: Homoplasies, compatibility, Perfect Phylogeny. Parsimony, most parsimonious tree, Small Parsimony, Large Parsimony. Fitch' algorithm.

Theoretical: What are convergence and reversal? Def. of compatibility (of a character with a tree). Def. of Perfect Phylogeny. Does it always exist? Define parsimony (of a phylogenetic tree).

Practical: Check whether a phylogenetic tree is a PP for a character-state matrix M. Compute the parsimony score of a given phylogenetic tree. Identify cases of reversal and convergence in a phylogenetic tree. Apply Fitch' algorithm to a tree.

extra: Analysis of Fitch' algorithm.

  String indexes
String indexes

All of this topic is extra for students of Molecular and Medical Biotechnology (i.e. not necessary for the exam).

Topics: Suffix trees, suffix arrays, k-mer index.

Theoretical: When are string indexes appropriate? What is the storage space requirement of a suffix tree? In what time can it be constructed? What is pattern matching? How does one do it on the suffix tree? What are the running times of the different variants? How do we compute e.g. stringdepth of nodes?

Practical: Construct the ST of a given string T. Do pattern matching on it, given a pattern P. Compute stringdepth and number of leaves in subtree, for each node u. Construct the k-mer index of T.