Algoritmi e Linguaggi per Bioinformatica: Algoritmi

Here you can get back to the main page of the course.

Algoritmi e Linguaggi per Bioinformatica: Algoritmi

Final exam preparation.

There will be two versions of the exam: short version (5 or 6 exercises - 2 hours), covering the second half of the course; and long version, covering the whole course (8 exercises - 3 hours). The exam questions will be in Italian and in English. You can answer either in Italian or in English. Here is a list of things you should know for the final exam:

New: The exam will take place on Tuesday 5 June, 2012, in Aula D (like last time). Please be there at 13.00, so we have time to explain the exercises.

Part I
Formal	alphabet, prefixes, suffixes, substrings, subsequences, matrix
Pairwise sequence alignment	Practical: Compute the score of a given alignment, for given scoring scheme; compute sim(s,t) using the DP algorithm of Needleman-Wunsch; compute an optimal alignment using the DP-table; find all optimal local alignments using the DP-algorithm of Smith-Waterman; understand the connection between the DP-table and alignments: being able to give the alignment represented by a path in the table and vice versa. Theoretical: Give the running times and space requirements of these algorithms. Why are these preferable to the brute-force solution?
Algorithm analysis	Theoretical: What are the two parameters of algorithms we analyse? What do O-classes measure? With respect to what? Which classes are feasible (manageable) and which aren't? Why? Practical: Put in order of O-classes a given set of functions (slowest first, fastest last). Compare two functions, which grows faster? Which is preferable for an algorithm's running time/space consumption?
Scoring matrices	Theoretical: Explain how the PAM scoring matrices are computed. Explain the biological motivation. What is the underlying idea? What data are used? What does the number k mean in PAMk? What do the entries represent? Interpret their values. Practical: Use a given PAMk matrix (to be supplied in the exam) to score an alignment. Given a small probability matrix M (supplied), compute certain individual entries of the PAM2-matrix (all necessary data supplied, no need to be able to do complex logarithms in your head).
Heuristics for sequence alignment	Theoretical: What is a heuristic? Explain the underlying ideas of FASTA and BLAST. What is the advantage over the DP-algorithms? What is the primary application? Why are heuristics used and not the DP-algorithms? Practical: Compute a dotplot for two sequences (simple and filtered). Compute common q-grams (hotspots) as does FASTA. Find high-scoring q-grams (seeds) as does BLAST (for given scoring scheme).
String similarity and distance	Practical: Give a series of edit operations (not necessarily optimal) which transform one string into the other. Compute the cost, given a cost measure. Compute the Hamming distance and q-gram distance of two strings. Give an upper bound on the edit distance. (extra: Compute the edit distance, using the theorem.) Theoretical: What similarity measures do you know? How does the q-gram distance relate to the unit-cost edit distance? Why is it useful? How do similarity score and edit distance relate? What is the mathematical abstraction for the notion of distance?
Part II
Graphs, trees, phylogenetic trees	Practical: Check whether a given graph is a tree. Draw a graph. List a given graph's edges. Use terms like edge, simple path, cycle, labelled, degree, connected. In trees: leaf, root, rooted/unrooted, depth, parent, sibling, child, ancestor, descendant. Check whether two trees are isomorphic, i.e. "the same" (very small examples). Theoretical: How many ways are there to root a tree? How many edges does a tree on n nodes have? What is a phylogenetic tree? How many phylogenetic trees are there on a given set of taxa? What are the types of input data?
Distance based data	Practical: Check whether a given distance matrix is ultrametric. Apply the algorithm UPGMA to a distance matrix. Theoretical: What is the running time of UPGMA? Connection between metric and ultrametric? Def. of ultrametric. What does "molecular clock" mean?
Character based data	Theoretical: Def. of compatibility (of a character with a tree). Def. of Perfect Phylogeny. Does is always exist? Define/explain Small Parsimony and Large Parsimony. How can these problems be solved? What algorithms have we seen? How efficient are they? Practical: Compute the parsimony score of a given labelled tree. Apply Fitch's algorithm. Apply greedy sequential algorithm. Apply Nearest Neighbor Interchange.
Statistics	Practical: Compute mean, median, variance, standard deviation, percentiles, quartiles of a small data set. Define the null hypothesis for a given problem. Decide whether or not to reject the null hypothesis (given p-value and alpha). Theoretical: What is the difference between descriptive statistics and inferential statistics? What is a sample? What is a population? Def. of sample and population mean, variance, standard deviation. What does null hypothesis mean? What is the p-value? What are type I and type II errors?

New: The exam will take place on Tuesday 5 June, 2012, in Aula D (like last time). Please be there at 13.00, so we have time to explain the exercises.