Fundamental Algorithms for Bioinformatics: Bioinformatics Algorithms
(academic year 2016/2017, second term)

Masters in Medical Bioinformatics

Title of course:Bioinformatics Algorithms
(Module 2 of Fundamental Algorithms for Bioinformatics)
Lecturer:Zsuzsanna Lipták
email: zsuzsanna DOT liptak AT univr DOT it  
(Please put "Course Bioinformatics Algorithms" in the subject line.)
Course times: Mon 11.30 - 13.30 (aula G)
Mon 14.30 - 16.30 (aula C) - from 8/5/17
Thu 8.30 - 10.30 (aula G)
Student hours:
Thu 11.00-13.00 and by appointment
Office:Ca' Vignal, 2, 1st floor, right corridor, office 1.79



News:


GOALS of the course: To learn about some of the basic problems and algorithms behind common bioinformatics applications (sequence alignment, sequence similarity, sequence assembly, phylogenetics).

CREDITS: 12 CFU, together with module 1 of this course (Algorithm Design). This module contributes half of the grade. It is gained via a written exam, followed by oral exam. You are admitted to the oral exam only if you passed the written exam.

If you are taking this course as a student of Molecular and Medical Biotechnology (LM9) ("mutuazione Algorithms for Computational Biology"), then you will get 6 CFU (credits) for this course, and will have a different exam to take. However, it still holds that you have to take a written and an oral exam.

ATTENDANCE: As in most university courses, attendance of classes is not mandatory. All of what I teach in this course is completely standard and is contained in most algorithmically oriented bioinformatics books (see e.g. list on this page). In the exam, you will be asked to do things like compute an alignment of two strings, given a scoring function; or prove the running time of Fitch's algorithm. You can learn this from a book or good online course. However, I believe that one gains much more from attending a course than from only studying by oneself. University classes give the student the opportunity of following a live course given by a live lecturer. If nothing else, attending the course forces you to spend a certain amount of time each week with this course. No handouts or slide presentations can completely substitute a lecturer. So if you can make it, I would advise you to attend. If you can't, you have to rely on the notes of your colleagues and/or on books.

MATERIAL: If there are slides, I hand them out and put them up on the webpage. However, even if there are slides, we always do more in class than what is on the slides: examples, missing proofs, additional material, illustrations, applications, etc. So you will always also need your class notes (or someone else's).

HOMEWORK: I regularly give you homework, and I expect that you do it. If you do not, you will not gain the experience in solving exercises that you need for the exam.

SYLLABUS:
Here is an overview of the topics that will be covered. The topics may vary slightly.



SUGGESTED BOOKS:


Lectures

Disclaimer: Here I list the topics which we treated in the lectures, and put up slides if we have any, as well as pointers to the respective chapters in books, and possibly other material of interest. However, you will always need the class notes, too! (even if there were slides, as well)

When I include corrections of slides, this refers to the handed out version. The slides which you can download here have already been corrected. If you find any mistakes, please let me know!

No. Date Topic Slides (if any) Homework (if any) Supporting material or chapters in books,
comments
1 Thu, 9/3/2017 Organisation of course. Introduction: Strings and sequences in Biology and in Computer Science. Pairwise alignment, exhaustive search for optimal alignment. Organisation (printable),
Strings and Sequences in Biology (printable),
Strings and Sequences in Computer Science (printable)
  1. Example on codons and reading frames (DNA translation) - p. 10 of "Strings and Sequences in Biology"
  2. How many prefixes, suffixes, substrings, and subsequences does a string of length n have?
  3. List all alignments of AC and GA
Correction of handout slides: "Strings and Sequences in Biology": p. 2: the characters in protein sequences are amino acids
2 Mo, 13/3/2017 Number of alignments: recursion, explicit exponential lower bound. Needleman-Wunsch DP-algorithm for global alignment, analysis. Pairwise Alignment (printable)   This and the following lectures follow pretty closely chapter 3 of the Setubal & Meidanis book.

Corrections of handout slides: p. 3, 4: the score of the third alignment is -4, not -2. On p. 28, it should read "Re 1" (not "Re 2")
3 Thu, 16/3/2017 Backtracing without and with traceback pointers. Space-saving variant of NW-algorithm. Local alignment: Smith-Waterman algorithm. same as before   Setubal & Meidanis, ch. 3.2.1
4 Mo, 20/3/2017 Local alignment: backtracing. Solution to second homework. Semiglobal alignment. Affine gap functions. Pairwise Alignment 2 (printable)   Setubal & Meidanis, ch. 3.2.2 and 3.2.3
5 Thu, 23/3/2017 Affine gap functions, cont'ed. Computing an optimal pairwise alignment in linear space. same as before   Setubal & Meidanis, ch. 3.3.3, Ohlebusch ch. 3.1.6
6 Thu, 30/3/2017 Computing an optimal pairwise alignment in linear space, cont'ed. The forward-backward technique. Pairwise Alignment 3 (printable) In the lecture, we computed an optimal al. for s=GAAGA and t=CACA, where for the matrix M, the priority left-diag-top was used. Redo the computation for priority top-diag-left. Setubal & Meidanis, ch. 3.3.1; Bielefeld Lecture Notes on Sequence Analysis (see above), ch. 5.3 and 5.4

Correction of handout slides: p. 9: ... they all need linear space (not time), and total space is O(n+m) (not: O(m)) because we need the space to store the alignment itself, too.
7 Mo, 3/4/2017 String distance measures: metric, edit distance, computing the edit distance, Connection between edit distance and alignment score, LCS-distance, Hamming distance. String Distance (printable)
  1. Show that the Hamming distance is a metric.
  2. Show that the LCS-distance is a metric.
  3. Find a DP-algorithm that computes LCS(s,t).
Setubal & Meidanis ch. 3.6.1, Ohlebusch ch. 8.1.5
8 Thu, 6/4/2017 String distance 2: The q-gram distance. The q-gram distance (printable)   Bielefeld Lecture Notes on Sequence Analysis (see above), ch. 3.7

Correction of first version of slides: p. 5: |s| >= q (not <=q); p. 9: dist_q(s,t) (not: dist_1(s,t)); p. 10: proof slightly updated; p. 11: added examples; p. 19: total time O(n + m + sigma^q) (not: O(n + m sigma^q)
9 Mo, 10/4/2017 De Bruijn graphs. see slides for lecture 21    
10 Thu, 20/4/2017 Database search: BLAST. BLAST (summary only!) (printable)   Here is the orginal article on BLAST: S.F. Altschul et al. Basic Local Alignment Search Tool, J. Mol. Biol. 215: 403-410 (1990); and the one on BLAST2: S.F. Altschul et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Research, Vol. 25, No. 17: 3389-3402 (1997).
There is a very detailed section on BLAST in the book by Mount (full ref. see above), within chapter 7, p. 300 onwards. BLAST is also described in Chapter 6.3 of the BielSeqAn Lecture Notes (see above).
There are excellent help pages on the NCBI Blast page, among them on how to submit a request and how to read the results (version of Sept. 2016, always get the latest one!).
11 Thu, 27/4/2017 BLAST cont'ed. Scoring matrices 1. same as before    
12 Thu, 4/5/2017 Scoring matrices: PAM, BLOSUM. Scoring Matrices (printable)   Setubal & Meidanis, ch. 3.5.1
Here are two PAM mutation matrices, and two PAM scoring matrices (the ones handed out in the lecture). You can read the original article on PAM matrices (more exactly, book chapter), by Margret Dayhoff et al, 1978. It is very understandably written! Here is the BLOSUM62 matrix. See also this nice little article by Sean Eddy about the BLOSUM matrices (S. Eddy, Where did the BLOSUM62 alignment score matrix come from? Nature Biotechnology 22, pp. 1035-1036, 2004).

Correction of handout slides: "Scoring Matrices": p. 7, bullet 3: Prob(b changes into a in 2 steps) and not: in k steps
13 & 14 Mo, 8/5/2017 (a.m. & p.m.) Multiple Sequence Alignment (MSA) 1 & 2: definitions, DP solution, Carillo-Lipman search space reduction, star alignment as a heuristic (similarity scores)     Setubal & Meidanis, ch. 3.4: DP-algorithm (3.4.1), Carillo-Lipman (p. 73-76), star alignment for similarity score (ch. 3.4.2); Ohlebusch, ch. 8.2 (Carillo-Lipman: 8.2.1)
15 Thu, 11/5/2017 MSA 3: progressive alignment     Bielefeld Lecture Notes on Sequence Analysis: Appendix H.1, Ohlebusch ch. 8.2.3
16 Mo, 15/5/2017 (a.m.) MSA 4: star alignment as an approximation (for distance functions with triangle inequality)   Exercises on MSA. Ohlebusch, ch. 8.2.2: star alignment
17 Mo, 15/5/2017 (p.m.) Sequence Assembly I: Overlap graphs, Shortest Common Superstrings, Greedy Algorithm Fragment Assembly 1 (printable)   Setubal & Meidanis, ch. 4.3
18 Mo, 22/5/2017 (a.m.) Stéphane Vialette (CNRS, Univ. Paris-Est): Algorithms for RNA-folding 1 slides   The Nussinov algorithm for RNA folding is also described in the Setubal & Meidanis book (ch. 8.1). And in the Bielefeld Lecture Notes on Sequence Analysis, Appendix E.
19 Mo, 22/5/2017 (p.m.) Stéphane Vialette (CNRS, Univ. Paris-Est): Flexible RNA design under structure and sequence constraints using formal languages     Seminario del Dipartimento di Informatica, abstract see here. The slides for the talk can be found here. This lecture is not part of the exam.
20 Thu, 25/5/2017 Stéphane Vialette (CNRS, Univ. Paris-Est): Algorithms for RNA-folding 2 same as for lecture 18    
21 Mo, 29/5/2017 Sequence Assembly II: de Bruijn graphs, Euler tours, SBH, NGS reads. Fragment Assembly 2 (printable)   See this article: How to apply de Bruijn graphs to genome assembly by Phillip E.C. Compeau, Pavel A. Pevzner, and Glenn Tesler, Nature Biotechnology, vol. 29 no. 11 (Nov. 2011); there is also supplementary material for this article.
22 Thu, 1/6/2017 More details and exercises on Sequence Assembly and RNA folding.     See here for an example for the Nussinov-algorithm (by Ana Damatar).
23 Mo, 5/6/2017 a.m. Travis Gagie (Diego Portales Univ., Santiago de Chile): Quick introduction to BWT     slides were distributed in class
24 Mo, 5/6/2017 p.m. Travis Gagie (Diego Portales Univ., Santiago de Chile): Fast locating with the runlength-encoded BWT     See here for an abstract.
25 Tue, 6/6/2017 Travis Gagie (Diego Portales Univ., Santiago de Chile): Compressed coloured de Bruijn graphs     Compressed colored de Bruijn graphs
26 Thu, 8/6/2017 Summary of course (part 1): pairwise sequence alignment incl. special variants, string distance measures, multiple sequence alignment.      
27 Fri, 9/6/2017 Summary of course (part 2): MSA (cont.), scoring matrices, BLAST, RNA folding, sequence assembly (overlap graphs, de Bruijn graphs)     See here for a study list (and here for the version for MMBiotech students).