Algoritmi e Linguaggi per Bioinformatica: Algoritmi

Projects for the course "Algoritmi e Linguaggi per Bioinformatica: Algoritmi (2014/2015)"

Here you can get back to the main page of the course.

Here are some tips how to make a good presentation.

Organisation:

The projects consist of the presentation of a topic, which the student has to work out by him-/herself, with support from the lecturer. The projects should be prepared in a way that they can be presented to the whole class: whoever has time to attend other people's presentation, please do! The idea is that this way, we cover topics we didn't have time for in the lecture, and everyone can learn something.

The presentations can be given either in June, or in July (exact dates to be fixed together).

The presentations can be done in 1-2 people; each person should speak for 10-15 minutes.

Each presentation should start with an introduction (what is the problem), motivation (why is this interesting), then continue with the technical part, and finish with a small summary (what did we see, what did we learn). You can assume familiarity with the contents of the lecture; so e.g. you don't have to say what a string is or why we want to do sequence alignment. But it is good to make the connection to the lecture and recall the most important definitions (e.g.: "We did alignment of two strings, now we do alignment of more than 2 strings." Or: "Recall the definition of edit distance.")

The language of the presentations is Italian or English (student's choice). Non-perfect English will not result in reduction of grades! You should use digital slides for the presentation, e.g. PowerPoint, Latex, Pages, ... But you can use the board in addition.

You can come and discuss your presentation with me as often as you like; and you have to hand in, or show, a draft of the presenatation, at least one week before the date. This will allow me to check it and to give comments.

Topics:

These are the topics. You are very welcome to suggest others. (For example, there are many other interesting topics in the Bielefeld Lecture Notes.) The topics can be done in 1 or 2, and either with biology/biotechnology (B) or computer science/bioinformatics (C) background, as specified: e.g. "1-2B or 1C" means that this topic can be done by 1 or 2 people with biology/biotech background, or by one with computer science background. The depth of coverage and the grade you receive will take into account your background. Those marked with a (*) are challenging---it would be good to have a computer science background for these.

The chapter or page numbers refer to the Setubal and Meidanis book (S&M), the Bielefeld Lecture Notes on Sequence Analysis (Biel), the Durbin et al. book (Durbin), the Cormen et al. (Cormen), the Gusfield (Gusfield), or the Jones and Pevzner book (J&P). See the main page of the course for full details on books. I can give you copies of all this material. You can also use other material you find, but please check with me. (The material could either be not sufficiently deep, or too advanced.)

When you have decided on a project, please come and see me, or write me an email. If you have questions about the topics, or want to know more before deciding, please come and see me, or write me an email.

So here is the list of topics:

Sequence alignment:

semiglobal alignment (1B): S&M 3.2.3
Another version of the DP-algorithm for pairwise sequence alignment: Now we want to disregard gaps at the beginning and/or end of one or both sequences.
comparing similar sequences (1-2B): S&M 3.3.4 (with detailed written explanation from me)
Another adjustment of the Needleman-Wunsch algorithm: When we know that the two sequences we are comparing are very similar, it suffices to compute the area around the diagonal of the DP-matrix.
general gap functions (2B): S&M 3.3.2
In the lecture, we only looked at alignment scoring schemes where k gaps are always scored the same, whether they occurred one by one, or they were together in a block. Here we look at how to deal with gap functions that allow for differently scoring gaps that occur together in a block. The alignment algorithm has to be adjusted, and it runs now in O(n^3) time.
affine gap functions (2B): S&M 3.3.3
This is a special group of functions, where the first gap has a fixed cost x (i.e. high negative score), but then every following gap adds the same cost y to the previous gap. (I.e. 4 gaps have score x+3y.) These can be dealt with much better than general gap functions, namely in quadratic time O(n^2).
pairwise alignment in linear space (1C)(*): S&M 3.3.1, Biel 12.1
A technique which allows to compute in linear space not only the score of an optimal alignment but also the alignment itself.
multiple sequence alignment: Now we want to align more than 2 sequences.
1. Introduction, DP-algo (1-2B) S&M 3.4.1 (p. 70-72): definitions, SP-score, DP-algorithm
2. progressive alignment heuristic (1-2B) Biel 16.1
3. star alignment heuristic (1-2B) S&M 3.4.2

Sequence similarity/distance:

Longest Common Subsequence and LCS-distance (1B): J&P 6.5. Another DP algorithm: for computing the longest common subsequence between two strings. There are two different DP-algorithms for this problem. LCS-distance.
q-gram distance (1-2B) Biel 3.7 (partially: ignore the parts about the ranking functions)
A different way of computing the distance between two strings: For every q-gram (string of length q), we count how many times it occurs in s and how many times in t, and build the absolute difference. The sum of these, for all q-grams, is the q-gram distance of s and t.
maximal matches distance (1-2B): Biel 3.8
Yet another distance between two strings.

More on strings:

exact matching: Knuth-Morris-Pratt algorithm (1-2B): Cormen 34.4, my own lecture notes from Cape Town (CT course notes).
In the lecture, we only considered approximate matching, while here we have a long string s (the text), and a short string p (the pattern), and we want to find all occurrences of p as a substring of s exactly.
exact matching: Boyer-Moore algorithm (1-2B, 1C): Gusfield 2.2, Cormen 32.5. Another exact string matching algorithm: it uses some surprising tricks to avoid having to look at every position in the text.
exact matching: Karp-Rabin algorithm (1C) (*): Cormen 34.2. A very nice exact matching algorithm which uses modular arithmetics for exact matching of strings.
suffix trees: Biel ch. 9, J&P 9.5. A wonderful data structure which allows to solve many problems on strings much faster, e.g. exact matching, finding palindromes, finding repeats, finding unique substrings.
1. introduction and application 1: exact string matching and explaining the basic properties of suffix trees (1-2B): Biel 9.7.1
2. WOTD algorithm (1-2B): A conceptually simple algorithm to construct the suffix tree; runs in time O(n log n).
3. 2-3 applications (1C: Biel 9.7.2-4): shortest unique substring, maximal unique matches, maximal repeats
suffix arrays: Biel ch. 10
Similar to suffix trees but uses less space.
1. motivation and definitions and construction from suffix tree; exact matching in O(m log n) time (1-2B or 1C): Biel 10.1, 10.2, 10.3.1
2. construction algorithms (*) (1-2C): 10.3 Biel
Burrows-Wheeler-Transform (*) (1-2C): Biel ch. 11
The BWT allows storing a long string in such a way that it can be compressed well (i.e. can be stored in a way that it takes less space than the length of the text), but at the same time it allows fast searching (i.e. exact matching). Several subtopics (transformation and retransformation, backward search).

Other topics:

PAM matrices (1-2B): Read and understand the original article (book chapter) by Dayhoff et al.: original article on PAM matrices. It is very nicely written and understandable, and explains more in detail the background than we had time for.
BLOSUM matrices (1-2B): Here is the original article (Henikoff & Henikoff, 1992); also this one can be helpful: S. Eddy, Where did the BLOSUM62 alignment score matrix come from?, Nature Biotechnology 22, 1035 - 1036, 2004); BLOSUM matrices are also described in all bioinformatics textbooks.
We looked in detail into PAM scoring matrices. BLOSUM matrices are the other group which is frequently used for alignment of proteins. How are BLOSUM matrices computed? Which are differences to PAM?
FASTA (1-2B): S&M 3.5.3. Another heuristic algorithm for database search, like BLAST, but simpler. Based on finding short exact matches (not similar hits as BLAST). The task here is to explain the way in which this can be done fast. Here is the original article, but the main task is to explain the algorithm as explained in the Setubal/Meidanis book.
RNA secondary structure prediction (2B, 2C) Biel ch. 8 (partially), depending on background, explanation of problem, possibly Nussinov algorithm.

Here you can get back to the main page of the course.