Projects for the course "Algoritmi e Linguaggi per Bioinformatica: Algoritmi (2012/2013)"

Here you can get back to the main page of the course.

Here are some tips on how to make a good presentation.

Organisation:

The projects consist of the presentation of a topic, which the student has to work out by him-/herself, with support from the lecturer. The projects should be prepared in a way that they can be presented to the whole class: whoever has time to attend other people's presentations, please do! The idea is that this way we cover topics we didn't have time for in the lecture, and everyone can learn something.

The presentations can be given either during the first week of June, or during the second half of July (exact dates to be fixed).

The presentations can be given by 1-2 people; each person should speak for 10-15 minutes.

Each presentation should start with an introduction (what is the problem), motivation (why is this interesting), then continue with the technical part, and finish with a small summary (what did we see, what did we learn). You can assume familiarity with the contents of the lecture; so e.g. you don't have to say what a string is or why we want to do sequence alignment. But it is good to make the connection to the lecture and recall the most important definitions (e.g.: "We did similarity between strings, now we do distance between strings." Or: "Recall the definition of the SP-score.")

The language of the presentations is Italian or English (student's choice). Non-perfect English will not result in a reduction of your grade! You should use digital slides for the presentation, e.g. PowerPoint, LaTeX, Pages, ... But you can use the board in addition.

Topics:

These are the topics. You are very welcome to suggest others. The topics can be done by 1 or 2 people, and either with a biology/biotechnology (B) or a computer science/bioinformatics (C) background, as specified: e.g. "1-2B or 1C" means that this topic can be done by 1 or 2 people with a biology/biotech background, or by 1 person with a computer science background. The depth of coverage and the grade you receive will take your background into account. Topics marked with a (*) are challenging: it would be good to have a computer science background for these. If you find the course fairly easy, then you should choose one of these topics.

The chapter or page numbers refer to the Setubal and Meidanis book (S&M), the Bielefeld Lecture Notes on Sequence Analysis (Biel), the Durbin et al. book (Durbin), the Cormen et al. book (Cormen), the Gusfield book (Gusfield), or the Jones and Pevzner book (J&P). See the main page of the course for full details on the books. I can give you copies of all this material. You can also use other material you find, but please check with me first. (The material could be either not sufficiently deep, or too advanced.)

When you have decided on a project, please come and see me, or write me an email. If you have questions about the topics, or want to know more before deciding, please do the same.

So here is the list of topics:

    Sequence alignment and sequence similarity/distance:
  1. comparing similar sequences (1-2B): S&M 3.3.4
    Another adjustment of the Needleman-Wunsch algorithm: when we know that the two sequences we are comparing are very similar, it suffices to compute a band around the diagonal of the DP matrix.
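A minimal Python sketch of the idea (the function name and the scoring scheme of match +1, mismatch -1, gap -1 are our own choices, not from S&M): only cells with |i - j| <= k are filled, so the running time drops from O(nm) to roughly O(kn).

```python
def banded_similarity(s, t, k, match=1, mismatch=-1, gap=-1):
    # Fill only the DP cells with |i - j| <= k; cells outside the band stay -infinity.
    n, m = len(s), len(t)
    if abs(n - m) > k:
        raise ValueError("band of half-width k cannot contain any alignment path")
    NEG = float("-inf")
    D = [[NEG] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    for j in range(1, min(m, k) + 1):
        D[0][j] = D[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if j == 0:
                D[i][0] = D[i - 1][0] + gap
                continue
            best = D[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            if D[i - 1][j] != NEG:          # cell above may lie outside the band
                best = max(best, D[i - 1][j] + gap)
            if D[i][j - 1] != NEG:          # cell to the left may lie outside the band
                best = max(best, D[i][j - 1] + gap)
            D[i][j] = best
    return D[n][m]
```

If the optimal alignment path stays inside the band, which it does when the sequences are similar enough, this returns the same score as the full Needleman-Wunsch computation.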
  2. general gap functions (1-2B, or 1C): S&M 3.3.2
    In the lecture, we only looked at alignment scoring schemes where k gaps are always scored the same, whether they occur one by one or together in a block. Here we look at how to deal with gap functions that allow gaps occurring together in a block to be scored differently. The alignment algorithm has to be adjusted, and it then runs in O(n^3) time.
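The adjusted recurrence can be sketched as follows (Python; the function name and the example gap function in the usage are our own, not from S&M). Here w(k) is the penalty, a nonnegative cost, for a block of k gaps; for each cell, besides the diagonal step, every possible gap-block length is tried, which is where the cubic running time comes from.

```python
def align_general_gaps(s, t, score, w):
    # score(a, b): substitution score; w(k): cost (>= 0) of one block of k gaps.
    n, m = len(s), len(t)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = -w(i)
    for j in range(1, m + 1):
        D[0][j] = -w(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = D[i - 1][j - 1] + score(s[i - 1], t[j - 1])
            for k in range(1, i + 1):        # block of k gaps in t
                best = max(best, D[i - k][j] - w(k))
            for k in range(1, j + 1):        # block of k gaps in s
                best = max(best, D[i][j - k] - w(k))
            D[i][j] = best
    return D[n][m]
```

For example, with match +1, mismatch -1 and w(k) = 2 + k, aligning "ACGT" with "AT" scores -2: two matches, minus one gap block of length 2.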
  3. affine gap functions (1-2B, 1C): S&M 3.3.3
    This is a special class of gap functions, where the first gap has a fixed cost x (i.e. a high negative score), and then every following gap adds the same cost y. (I.e. 4 gaps have score x+3y.) These can be dealt with much better than general gap functions, namely in quadratic time O(n^2).
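A Python sketch of Gotoh's O(n^2) algorithm for this case (the names are ours; here x and y are written as nonnegative penalties that get subtracted, so a block of k gaps scores -(x + (k-1)y)). Three matrices are carried along: one for substitution columns and one for gap blocks in each sequence.

```python
def affine_similarity(s, t, score, x, y):
    # Gotoh's algorithm: a block of k gaps scores -(x + (k - 1) * y), x >= y >= 0.
    NEG = float("-inf")
    n, m = len(s), len(t)
    # M: s_i aligned with t_j; E: gap block in s ends at column j;
    # F: gap block in t ends at row i.
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    E = [[NEG] * (m + 1) for _ in range(n + 1)]
    F = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for j in range(1, m + 1):
        E[0][j] = -(x + (j - 1) * y)
    for i in range(1, n + 1):
        F[i][0] = -(x + (i - 1) * y)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = (max(M[i - 1][j - 1], E[i - 1][j - 1], F[i - 1][j - 1])
                       + score(s[i - 1], t[j - 1]))
            # extend an open block for -y, or open a new one for -x
            E[i][j] = max(E[i][j - 1] - y, M[i][j - 1] - x, F[i][j - 1] - x)
            F[i][j] = max(F[i - 1][j] - y, M[i - 1][j] - x, E[i - 1][j] - x)
    return max(M[n][m], E[n][m], F[n][m])
```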
  4. DP for edit distance (1B): Biel 3.5,3.6.
    We only looked in detail at similarity between strings. A related concept is distance between strings: the smaller the number of edit operations necessary to transform one string into the other, the more closely related the two strings are. The edit distance between two strings can be computed with a DP algorithm which is very similar to the one for alignment. (Goes well with no. 5, in 2 people.)
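The DP for unit-cost edit distance, as a small Python sketch; note that here we minimize, whereas the similarity DP maximizes.

```python
def edit_distance(s, t):
    # Unit-cost edit distance (Levenshtein): insertions, deletions, substitutions.
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                      # delete all of s[:i]
    for j in range(m + 1):
        D[0][j] = j                      # insert all of t[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = D[i - 1][j - 1] + (0 if s[i - 1] == t[j - 1] else 1)
            D[i][j] = min(sub, D[i - 1][j] + 1, D[i][j - 1] + 1)
    return D[n][m]
```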
  5. edit distance and similarity (1B): S&M 3.6.1.
    What is the connection between these two concepts? How can we compute sim(s,t) if we know dist(s,t)? (Goes well with no. 4, in 2 people.)
  6. forward-backward technique for pairwise alignment (1-2B or 1C): Biel 12.1, S&M 3.3.1
    This is a modification of the DP algorithm where we want to compute the optimal score of an alignment of s and t which aligns s_i with t_j, i.e. the best score among all alignments that align s_i with t_j. The only difficulty here is that in the Bielefeld lecture notes the algorithm is explained in terms of distance rather than similarity, so one has to translate it. (This topic goes well with no. 7, in 2 people.)
  7. pairwise alignment in linear space (1-2B or 1C): Biel 12.1, S&M 3.3.1
    A technique that makes it possible to compute, in linear space, not only the score of an optimal alignment but also the alignment itself. (This topic goes well with no. 6, in 2 people.)
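Computing just the score in linear space is the easy half of this technique (a Python sketch, with an assumed match/mismatch/gap scoring of +1/-1/-1); recovering the alignment itself additionally needs the divide-and-conquer step described in the references.

```python
def similarity_linear_space(s, t, match=1, mismatch=-1, gap=-1):
    # Score of an optimal global alignment, keeping only two DP rows at a time.
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap] + [0] * len(t)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]
```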
  8. Carrillo-Lipman space reduction for multiple sequence alignment (*) (1C): Biel 14.2
    How the huge space requirement of the DP algorithm for MSA can be reduced.
  9. Divide-and-conquer for multiple sequence alignment (*) (1C): Biel 14.4
    A heuristic for sum-of-pairs MSA.
  10. q-gram distance (1-2B): Biel 3.7 (partially: ignore the parts about the ranking functions)
    A different way of computing the distance between two strings: for every q-gram (string of length q), we count how many times it occurs in s and how many times in t, and take the absolute difference. The sum of these differences, over all q-grams, is the q-gram distance of s and t. (Goes well with no. 11, in 2 people.)
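A direct Python sketch of the definition (without the ranking functions, which the item above says to ignore):

```python
from collections import Counter

def qgram_distance(s, t, q):
    # Sum over all q-grams g of |occurrences of g in s - occurrences of g in t|.
    cs = Counter(s[i:i + q] for i in range(len(s) - q + 1))
    ct = Counter(t[i:i + q] for i in range(len(t) - q + 1))
    return sum(abs(cs[g] - ct[g]) for g in set(cs) | set(ct))
```

For example, with q = 2 the strings "abab" and "baba" have distance 2, even though their edit distance is also small; the two measures can disagree considerably on other inputs.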
  11. maximal matches distance (1-2B): Biel 3.8
    Yet another distance between two strings. (Goes well with no. 10, in 2 people.)

    More on strings:
  13. exact matching: Knuth-Morris-Pratt algorithm (1-2B, no C): Cormen 34.4, my own lecture notes from Cape Town (CT course notes).
    In the lecture we only considered approximate matching; here we have a long string s (the text) and a short string p (the pattern), and we want to find all exact occurrences of p as a substring of s.
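A Python sketch of the algorithm (0-based positions; the variable names are ours): the failure function records, for every prefix of p, the length of its longest proper border, which tells us how far to fall back on a mismatch so that no text character is ever re-read.

```python
def kmp_search(text, pattern):
    # Failure function: f[i] = length of the longest proper border of pattern[:i+1].
    f = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = f[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        f[i] = k
    # Scan the text, reusing the matched prefix length on every mismatch.
    occurrences, k = [], 0
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = f[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            occurrences.append(i - len(pattern) + 1)
            k = f[k - 1]
    return occurrences
```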
  14. exact matching: Boyer-Moore algorithm (1-2B, 1C): Gusfield 2.2, Cormen 34.5
    Another exact string matching algorithm: it uses some surprising tricks to avoid having to look at every position in the text.
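As a taste of the idea, here is a Python sketch of Horspool's simplified variant of Boyer-Moore (not the full algorithm from the references, which adds the good-suffix rule): after each attempt, the window is shifted according to the last occurrence in the pattern of the text character aligned with the pattern's final position, which often skips many text positions at once.

```python
def horspool_search(text, pattern):
    # Bad-character shift: distance from a character's last occurrence in
    # pattern[:-1] to the pattern's end; characters not in the pattern shift by m.
    m = len(pattern)
    shift = {c: m - i - 1 for i, c in enumerate(pattern[:-1])}
    occurrences, pos = [], 0
    while pos + m <= len(text):
        if text[pos:pos + m] == pattern:      # check the current window
            occurrences.append(pos)
        # shift based on the text character under the pattern's last position
        pos += shift.get(text[pos + m - 1], m)
    return occurrences
```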
  15. exact matching: Karp-Rabin algorithm (1C): Cormen 34.2
    A very nice algorithm which uses arithmetic for exact matching of strings.
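A Python sketch with one rolling hash (the base and modulus are arbitrary choices of ours); a hash hit is always verified by a direct comparison, so collisions cannot produce false positives.

```python
def karp_rabin(text, pattern, base=256, mod=1_000_003):
    # Rolling hash: compare hashes first, verify candidates by direct comparison.
    n, m = len(text), len(pattern)
    if m > n:
        return []
    h = pow(base, m - 1, mod)                 # weight of the window's first character
    hp = ht = 0
    for i in range(m):
        hp = (hp * base + ord(pattern[i])) % mod
        ht = (ht * base + ord(text[i])) % mod
    occurrences = []
    for pos in range(n - m + 1):
        if ht == hp and text[pos:pos + m] == pattern:
            occurrences.append(pos)
        if pos < n - m:                       # slide the window by one character
            ht = ((ht - ord(text[pos]) * h) * base + ord(text[pos + m])) % mod
    return occurrences
```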
  16. suffix trees: Biel ch. 9, J&P 9.5
    A wonderful data structure which allows many problems on strings to be solved much faster, e.g. exact matching, finding palindromes, finding repeats, finding unique substrings.
    1. WOTD algorithm (1-2B): A conceptually simple algorithm to construct the suffix tree; runs in O(n log n) expected time.
    2. Ukkonen's linear-time construction algorithm (*) (1C)
    3. application 1: exact string matching and explaining the basic properties of suffix trees (1-2B): Biel 9.7.1
    4. 2-3 applications (1-2C): Biel 9.7.2-4: shortest unique substring, maximal unique matches, maximal repeats
  17. suffix arrays: Biel ch. 10
    Similar to suffix trees, but they use less space.
    1. motivation, definitions, and construction from the suffix tree; exact matching in O(m log n) time (1C): Biel 10.1, 10.2, 10.3.1
    2. construction algorithms (*) (1-2C): Biel 10.3
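A small Python sketch: a naive suffix-array construction by sorting (for illustration only; the fast construction algorithms are the point of sub-topic 2), plus the O(m log n) binary-search matching from sub-topic 1.

```python
def suffix_array(s):
    # Naive construction: sort all suffix start positions lexicographically.
    return sorted(range(len(s)), key=lambda i: s[i:])

def sa_search(s, sa, p):
    # Two binary searches delimit the interval of suffixes that start with p.
    def prefix(i):
        return s[i:i + len(p)]
    lo, hi = 0, len(sa)
    while lo < hi:                     # first suffix whose length-|p| prefix is >= p
        mid = (lo + hi) // 2
        if prefix(sa[mid]) < p:
            lo = mid + 1
        else:
            hi = mid
    left, hi = lo, len(sa)
    while lo < hi:                     # first suffix whose length-|p| prefix is > p
        mid = (lo + hi) // 2
        if prefix(sa[mid]) <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[left:lo])         # starting positions of all occurrences of p
```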
  18. Burrows-Wheeler-Transform (*) (1-2C): Biel ch. 11
    This is a way of storing a long string such that it can be compressed well (i.e. stored in less space than the length of the text), while at the same time allowing fast searching (i.e. exact matching). Topics: the transformation and the retransformation, and backward search.
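The transformation and its inverse can be sketched naively in Python (quadratic, for illustration only; doing this efficiently, and searching with backward search, is the point of the topic). A sentinel character smaller than all others is appended so the retransformation is unique.

```python
def bwt(s, sentinel="$"):
    # Transform: last column of the lexicographically sorted rotations of s + sentinel.
    s += sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def inverse_bwt(b, sentinel="$"):
    # Retransform by repeatedly prepending the BWT column and sorting
    # (conceptual O(n^2 log n) version; the efficient inverse uses the LF-mapping).
    table = [""] * len(b)
    for _ in range(len(b)):
        table = sorted(b[i] + table[i] for i in range(len(b)))
    row = next(r for r in table if r.endswith(sentinel))
    return row[:-1]
```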
  19. Jumbled Pattern Matching (this is my own research): Here we again have a text s and a pattern p, but we are looking for substrings of s which are jumbled versions of p, i.e. which have the same multiplicity of every character. For instance, if s=aabcabaac and p=aba, then at positions 1, 5 and 6 there is a jumbled occurrence of p. There are several sub-topics here, both easy and more challenging; come and see me for material. The difficulty here is that this is not explained in textbooks, since it is a current research topic.
    1. simple window algorithm + motivation from Mass Spectrometry (1B), or 2B together with no. 2.
    2. linear-space data structure for Binary Jumbled Pattern Matching (there are only two possible characters): 1B, or 2B together with no. 1.
    3. jumping algorithm for general alphabets (more than 2 characters): (*) 1-2C
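The simple window algorithm of sub-topic 1 can be sketched in Python (0-based positions, so the occurrences of the example above at positions 1, 5 and 6 come out as 0, 4 and 5): a window of length |p| slides over s, and its character counts are updated in constant time per step.

```python
from collections import Counter

def jumbled_occurrences(s, p):
    # Slide a window of length |p| over s, maintaining the character multiset.
    m = len(p)
    target = Counter(p)
    window = Counter(s[:m])
    hits = [0] if window == target else []
    for i in range(1, len(s) - m + 1):
        window[s[i - 1]] -= 1                 # character leaving the window
        if window[s[i - 1]] == 0:
            del window[s[i - 1]]              # keep zero counts out of the Counter
        window[s[i + m - 1]] += 1             # character entering the window
        if window == target:
            hits.append(i)
    return hits
```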

    Other bioinformatics topics:
  21. BLOSUM matrices (1B): S. Eddy, "Where did the BLOSUM62 alignment score matrix come from?", Nature Biotechnology 22:1035-1036, 2004; also described in all bioinformatics textbooks.
    We looked in detail at PAM scoring matrices. BLOSUM matrices are the other group which is frequently used for alignment of proteins. How are BLOSUM matrices computed? What are the differences from PAM?
  22. HMMs: Hidden Markov Models (*) (2C).
    This is a mathematical model for interpreting variation in biological sequences, and it can be used for such things as gene prediction. The best explanation is in the Durbin et al. book, ch. 3+4, but that is very long; there are also my course notes on HMMs, which I hope explain the topic well, see here.
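As a taste of what one can do with an HMM, here is a Python sketch of the Viterbi algorithm, which finds the most probable state path (the model in the test, a fair coin vs. a biased coin, is an invented toy example, not from the course material).

```python
import math

def viterbi(obs, states, start, trans, emit):
    # Most probable state path for an observation sequence, in log-space
    # to avoid numerical underflow on long sequences.
    V = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            # best predecessor state for being in s now
            prev, score = max(
                ((r, V[-1][r] + math.log(trans[r][s])) for r in states),
                key=lambda x: x[1])
            col[s] = score + math.log(emit[s][o])
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):                # follow backpointers to the start
        path.append(ptr[path[-1]])
    return path[::-1]
```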
