Projects for the course "Algoritmi e Linguaggi per Bioinformatica: Algoritmi (2013/2014)"

Here you can get back to the main page of the course.

Organisation:

The projects consist of the presentation of a topic, which the student has to work out by him-/herself, with support from the lecturer. The projects should be prepared so that they can be presented to the whole class: whoever has time to attend the other students' presentations, please do! The idea is that this way we cover topics we didn't have time for in the lecture, and everyone can learn something.

The presentations can be given either during the first week of June, or during the second half of July (exact dates to be fixed together).

The presentations can be given by 1-2 people; each person should speak for 10-15 minutes.

Each presentation should start with an introduction (what is the problem), motivation (why is this interesting), then continue with the technical part, and finish with a small summary (what did we see, what did we learn). You can assume familiarity with the contents of the lecture; so e.g. you don't have to say what a string is or why we want to do sequence alignment. But it is good to make the connection to the lecture and recall the most important definitions (e.g.: "We did similarity between strings, now we do distance between strings." Or: "Recall the definition of the SP-score.")

The language of the presentations is Italian or English (student's choice). Imperfect English will not result in a lower grade! You should use digital slides for the presentation, e.g. PowerPoint, LaTeX, Pages, ... But you can use the board in addition.

Topics:

These are the topics. You are very welcome to suggest others. The topics can be done by 1 or 2 people, and either with a biology/biotechnology (B) or a computer science/bioinformatics (C) background, as specified: e.g. "1-2B or 1C" means that this topic can be done by 1 or 2 people with a biology/biotech background, or by one person with a computer science background. The depth of coverage and the grade you receive will take your background into account. Topics marked with a (*) are challenging---it would be good to have a computer science background for these. If you find the course fairly easy, then you should choose one of these topics.

The chapter or page numbers refer to the Setubal and Meidanis book (S&M), the Bielefeld Lecture Notes on Sequence Analysis (Biel), the Durbin et al. book (Durbin), the Cormen et al. book (Cormen), the Gusfield book (Gusfield), or the Jones and Pevzner book (J&P). See the main page of the course for full details on the books. I can give you copies of all this material. You can also use other material you find, but please check with me first. (The material could either be not sufficiently deep, or too advanced.)

When you have decided on a project, please come and see me, or write me an email. If you have questions about the topics, or want to know more before deciding, please come and see me, or write me an email.

So here is the list of topics:

    Sequence alignment:
  1. comparing similar sequences (1-2B): S&M 3.3.4
    Another adjustment of the Needleman-Wunsch algorithm: when we know that the two sequences we are comparing are very similar, it suffices to compute a band around the diagonal of the DP matrix.
  2. general gap functions (1-2B, or 1C): S&M 3.3.2
    In the lecture, we only looked at alignment scoring schemes where k gaps are always scored the same, whether they occur one by one or together in a block. Here we look at how to deal with gap functions that allow gaps occurring together in a block to be scored differently. The alignment algorithm has to be adjusted, and it then runs in O(n^3) time.
  3. affine gap functions (1-2B, 1C): S&M 3.3.3
    This is a special class of gap functions, where the first gap has a fixed cost x (i.e. a high negative score), and every following gap adds the same cost y. (E.g. a block of 4 gaps has score x+3y.) These can be dealt with much better than general gap functions, namely in quadratic time O(n^2).
  4. Longest Common Subsequence (1B): J&P 6.5. Another DP algorithm: for computing the longest common subsequence of two strings.
  5. pairwise alignment in linear space (1-2C): S&M 3.3.1, Biel 12.1
    A technique which allows computing, in linear space, not only the score of an optimal alignment but also the alignment itself.
  6. Carrillo-Lipman space reduction for multiple sequence alignment (*) (1C): Biel 14.2
    How the huge space requirement of the DP algorithm for MSA can be reduced.
  7. Divide-and-conquer for multiple sequence alignment (*) (1C): Biel 14.4
    A heuristic for sum-of-pairs MSA.
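To make the affine gap scoring of topic 3 concrete, here is a minimal sketch of global alignment with the classic three-matrix DP (Gotoh's method), which achieves the O(n^2) time mentioned above. This is not from the course material; the parameter names and scoring values (match, mismatch, x, y) are illustrative assumptions.

```python
# Minimal sketch: global alignment with affine gap scores, where a gap
# block of length k scores x + (k-1)*y, via Gotoh's three-matrix DP.
# Scoring parameters are illustrative assumptions, not course material.

NEG = float("-inf")

def affine_align_score(s, t, match=1, mismatch=-1, x=-2, y=-0.5):
    n, m = len(s), len(t)
    # M[i][j]: best score ending with s[i-1] aligned to t[j-1]
    # X[i][j]: best score ending with a gap in t (s character over a gap)
    # Y[i][j]: best score ending with a gap in s (gap over a t character)
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):          # leading gap block in t
        X[i][0] = x + (i - 1) * y
    for j in range(1, m + 1):          # leading gap block in s
        Y[0][j] = x + (j - 1) * y
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = max(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1]) + sub
            # opening a gap costs x, extending an open gap costs y
            X[i][j] = max(M[i-1][j] + x, X[i-1][j] + y)
            Y[i][j] = max(M[i][j-1] + x, Y[i][j-1] + y)
    return max(M[n][m], X[n][m], Y[n][m])
```

For example, aligning "AAAA" with "AA" under these parameters gives two matches plus one gap block of length 2 (score x + y), which is better than two separate gap blocks (score 2x).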

    Sequence similarity/distance:
  8. DP for edit distance (1B): Biel 3.5, 3.6.
    We only looked in detail at similarity between strings. A related concept is distance between strings: the smaller the number of edit operations needed to transform one string into the other, the more closely related the two strings are. The edit distance between two strings can be computed with a DP algorithm which is very similar to the one for alignment. (Goes well with no. 9, in 2 people.)
  9. edit distance and similarity (1B): S&M 3.6.1 (partially).
    What is the connection between these two concepts? How can we compute sim(s,t) if we know dist(s,t)? (Goes well with no. 8, in 2 people.)
  10. q-gram distance (1-2B): Biel 3.7 (partially: ignore the parts about the ranking functions)
    A different way of computing the distance between two strings: for every q-gram (string of length q), we count how many times it occurs in s and how many times in t, and take the absolute difference of the counts. The sum of these differences, over all q-grams, is the q-gram distance of s and t.
  11. maximal matches distance (1-2B): Biel 3.8
    Yet another distance between two strings.
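The q-gram distance described above is simple enough to sketch in a few lines. This is a minimal illustration under the definition given in the topic description; the function name and example strings are assumptions.

```python
# Minimal sketch of the q-gram distance: for every q-gram, sum the
# absolute differences of its occurrence counts in s and in t.
from collections import Counter

def q_gram_distance(s, t, q=2):
    # Count every length-q substring of each string.
    cs = Counter(s[i:i + q] for i in range(len(s) - q + 1))
    ct = Counter(t[i:i + q] for i in range(len(t) - q + 1))
    # Counter returns 0 for missing q-grams, so one sum over the
    # union of observed q-grams covers both directions.
    return sum(abs(cs[g] - ct[g]) for g in set(cs) | set(ct))
```

For example, with q = 2 the strings "abcab" (q-grams ab, bc, ca, ab) and "abc" (q-grams ab, bc) differ by one occurrence of "ab" and one of "ca", giving distance 2.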

    More on strings:
  12. exact matching: Knuth-Morris-Pratt algorithm (1-2B, no C): Cormen 34.4, my own lecture notes from Cape Town (CT course notes).
    In the lecture we only considered approximate matching; here we have a long string s (the text) and a short string p (the pattern), and we want to find all exact occurrences of p as a substring of s.
  13. exact matching: Boyer-Moore algorithm (1-2B, 1C): Gusfield 2.2, Cormen 32.5. Another exact string matching algorithm: it uses some surprising tricks to avoid having to look at every position in the text.
  14. exact matching: Karp-Rabin algorithm (1C): Cormen 34.2. A very nice algorithm which uses arithmetic for exact matching of strings.
  15. suffix trees: Biel ch. 9, J&P 9.5
    A wonderful data structure which allows solving many problems on strings much faster, e.g. exact matching, finding palindromes, finding repeats, finding unique substrings.
    1. WOTD algorithm (1-2B): A conceptually simple algorithm to construct the suffix tree; runs in expected time O(n log n).
    2. Ukkonen's linear-time construction algorithm (*) (1C)
    3. application 1: exact string matching and explaining the basic properties of suffix trees (1-2B): Biel 9.7.1
    4. 2-3 applications (1-2C): Biel 9.7.2-4: shortest unique substring, maximal unique matches, maximal repeats
  16. suffix arrays: Biel ch. 10
    Similar to suffix trees, but uses less space.
    1. motivation, definitions, and construction from the suffix tree; exact matching in O(m log n) time (1C): Biel 10.1, 10.2, 10.3.1
    2. construction algorithms (*) (1-2C): Biel 10.3
  17. Burrows-Wheeler Transform (*) (1-2C): Biel ch. 11
    The BWT stores a long string in such a way that it can be compressed well (i.e. stored in less space than the length of the text), while at the same time allowing fast searching (i.e. exact matching). Transformation and retransformation, backward search.
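As a small illustration of the Burrows-Wheeler topic, the transformation and a naive retransformation can be sketched as below. This is only a quadratic toy version (real implementations build the transform from a suffix array in linear time); the '$' end marker and function names are assumptions, and s must not itself contain '$'.

```python
# Toy Burrows-Wheeler transform and inverse, for illustration only.
# Assumes the sentinel '$' does not occur in s.

def bwt(s):
    # Sort all rotations of s + '$' and take the last column.
    s = s + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def inverse_bwt(b):
    # Naive inversion: repeatedly prepend the BWT column to the
    # sorted table of partial rotations; after len(b) rounds the
    # table holds all rotations, and the row ending in '$' is the
    # original string (plus sentinel).
    table = [""] * len(b)
    for _ in range(len(b)):
        table = sorted(b[i] + table[i] for i in range(len(b)))
    row = next(r for r in table if r.endswith("$"))
    return row[:-1]
```

For example, bwt("banana") gives "annb$aa", and inverse_bwt recovers "banana" from it.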

    Other bioinformatics topics:
  18. BLOSUM matrices (1B): S. Eddy, Where did the BLOSUM62 alignment score matrix come from?, Nature Biotechnology 22, 1035-1036, 2004; also described in all bioinformatics textbooks.
    We looked in detail into PAM scoring matrices. BLOSUM matrices are the other group frequently used for alignment of proteins. How are BLOSUM matrices computed? What are the differences to PAM?
  19. FASTA (1-2B): S&M 3.5.3. Another heuristic algorithm for database search, like BLAST, but simpler. Based on finding short exact matches (rather than similar hits, as in BLAST). The task here is to explain how this can be done fast.
  20. PAM (1-2B): Read and understand the original article (book chapter) by Dayhoff et al. on PAM matrices. It is very nicely written and understandable, and explains the background in more detail than we had time for.
