Projects for the course "Algoritmi e Linguaggi per Bioinformatica: Algoritmi"

Here you can get back to the main page of the course.

Organisation:

The projects consist of the presentation of a topic, which the student has to work out by him-/herself, with support from the lecturer. The projects should be prepared in a way that they can be presented to the whole class: whoever has time to attend other people's presentation, please do! The idea is that this way, we cover topics we didn't have time for in the lecture, and everyone can learn something.

The dates of the presentation will probably be: 10-11-12-13 July. We will fix these more exactly as we get closer.

The presentations can be done in 1-3 people; each person should speak for 10-15 minutes.

Each presentation should start with an introduction (what is the problem), motivation (why is this interesting), then continue with the technical part, and finish with a small summary (what did we see, what did we learn). You can assume familiarity with the contents of the lecture; so e.g. you don't have to say what a string is or why we want to do sequence alignment. But it is good to make the connection to the lecture and recall the most important definitions (recall the definition of q-grams; we did pairwise sequence alignment, now we do multiple sequence alignment; etc.)

The language of the presentations is Italian or English (student's choice). Non-perfect English will not result in reduction of grades! You should use digital slides for the presentation, e.g. PowerPoint, Latex, Pages, ... But you can use the board in addition.

Topics:

These are the topics. You are very welcome to suggest others. They can all be done in 1, 2, or 3, unless otherwise specified. The chapter or page numbers refer to the Setubal and Meidanis book (S&M), the Bielefeld Lecture Notes on Sequence Analysis (BielefeldLN), or the Durbin et al. book (Durbin). See the main page of the course for full details on books. I can give you copies of all this material. When you have decided on a project, please come and see me, or write me an email, so we can discuss it in detail, and I can give you specific material. The idea is that we decide the exact contents together, depending also on your background, number of people, etc. If you have questions about the topics, or want to know more before deciding, please come and see me, or write me an email.

So here is the list of topics:

    Sequence alignment:
  1. general gap functions (1-2 people, or in 3 together with no. 2): S&M 3.3.2
    In the lecture, we only looked at alignment scoring schemes where k gaps are always scored the same, whether they occurred one by one, or they were together in a block. Here we look at how to deal with gap functions that allow for differently scoring gaps that occur together in a block. The alignment algorithm has to be adjusted, and it runs now in O(n^3) time.
  2. affine gap functions (1-2 people, or in 3 together with no. 1): S&M 3.3.3
    This is a special group of functions, where the first gap has a fixed cost x (i.e. high negative score), but then every following gap adds the same cost y to the previous gap. (I.e. 4 gaps have score x+3y.) These can be dealt with much better than general gap functions, namely in quadratic time O(n^2).
  3. semiglobal alignment (1 person): S&M 3.2.3
    An adjustment of the Needleman-Wunsch algorithm to the case where gaps at the end and/or beginning of sequences have cost 0. This is useful for computing overlaps, for example.
  4. comparing similar sequences (1 person): S&M 3.3.4
    Another adjustment of the Needleman-Wunsch algorithm: When we know that the two sequences we are comparing are very similar, it suffices to compute the area around the diagonal of the DP-matrix.
  5. multiple sequence alignment (2-3 people): S&M 3.4 (partially)
    Now we want to have an alignment of k>2 sequences. This is a big topic, and there are many different sub-topics, to be decided together.

  6. More on strings:
  7. exact matching: Knuth-Morris-Pratt algorithm (only for non-computer-science background):
    In the lecture, we only considered approximate matching, while here we want to know where a pattern occurs without any mistakes. Any algorithms book e.g. Cormen et al., also my own lecture notes from Cape Town (CT course notes).
  8. exact matching: Boyer-Moore algorithm (only non-computer-science background). Another exact string matching algorithm.
  9. exact matching: Karp-Rabin algorithm (computer science background): A very nice exact matching algorithm which uses arithmetics for exact matching of strings. See the Cormen book.
  10. string similarity/distance measures in comparison:
    Which are the ones in use, what are the differences, where are they employed? This is a fairly independent project. I can give you material, of course (e.g. BielefeldLN ch. 3+4, but only partially!). The idea here is to do examples, see what some of the different measures do, compare their computation times, list applications where they are used.
  11. suffix trees (2-3 people): BielefeldLN, ch. 9
    A wonderful data structure which allows to solve many problems on strings much faster, e.g. exact matching, finding palindromes, finding repeats, finding unique substrings. Depending on the number of people, we will decide on the details.
  12. suffix arrays: BielefeldLN, ch. 10
    Similar to suffix trees but uses less space.
  13. Burrows-Wheeler-Transform (student suggestion) (computer science background). BielefeldLN, ch. 11

  14. Other bioinformatics topics:
  15. BLOSUM matrices (1 person): S. Eddy, Where did the BLOSUM62 alignment score matrix come from?, Nature Biotechnology 22, 1035 - 1036, 2004), also described in several books (details on request).
    We looked in detail into PAM scoring matrices. BLOSUM matrices are the other group which is frequently used for alignment of proteins. How are BLOSUM matrices computed? Which are differences to PAM?
  16. BLAST: theory and practice.
    This project should present both the theory (how does it work, what is/are the underlying algorithm(s)), and the practice (how is it used, what does the output mean, on the NCBI website; second part should be a kind of guide to using the website).
  17. Phylogenetics: the Neighbor Joining algorithm (NJ). Durbin et al, ch. 7.3, also the Lecture Notes on Phylogenetics from Bielefeld: BielefeldPhylo, ch. 6.3.
    This is a frequently used algorithm for distance-based data. We will not have time to do it in the lecture, we will only do UPGMA. Explain and analyze NJ, what kind of data is it good for, etc.
  18. HMMs: Hidden Markov Models (2-3 people).
    This is mathematical model for interpreting variation in biological sequences, and can be used for such things as gene prediction. The best explanation is in the Durbin et al book, ch. 3+4, but it's very long; there are also my course notes on HMMs which I hope explain the topic well, see here.
  19. Branch-n-bound for Large Parsimony (1 person).
    Felsenstein ch. 4. This is a runtime heuristic (i.e. in reduces the running time in many but not all cases - by limiting the search space) for finding a most parsimonious tree.

Here you can get back to the main page of the course.