Bioinformatics
Institute of Computer Science
University Freiburg
de

Diploma / Master / Bachelor theses and Projects

In the following we list available, currently processed, and finished theses and student projects. When looking for a topic, please check not only the available topics but also the processed and recently finished topics. There might be unannounced but available follow up theses or projects that are not yet announced.
So if you find a topic interesting, please contact the corresponding supervisor for further information.

Bioinformatics is a highly specialized application area of computer science and biology and to successfully solve research questions in this field, you require a lot of interdisciplinary knowledge. Therefore, to do a Master thesis with us, we have the minimum requirement that you have attended one of our teaching courses. We may also ask you to present an introductory talk about your chosen topic (given material provided by us) before we can accept you. This does not apply to Bachelor theses or projects.

Open Topics

iCLIP Replicates Quality Control Visualization in Galaxy

Supervisor: Torsten Houwaart

iCLIP is a technology in next generation sequencing which helps identify interaction patterns between proteins and RNA. The goal of this project is the development of a tool which visualizes the quality of single iCLIP libraries and allows researchers to check how well iCLIP replicates correlate with each other. One important aspect of this project is the integration of this tool on our Galaxy server.

Requirements

References
  1. iCLIP pubmed link
  2. usegalaxy.org
Can be extended to a Master thesis or combined with the topic below for a group team project.

For degree: Bachelor / Project
Status: open

Evaluation of PCR-biases in CLIP-seq and single-cell RNA-seq data

Supervisor: Daniel Maticzka

High-throughput sequencing is a widely used technology for determining the sequences of a large pool of DNA fragments. When the amount of available DNA is small, the input library has to be highly amplified prior to sequencing (using PCR). The goal of this project is to determine and visualise amplification biases (sequence length, GC-content, ...) based on publicly available data prepared using Unique Molecular Identifiers (UMIs).

Requirements

For degree: Bachelor / Project
Status: open

Development of a benchmark set for HT-SELEX analysis tools

Systematic Evolution of Ligands by EXponential Enrichment (SELEX) is an experimental technique to find oligonucleotides (aptamers) that show a strong binding to a specific target, e.g. a protein. Especially, when the SELEX experiment is paired with high-throughput sequencing data, it is preferable to have a computational tool, like FASTAptamer [1] und AptaTRACE [2], that predicts the best aptamers from the data. However, until now there exists no gold-standard benchmark set for the evaluation of such tools. The task of this project would be to create such a benchmark set and use this to compare available aptamer-prediction tools based on this benchmark set. Please contact Teresa Müller if you have any further questions about the topic.

Requirements

References

  1. Khalid K. Alam et al. FASTAptamer: a bioinformatic toolkit for high-throughput sequence analysis of combinatorial selections. Mol. Ther. Nucleic Acids. pubmed link
  2. Phuong Dao et al. AptaTRACE Elucidates RNA Sequence-Structure Motifs from Selection Trends in HT-SELEX Experiments. Cell Systems pubmed link

For degree: Bachelor / Project
Status: open

Implementing new features for RNA-RNA interaction prediction

Supervisor: Martin Mann

Our group develops the tool IntaRNA (see PhD-thesis A. Richter for details), which is one of the state-of-the-art programs for RNA-RNA interaction prediction.
We are currently reimplementing and extending the tool (C++, c++11, boost, doxygen, autotools, openmp) that is hosted on Github BackofenLab/IntaRNA.
Within the development process, we offer various student projects covering different aspects of the project. For a list of open topics, please refer to "student project" marked issues @github. If you are interested, please contact Martin Mann. Most topics can be adapted to be suitable for a student project, bachelor, or master thesis.

For degree: Bachelor / Project / Master
Status: open

Implementing and testing new features for MICA

Supervisor: Martin Mann

Our group develops the tool MICA, which enables Multiple Interval-based Curve Alignment of arbitrary curve/profile data.
MICA is implemented in Java-8 and also provides an R interface for its integrated use in R.
Within the development process, we offer various student projects covering different aspects of the project. For a list of open topics, please refer to "student project" marked issues @github. If you are interested, please contact Martin Mann. Most topics can be adapted to be suitable for a student project, bachelor, or master thesis.

For degree: Bachelor / Project / Master
Status: open

Implementing bioinformatics algorithms for teaching

Supervisor: Martin Mann
Assigned to: Alexander Mattheis (further assignments possible!)

Within the last years, we have created interactive implementations of various algorithms discussed in our lectures. These are freely available at the Freiburg RNA tools - Teaching section of our public webserver.
The algorithms are implemented in Javascript and are accompanied with according visualizations to better understand each approach.
Thus, we are looking for students to extend the list of available algorithm implementations. If you are interested, please contact Martin Mann.

For degree: Project
Status: open (multiple assignments possible!)

SSE optimization of IntaRNA

IntaRNA is an algorithm for the prediction of RNA-RNA interactions (see PhD-thesis A. Richter for details). Right now a new version of IntaRNA is in development and we want to optimize the C++ source code as good as possible. This includes the usage of modern low level instructions like they are given by SSE. In this student project the IntaRNA source code will be evaluated and optimized with SSE to increase the performance of the algorithm.

Requirements

For degree: Bachelor / Project
Status: open

Further projects offered by the Galaxy team on GitHub

Supervisor: The Galaxy team

The Galaxy team also offers some projects not specified here but lists these under Björn Grüning's GitHub profile under project-ideas/issues. Please contact the person that opens the issue, or Björn Grüning.

For degree: Bachelor / Project / Master
Status: open

Docker based RNA-analysis workbench

Supervisor: Björn Grüning

You are interested in bleeding edge Linux-Kernel-Technologdy and virtualization?
You want to help to distribute software packages in a OS-independend way?
Than you can help us to solve the deployment problems of scientific software in a general way.
That project will use Docker [1], an open source project that automates the deployment of applications, to produce self-contained images (containers). These containers are OS independent, versioned (like a git-history) and easy to use, which enables reproducibility of research results and easy deployment of entire software stacks.
Prerequisites: Linux/Unix, Bash, autotools

[1] https://www.docker.io

Team-Project: can be combined with the "Graph visualization framework" and the "Galaxy Tool integration" project

For degree: Bachelor / Project / Master
Status: open

Galaxy Tool integration

Supervisor: Björn Grüning

Galaxy is an open, web-based platform for data intensive research. The University of Freiburg is running a Galaxy server to serve all different needs of our researchers. In addition to the common Next-Gerneration-Sequencing Tools, we offer Tools for cheminformatics [1], proteomics and RNA bioinformatics. To integrate an apllication into Galaxy, a thin wrapper between the Galaxy API and the targeted application needs to be written. Here usability is key. Good wrappers are easy to use and abstracting complicated application details. As part of our Galaxy project we are permanently seeking for motivated tool-wrappers that are enthusiastic about usability, want to work with a vibrant community to make Bio- & Cheminformatik Tools accessible for more researchers. The overall aim is to put the developed wrapper in the Galaxy Tool Shed [2], a Galaxy Appstore, where everyone can get there favorite application with a few mouse clicks.
Prerequisites: XML, Bash, autotools, Python

[1] https://github.com/bgruening/galaxytools/tree/master/chemicaltoolbox
[2] https://wiki.galaxyproject.org/Tool%20Shed

Team-Project: can be combined with the "Graph visualization framework" and the "Docker" project

For degree: Bachelor / Project / Master
Status: open

Further Topics ...

Further Topics are available on request.

If you have a suggestion for a topic you are interested in, do not hesitate to contact us. Otherwise, the completed theses may lead you.

Status: to find

Topics in Progress

Interactive molecule design based on graph grammars

Supervisor: Stefan Mautner
Assigned to:Ekaterina Polkh

GraphLearn is a program that learns how to design graphs given examples. It is inferring graph grammar rules (these are not unlike string grammars) from the set of given graphs and applies a Markov chain Monte Carlo (MCMC) sampling scheme.
In this project you will build a new molecular design software using graph grammars (extracted by GraphLearn) where sampling is done by human interaction via a graphical user interface.

Prerequisites

For degree: Project
Status: in progress

Prediction of non-consecutive RNA-RNA interactions

Supervisor: Martin Mann
Assigned to: Rick Gelhausen

Within this project, the state-of-the-art RNA-RNA interaction prediction algorithm implemented in the tool IntaRNA is to be extended. So far, IntaRNA can predict single-site interactions. The project will extend both the underlying theoretical work as well as the IntaRNA implementation to enable the prediction of multi-site interactions. This covers in the first part the extension of the according recursions and their theoretical evaluation. In the second part, the developed theoretical extensions are to be implemented, tuned, and tested. Project requirements

For degree: Project
Status: in progress

Large-scale clustering of non-coding RNAs in the Galaxy framework

Assigned to: Eteri Sokhoyan

Clustering of putative RNAs is currently the major approach for functional annotation of putative ncRNAs detected in genome-wide screens. GraphClust is one of the few approaches that can cluster hundreds of thousands putative ncRNAs as it is based on an alignment-free approach using an advanced graph kernel. The candidate clusters are iteratively retrieved and refined using RNA alignment tools. However the clustering pipeline requires in-depth knowledge as several tools have to be installed and configured.
The goal of this project is an extension of the GraphClust tool using Galaxy framework that makes it possible to (a) perform the clustering of RNAs via a web interface, (b) run the computations on various operating systems and computation frameworks, (c) freely customize and extend the generic pipeline for specific needs. The project involves also attempts to apply the Galaxy workflow on a metatranscriptome dataset.

For degree: Master
Status: completed

Integration of BioJS into Galaxy

Assigned to: Dimitar Jovanov und Aveg Chaudhary

Galaxy is an open, web-based platform for data-intensive research. The University of Freiburg is running a Galaxy server to serve all different needs of our researchers. Visualization is a key aspect in the understanding of data analysis for medical and biological research. The Javascript library BioJS provides powerful visualization of multiple biological data. The overall aim is to integrate specific BioJS modules into Galaxy via its plugin architecture.

For degree: Project
Status: in progress

Closed Topics

Characterization of ribosomal footprints with use of graph kernel based approaches

Assigned to: Soraya Nikousokhan

Ribosome profiling is an emerging technique that with use of deep sequencing methods, gives new insight to translation of proteins from single codon to genome scale. In comparison to former available methods microarrays and RNA-seq, Ribo-seq solely considers active mRNAs at translation phase in a cell which prepare information for protein synthesis. This novel charac- teristic of Ribo-seq provides new data with focus on translation level. The obtained patterns of ribosomal footprints may reveal new aspects in trans- lation field. The aim of this work is to classify Ribo-seq profiles according to different conditions and find clusters with respect to Ribo-seq profiles. This is done by a tool named BlockClust, which is based on a graph kernel method called Neighborhood fast graph kernel (NSPDK). BlockClust en- codes expression profiles data to graphs format and employ NSPDK method for achieving a high performance. Although BlockClust previously applied for clustering non-coding RNAs from their RNA-seq expression profiles, it can also be adapted to use for clustering and classification tasks on other types of data e.g. Ribosome profiling. Therefore, we have adapted Block- Clust by defining new attributes for finding patterns in Ribo-seq data and adding them to the former available set of attributes. Moreover, we per- formed an optimization by using different parameter sets. Furthermore, we showed that it is possible to employ BlockClust on Ribosome profiles. We achieved a good performance in classification of these profiles.

For degree: Master
Status: completed 28.10.2016 (pdf is private due to unpublished data)

Approximate nearest neighbor query methods for large scale structured datasets

Supervisor: Fabrizio Costa
Assigned to: Joachim Wolff

The task of efficiently finding the most similar representatives in a large set of entities is at the core of many problems in a variety of applications, ranging from chemoinformatics to recommendations systems; when the objects of interest are structured entities the problem becomes harder. In these cases structured instances are explicitly converted in sparse vectors that live in very high dimensional spaces (even millions of features). Exact algorithms have unfortunately a computational complexity that scales quadratically with the number of instances times the representation length of each instance, hence these approaches cannot be used when we have a large number of structured instances. A possible solution is to accept approximate results to gain efficiency. The candidate will extended one such approximate technique (the MinHash approximate nearest neighbor scheme) to efficiently solve the neighbor query in sub-linear time. The overall goals of the thesis were to provide an efficient and simple to use implementation for approximate nearest neighbor queries for large collections of high dimensional sparse vectors.

For degree: Master
Status: completed 27.6.2016 ( pdf )

Learning to design RNA polymers with graph kernels

Supervisor: Fabrizio Costa
Assigned to: Stefan Mautner

Graph data structures allow us to model complex entities in a natural and expressive way. In the machine learning literature, several types of discriminative systems that can deal with graphs in input are known (e.g. recursive neural networks, graph kernels, graphical models, etc), however, there are few generative approaches that can sample structures belonging to a desired distribution or class. The task of generating samples from a given distribution when this is accessible only via a finite number of examples is well developed when the domain of interest can be embedded in a vector space. The extension of these approaches to structured domains (i.e. where instances are strings, trees, graphs or hyper graphs) is however substantially less developed. One approach for learning constructive systems is based on a variant of the Metropolis Hastings (MH) algorithm guided by an efficient graph grammar, which, crucially, can be efficiently induced from an example set. Such a neighborhood graph based grammar is suitable when the feasibility constraints are local in nature. RNA polymers, which form structures comprising hundreds of nodes (nucleotides), exhibit however dependencies between distant portions of the structure. In order to extend the constructive system to the RNA domain, Mr. Mautner has introduced a multi level strategy based on a notion of graph minors, i.e. graphs obtained by edge contraction operations. An edge contraction is an operation which removes an edge from a graph while simultaneously merging the two vertices that it previously joined. By carefully defining a domain dependent contraction strategy, Mr. Maunter was able to operate on smaller graphs for which local rules are sufficient to capture the feasibility constraints.

For degree: Master
Status: completed 25.1.2016 ( pdf )

Reinforcement learning techniques in RNA inverse folding

Supervisor: Fabrizio Costa
Assigned to: Parastou Kohvaei

A non-coding RNA molecule functionality depends on its structure, which in turn, is determined by the specific arrangement of its nucleotides. The inverse folding of an RNA refers to the problem of designing an RNA sequence which will fold into a desired structure. This is a computationally complex problem. Algorithms which solve this problem take different approaches, but they share the following attitude: They start from an initial sequence or population and try to move it towards a desired product by performing normal or optimized search methods. RNA inverse folding programs are given different constraints such as GC-content ranges or basepair or nucleotide configurations. The output is normally one or more sequences which fold to the target structure.
This work introduces a basic system that given a set of sample RNA secondary structures, produces models which generate structures similar to the sample set. The objectives and constraints are automatically extracted from samples. For doing this, a system is designed which generates models by performing learning on families of RNA sequences. This system consists of two subsystems: one responsible for decomposing secondary structures of sample RNAs into structural features and building a structural features corpus. It also extracts neighborhood connectivity models of structural features in the form of N-grams. The other subsystem is a reinforcement learning framework which uses the corpus and connectivity rules to produce models for generating structures which are similar to the samples.
Results in this work show that the current system is able to produce models from RNA families which have a symmetric shape. To make the system capable of dealing with a broader range of RNA families and producing structures with functionalities identical to the sample structures, a refined feature extraction module has been added to the system. This module extracts the GC-content, size and local information of structural features and builds a refined feature corpus. This can provide the basis for a new set of experiments and a start point for producing models with practical applications.

For degree: Master
Status: completed 26.08.2015 ( pdf )

Explorative Enumeration of large energy landscapes

Supervisor: Martin Mann
Assigned to: Gregor Entzian

The energy landscape framework enables the study of the folding kinetics of molecules. For instance the structure formation process of single RNA molecules or the interaction formation of two RNAs. To this end, transition probabilities of one structure to possible successive structures have to be identified. Unfortunately, there is an exponential growth of possible structures a molecule can adopt and accordingly an exponential growth of the energy landscape. One approach to face this problem is to group structures into "macro-states" and to consider only transitions between such structure ensembles. But their number is often still too large to enable kinetics computations.
We have developed an explorative enumeration strategy for such macro-state partitionings of energy landscapes (see our article). Within the Master thesis project of Bettina Hübner we have investigated and evaluated different guiding strategies for such an approach
Within this project, the explorative enumeration approach is to be implemented for RNA energy landscapes.

Project Outline

For degree: Master
Status: completed 19.5.2015 (pdf is private due to pending publication)

Investigating LocARNA parameter search space by using automatic configuration methods

Supervisor: Milad Miladi
Assigned to: Teresa Müller (ALU Freiburg)

In recent years many novel RNA species have been discovered by new sequencing techniques. The correct classification of these RNAs into new and existing families heavily relies on accurate sequence-structure alignment tools, which makes it desirable to constantly improve their alignment quality. Therefore, having a high-performing RNA alignment tool is of fundamental importance in the field of computational biology.
LocARNA implements an efficient heuristic version of Sankoff's accurate but computationally expensive algorithm for simultaneous sequence and structure alignment. The use of heuristics makes the algorithm applicable in practice, but also forces the inclusion of many additional parameters. Since the performance of an algorithm depends on the parameter setting, it is desirable to optimize these settings in order to improve alignment results. One way to find optimal parameter configurations is to use an automtic algorithm configuration technique.
In this work the state of the art algorithm configuration tool SMAC is applied to improve LocARNA 's default parameter settings. The optimization focuses on fundamental parameters of the LocARNA algorithm. Both global and local alignment cases are covered, although for the local case this marks the first in-depth optimization attempt. Hence this work also introduces a complete local alignment parameter optimization pipeline for LocARNA.
As a result, improved default parameter settings as well as different input scenario settings for both the global and local alignment cases are proposed. Notably, the average alignment quality of the local case on an extension of the Bralibase dataset was improved up to 26%. In conclusion, the presented work not only managed to optimize LocARNA 's local alignment but also provides a solid foundation for further works on parameter optimization using the implemented pipeline.

For degree: Master
Status: completed 2015 (pdf is private due to pending publication)

Graph-based clustering of CRISPR-Cas systems

Supervisor: Omer Alkhnbashi
Assigned to: Sarah Alfear (ALU Freiburg)

The CRISPR-Cas system (CRISPR: clustered regularly interspaced short palindromic repeats, Cas: CRISPR-associated proteins) is an adaptive immune system in bacteria and archaea, which provides resistance against invading viruses and plasmids. CRISPRs consist of identical repeats that are between 20 to 40 base pairs in length, separated from each other by unique spacer sequences of similar length (30 to 50 base pairs). Most CRISPR arrays are flanked on the upstream (5') side by a leader sequence of 200 to 500 base pairs. These leaders often contain low complexity sequences and are rarely conserved between more distantly related species. Finally, there are the cas genes, which are often located directly up- or downstream of CRISPR array, however, they can also be found in very different locations. These genes encode protein complexes which work together with CRISPR arrays to confer the host cell with an adaptive immune system to fight invading viruses and plasmids. Recently, CRISPR-Cas systems were classified based on their associated Cas proteins. Based on this classification, which integrates phylogeny, sequence, locus organisation, and content, three types have been distinguished: Type I, Type II, and Type III CRISPR-Cas systems. The classification of Cas proteins has historically been proven difficult because of the diversity of the proteins involved.
The aim of this work is to cluster/classify all CRISPR-Cas systems based on both associated Cas proteins and repeat sequence using all available bacteria and archaea genomes. To do this, you will use the in-house graph kernel approach (EDeN) to identify clusters of related CRISPR-Cas systems and you will compare results with available classification systems.

Project Outline

For degree: Project
Status: completed 2015

Learning to Construct Graphs with Real Vector Attributes Using Graph Kernels

Supervisor: Fabrizio Costa
Assigned to: José Luis Licón Saláiz

Graph data structures allow us to model complex entities in a natural and expressive way. In the machine learning literature, several types of discriminative systems that can deal with graphs in input are known (e.g. recursive neural networks, graph kernels, graphical models, etc), however, there are few generative approaches that can sample structures belonging to a desired distribution or class. The task of generating samples from a given distribution when this is accessible only via a finite number of examples is well developed when the domain of interest can be embedded in a vector space. The extension of these approaches to structured domains (i.e. where instances are strings, trees, graphs or hyper-graphs) is however substantially less developed. While specialized applications exist, e.g. sampling phylogenetic trees, sampling dependency graphs for structural learning in graphical models, or sampling large Web like networks, data driven approaches that can deal with general types of graphs, are still in their infancy. Important applications of a successful generative graph system include the de-novo generation of molecular graphs for drugs and RNA biopolymers with user defined properties derived from prototypical natural examples. In these cases the spatial information of the atom arrangement becomes important for the determination of the associated physicochemical properties. There is therefore the necessity to upgrade these generative graph systems to deal with graphs that can encode spatial information in the form of multiple real valued attributes (e.g. 3D coordinates, distances, angles). In the Thesis the candidate will address the constructive learning problem using a variant of the Metropolis Hastings (MH) algorithm tailored for structural data types. She will upgraded the efficient graph grammar approach of a pre-existing code base to deal with graphs with real valued attributes.

For degree: Master
Status: completed 16.12.2015 ( pdf )

A graph kernel approach to the identification and characterisation of structured noncoding RNAs using multiple sequence alignment information

Supervisor: Fabrizio Costa
Assigned to: Mariam Alshaikh

Structured noncoding RNAs perform many functions that are essential for protein synthesis, RNA processing, and gene regulation. Structured RNAs can be detected by comparative genomics, in which homologous sequences are identified and inspected for mutations that conserve RNA secondary structure. To detect novel RNA classes in bacteria and archaea, a variety of bioinformatics strategies have been used, e.g. looking in upstream regions of protein coding genes for cis-regulatory RNAs. To identify ncRNAs independently from protein coding genes, Z. Weinberg has proposed a computational pipeline based on an initial BLAST clustering further refined by looking into secondary structures with CMfinder. The identified structures are then used in homology searches to find homologues that allow CMfinder to further refine its structural alignment. The resulting alignments are scored and then analysed manually to identify the most promising candidates and to infer possible biologic roles.

For degree: Master
Status: completed 13.05.2015 ( pdf )

Interactive de novo molecular design

Supervisor: Fabrizio Costa
Assigned to: Mohammed J. Chbeib

Synthesis of small molecules that improve on the curative properties of existing drugs or that are effective in previously untreatable illnesses is a very hard task, a task on which pharmaceutical companies are investing enormous amounts of resources. Computational methods become therefore an interesting solution when they can effectively replace the time consuming and expensive design, synthesis and test phases. Since de novo molecule-design systems have to explore a virtually infinite search space, exhaustive searching is infeasible, and they typically resort to local optimisation strategies. To date, one of the most critical aspects is the reliability of the evaluation function invoked to judge the quality of molecules that can be (and generally are) very different from those used in the function induction phase. One possible approach to overcome this difficulty is to integrate the expert knowledge of (medicinal) chemists in the evaluation loop. Doing so in an efficient way is not a trivial task, since one has to 1) minimise the number of times the system resorts to the expensive human oracle, and 2) use a form of interaction suitable for humans.

For degree: Master
Status: completed 5.06.2014 ( pdf )

CRISPRloci visualization

Assigned to: Mallikarjun Nuti (ALU Freiburg)

Bacteria and archaea are known to acquire immunity against viruses and plasmids through a widely conserved RNA-based gene silencing pathway. This mechanism involves non-coding RNA that originates from Clustered Regularly Inter-spaced Short Palindromic Repeats, and CRISPR-associated proteins (CRISPR-Cas system). CRISPRs consist of identical repeats that are between 24 to 47 base pairs in length, separated from each other by unique spacer sequences of similar length (27 to 72 base pairs). Most CRISPR arrays are flanked on the upstream (5') side by a leader sequence of 100 to 500 base pairs. These leaders often contain low complexity sequences and are rarely conserved between more distantly related species. Finally, there are the Cas genes, which are oftenlocated directly up- or downstream of CRISPR array, however, they can also be found in very different locations. These genes encode protein complexes which work together with CRISPR arrays to confer the host cell with an adaptive immune system to fight invading viruses and plasmids. The aim of this work is to develop web server tool for visualizing CRISPR/Cas systems features on/in genome.

Project Outline

For degree: Project
Status: completed 2014

RNA energy landscapes with pseudoknot structures

Supervisor: Martin Mann
Assigned to: Norah Schnorr

Most studies of RNA kinetics use nested structure models to enable at least moderate sequence lengths. Nevertheless, there is evidence that pseudoknot structures are important for the function of some RNA molecules. Thus, ommitting them in kinetics fosters wrong results. This project will compare kinetics based on energy landscape with and without pseudoknot structures. Furthermore, new strategies have to be explored in order to face the vast increase of the landscape size to enable reasonable studies.

For degree: Master
Status: completed 2014

Explorative Enumeration of large energy landscapes

Supervisor: Martin Mann
Assigned to: Hamdi Mohammed Yahya Aldaoos

The energy landscape framework enables the study of the folding kinetics of molecules. For instance the structure formation process of single RNA molecules or the interaction formation of two RNAs. To this end, transition probabilities of one structure to possible successive structures have to be identified. Unfortunately, there is an exponential growth of possible structures a molecule can adopt and accordingly an exponential growth of the energy landscape. One approach to face this problem is to group structures into "macro-states" and to consider only transitions between such structure ensembles. But their number is often still too large to enable kinetics computations.
We have developed an explorative enumeration strategy for such macro-state partitionings of energy landscapes (see our article). Within the Master thesis project of Bettina Hübner we have investigated and evaluated different guiding strategies for such an approach
Within this project, the explorative enumeration approach is to be implemented for RNA energy landscapes using and extending the already available algorithms from the Energy Landscape Library (ELL).

Project Outline

Similarity notions for RNA kinetics comparison

Supervisor: Martin Mann
Assigned to: Gregor Entzian

For larger RNA molecules it is often not computationally feasible to enumerate their whole energy landscape. Thus only partial fews of the landscapes are used to compute the kinetics of the respective molecule. Within this project, different strategies are explored to measure the similarity of kinetics, i.e. to evaluate how well the coarse grained model reflects the kinetics based on the complete energy landscape information.

For degree: Project
Status: completed 09.2014

Generating a local ncRNA benchmark set to evaluate local RNA alignment tools

Assigned to: Teresa Müller (ALU Freiburg)

Multiple local alignment of RNA sequences is by now still a challenging problem as parameters for already existing tools are not optimized yet for the local alignment case. The first step to solve this problem is the generation of a local benchmark set to be able to evaluate existing local RNA alignment tools. The main part of this work is the implementation of a pipeline to append genomic context of a given length to an already existing (global) benchmark set. A simple evaluation of LocARNA on the local ncRNA benchmark set and a random test set will be performed.

For degree: Project
Status: completed 2014

Differential Benchmarking of CopraRNA - Finding the optimal input for a specific question

Assigned to: Elham Abbasian (ALU Freiburg)

Small and typically non coding RNAs play a central role in the regulation of protein biosynthesis in prokaryotic organisms. Uncovering their regulons in the lab is costly and time intensive. Recently we developed an algorithm (CopraRNA) to comparatively predict their direct target mRNAs and consequently their regulons. We benchmarked the algorithm with one set of organisms (enteric bacteria). This benchmark yielded results which significantly outperformed other state of the art algorithms (IntaRNA, TargetRNA, RNApredator). This project aims at benchmarking CopraRNA with sets of several differential input data in order to draw conclusions on which properties the organism dataset should initially have.

Project Outline

For degree: Project
Status: completed 2014

Java GUI for Multiple Interval-based Curve Alignments (MICA)

Assigned to: Matthias Beck (ALU Freiburg)

We have developed a new method to to identify the best pairwise alignment of two curves and an extension to compute a multiple alignment of a set of curves, namely the Multiple Interval-based Curve Alignment (MICA). A prototype of the method was implemented in R.
Within this project, we want to reimplement and extend the MICA method in concert with a Graphical User Interface (GUI).

Project Outline

For degree: Master
Status: completed 17.02.2014 ( pdf )

Improving miRNA target prediction in humans using a highly descriptive graph-based, machine-learning model

Assigned to: Michael Uhl (ALU Freiburg)

Seven years after the Nobel Prize in Medicine was given to Andrew Fire and Craig Mello for their initial discovery of RNA interference back in 1993, computational prediction of miRNA target interactions remains a persistently tough challenge, with little performance improvement over the past few years. The problem is particularly apparent in animals, where, opposed to plants, limited complementarity of the miRNA-target hybrid is sufficient for the interaction to exhibit its regulatory effect. Ever since the introduction of the first computational prediction programs, increasingly sophisticated algorithms have tried to tackle this issue, incorporating more and more predictive features obtained from observing experimentally verified miRNA target interaction sites. Even though, as usual, the molecular mechanisms seem to be more diverse and complicated than expected just a few years ago, there are also improvements that raise researcher's hopes: Recently, more and more data sets of several high-throughput methods have become available, allowing for transcriptome-wide interaction mapping as well as precise localization of target sites. This in turn enables us to use these data and learn novel features and modes of miRNA action that can be utilized to build better performing prediction algorithms.

Project Outline In this study, we particularly want to use the popular PAR-CLIP data set that consists of ~ 17000 experimentally observed human miRNA-target interaction sites to learn novel features from the reported interaction sites using machine learning techniques. The features of possible interest consist of sequence and secondary structure information, protein binding sites, positional and locality data. All features will be integrated into a graph model to enable a more complete description of the miRNA and mRNA interaction sites. The overall goal will be to create an algorithmic pipeline for miRNA-target prediction, whose performance will then be evaluated and compared to some popular existing prediction algorithms. To further enhance prediction performance, additional data sets could be included as training sets in the course of the thesis.

For degree: Master
Status: completed 16.12.2013 ( pdf )

Pruning strategies for large energy landscapes

Assigned to: Bettina Hübner (ALU Freiburg)

The energy landscape framework enables the study of the folding kinetics of molecules. For instance the structure formation process of single RNA molecules or the interaction formation of two RNAs. To this end, transition probabilities of one structure to possible successive structures have to be identified. Unfortunately, there is an exponential growth of possible structures a molecule can adopt and accordingly an exponential growth of the energy landscape. One approach to face this problem is to group structures into "macro-states" and to consider only transitions between such structure ensembles. But their number is often still too large to enable kinetics computations.
Within this project, different approaches to prune the macro-state energy landscape represenation are tested in order to reduce the according transition encoding to a feasible size open for kinetics computations. The pruning strategies are subject to quantitative and qualitative evaluations concerning reduced computational requirements and preserved kinetics quality.

For degree: Master
Status: completed 16.10.2013

RNA Barcodes for High-Throughput Sequencing Experiments

Assigned to: Daniel Desirò (ALU Freiburg)

CLIP-seq is a method for genome-wide screening of interactions between RNAs and RNA-binding proteins. iCLIP is an extension of CLIP-seq that allows locating RNA-protein interactions with nuceleotide precision. iCLIP employs random sequence tags in to enable calculation of the number of binding events from PCR amplified source material. Errors introduced into these sequence tags during amplification or sequencing can lead to serious overestimation of binding events. This thesis examins the suitability of RNA barcodes developed for multiplex sequencing assays to prevent or mitigate this effect.

For degree: Bachelor
Status: completed 15.08.2013 ( pdf )

Graph-kernel based aromaticity prediction

Assigned to: Daniela Pütz (ALU Freiburg)

Chemical molecules can be represented as undirected labeled graphs. The Neighborhood Substructure Pairwise Distance Kernel (NSPDK) allows for a thorough description of graph features to enable machine learning based prediction tasks on such molecules. We have successfully applied the NSPDK-graph kernel in a first attempt to predict the aromaticity property of molecules using trained Support Vector Machines, an important task in chemoinformatics.
Within this project we want to extensively evaluate the existing approaches and the potential of our SVM-based aromaticity approach.

Project Outline

For degree: Bachelor
Status: completed 15.08.2013 ( pdf )

Atom mapping of chemical reactions via Constraint Programming

Supervisor: Martin Mann
Assigned to: Feras Nahar (ALU Freiburg)

The mass flow in a chemical reaction network is determined by the propagation of atoms from educt to product molecules within each of the constituent chemical reactions. The Atom Mapping Problem for a given chemical reaction is the computational task of determining the correspondences of the atoms between educt and product molecules. We have introduced a Constraint Programming approach to identify atom mappings for ``elementary'' reactions (see article). These feature a cyclic imaginary transition state (ITS) imposing an additional strong constraint on the bijection between educt and product atoms. The approach uses Constraint Programming (CP) techniqus to identify only chemically feasible ITSs by integrating the cyclic structure of the chemical transformation into the search.
Within this project, the CP-approach for the identification of chemically correct atom mappings has to be implemented and extended.

Project Outline

For degree: Master
Status: completed 17.07.2013 ( pdf )

Cluster based prediction of SH2 domain-peptide interactions using Graph Kernel

Assigned to: Vasumathi Jayakumar (ALU Freiburg)

Functional roles of peptide recognition modules (PRMs) are a key for understanding of various cellular process. Protein-protein interactions are often mediated by a class of PRMs such as SH2 and PTB domains. These domains typically bind with a phosphotyrosine containing peptides to provide a key contribution in signal transduction. Computational methods that are capable of modeling the specificity of various PRMs are therefore of high interest.

Project Outline The Neighborhood Substructure Pairwise Distance Kernel (NSPDK) allow us to extract explicit features, occurance count of the all possible pairs of near small neighbourhood subgroups. The classification would be performed by a support vector machine (SVM) based on the NSPDK graph kernel. For making efficient and powerful models that define PRMs specificity, one should consider the high-throughput data.

For degree: Project
Status: completed 28.06.2013 ( slides.pdf )

Large Scale Activity Profile Induction for Small Molecules

Supervisor: Fabrizio Costa
Assigned to: Kaswara Kraibooj (ALU Freiburg)

We are witnessing an increase of the speed at which the result of large and small scale bioassays is made available in public repositories such as PubChem through programmatic interfaces.
It is known in literature that activity patterns w.r.t. predefined bioassays contain valuable information, and can be used to define a molecular representation in the Activity Space.
Unfortunately the matrix of known compounds vs. experimentally verified bioassays results is very sparse, as most bioassays provide information for only few hundreds to thousands compounds. This in turn implies an incomplete representation in the activity space for most of the compounds. In order to extend such representation to all compounds we propose to employ in-silico models to predict the bio-activity for all compounds. The choice of computational models suitable for this task has to take into consideration several requirements:

  1. efficiency in the train and in the test phase: some bioassays with hundreds of thousands instances are available; in the test phase 30M compounds have to be screened;
  2. accuracy: the predicted activity profile has to be sufficiently close to the true activity profile to provide a reliable localization of compounds in activity space;
  3. semi-supervised mode of training should be possible: since many bioassays contain information only for few tens to hundreds compounds it is necessary to make the best use of the vast amount of unsupervised information available;

In this thesis the candidate will use a graph kernel (NSPDK) to train a linear max margin model via fast stochastic gradient descent technique. The candidate will set up the necessary infrastructure to perform and monitor the in-silico predictions and develop novel techniques for large scale semi-supervised problems in the chemoinformatics field.

For degree: Master
Status: completed 06.05.2013 ( pdf )

Analysis of CLIP-seq and PARCLIP data for Argonaute to identify miRNA target sites

Supervisor: Sita Lange
Assigned to: Michael Uhl (ALU Freiburg)

Since the discovery of microRNAs there has been a drive to identify targets to elucidate their regulatory function. The main features to predict microRNA targets are the binding of a seed region in the microRNA to the target, hybridisation energies, and conservation. Computational methods, however, still predict hundreds of targets per microRNA and many are incorrect. To improve knowledge on microRNA function and the accuracy of computational prediction methods, we need to know more about the targeting mechanism and what features can be exploited to improve prediction. This knowledge can only be deduced from experimentally validated interaction sites. Recently, high-throughput methods have been developed to identify the binding sites of RNA-binding proteins by combining UV cross-linking of the protein to the RNA, immunoprecipitation, enzymatic decay, and subsequent sequencing of protected RNA sequences. To identify microRNA target sites, the method targets Argonaute proteins that are bound to the microRNA within the RISC complex. Thus, we gain sequences that are bound to the Argonaute protein in the cell, however, we do not know which microRNA is involved. Therefore, the respective microRNAs binding to the identified sequences need to be identified.

Project Outline The goal of this topic is collect PARCLIP and HITS-CLIP data for mammals and then to develop quality measures to score how likely a microRNA binds to an identified sequence. Furthermore, general properties and statistics of the different datasets are to be explored so that a full understanding of the different characteristics of such data is understood.

For degree: Project
Status: completed February 2013

Learning binding preferences of RNA-binding proteins using in vitro affinities and in vivo binding sites

Assigned to: Mesbahuddin Anwari (ALU Freiburg)

RNA-binding proteins (RBPs), which bind to RNA, regulate the translation of RNA and play important roles in post-transcriptional gene expression processes including mRNA splicing, export, stability and mRNA localization. Identifying the binding preferences for a great number of RBPs is not sufficiently achieved, in spite of their important roles in the gene expression. RNA sequences associated with particular RBPs have been favorably characterized by in vitro and in vivo assays; although, particularly designed methods for modeling binding preferences of RBPs are required to deduce an RBP sequence and its structural preferences. Since every method has its own strengths and weaknesses, choosing the right method for a given task is difficult. In this study, an analysis over existing modeling methods is introduced to aid in making the right choice. This analysis consists of several problem sets paired with reference results and a tool to evaluate the results of the estimator methods. Finally, we present the results of applying a benchmark to several methods that includes GraphProt, RNAContext and MatrixREDUCE. This work attempts to find the best way to generate the negative data along with the use of multiple replicates of CLIP-seq method experiments as an advantage to provide training data for applying a benchmark on the modeling methods.

For degree: Master
Status: completed 21.12.2012

Structural elements in long non-coding RNAs

Assigned to: Kaswara Kraibooj (ALU Freiburg)

Non-coding RNAs (ncRNAs) form a heterogeneous class of transcripts with little or no protein-coding capacity. Recently, it turned out that these molecules have a plethora of key regulatory roles in eukaryotic cells. NcRNAs directly act at the RNA level without ever being translated to protein. According to their length, one basically distinguishes small (<200bp) and long (>200bp) ncRNAs. The function of a small RNA is typically determined by its secondary structure fold rather than underlying primary sequence. There are several ncRNA classes among small ncRNAs with well defined and well understood secondary structure motifs, examples include micro RNAs (usually forming stem-loop structures) or transfer RNAs (which exhibit the prominent cloverleaf motif). In contrast, it is unclear to which extent long non-coding RNAs contain and are determined by regions of conserved secondary structure.

The aim of this work is to analyse secondary structures of long ncRNAs on a genome-wide scale with state-of-the-art bioinformatic techniques, to possibly identify and further characterise common structural elements shared by these transcripts. This may yield novel insights to the computational de novo prediction of long ncRNAs in recently sequenced eukayotic genomes, one of the open problems in current RNA bioinformatics.

For degree: Project
Status: completed 25.06.2012 ( summary.pdf , slides.pdf )

De Novo Molecular Design Using Graph Kernels

Assigned to: Dragos Alexandru Sorescu (ALU Freiburg)

Synthesis of small molecules that improve on the curative properties of existing drugs or that are effective in previously untreatable illnesses is a very hard task, a task on which pharmaceutical companies are investing enormous amounts of resources. On average, development of a new drug takes 10 to 15 years and costs 400-800 million US dollars. Most of the effort though, is spent on investigating compounds that in the end turn out to be unsuitable because of bad ADMET properties. Since only one out of about 5000 screened drug candidates reaches the market, the pharmaceutical industry is looking for ''fail fast, fail cheap'' solutions, i.e. having fast, cheap methods of determining whether the drug candidate does or does not have suitable properties to be a drug and should be rejected. Computational methods become therefore an interesting solution when they can effectively replace the time consuming and expensive design, synthesis and test phases. Amongst such computational methods, those capable to perform de novo molecular design are particularly interesting (Schneider et al., 2005). These approaches produce novel molecular structures with desired pharmacological properties from scratch in an incremental fashion. Since de novo molecule-design systems have to explore a virtually infinite search space2, exhaustive searching is infeasible, and they typically resort to local optimization strategies. Commonly, a de novo design method has to address three questions: how to construct candidate compounds; how to evaluate their potential quality; and how to efficiently sample the search space. To date, one of the most critical aspect is the reliability of the evaluation function. Such function is in fact invoked to judge the quality of molecules that can be (and generally are) very different from those used in the function induction phase therefore leading to unreliable scores. The research project goal is to investigate novel molecular space search strategies. A local optimization algorithm starts with a feasible solution and iteratively tries to obtain a better solution by searching the neighborhood of the current solution. A critical aspect is therefore the choice of the neighborhood structure. Roughly speaking, the larger the neighborhood, the better is the quality of the locally optimal solutions, but the longer it takes to search the neighborhood. In order to define the neighborhood structure current approaches use either the notion of a single vertex/edge, or the notion of molecular fragment. In this latter case expert domain knowledge is used to define meaningful subgraphs (such as functional groups or ring structures) limiting though the search space to the current chemical knowledge available and defined in a machine readable/usable format. Here we want to investigate neighborhood structures for molecular graphs in terms of subgraphs defined implicitly through graph-theoretical notions (that are meaningful in the chemical domain) and not defined explicitly as a pre-specified set of fragments. In particular the thesis will focus on the fast graph kernel models based on the use of neighborhood subgraphs introduced in Costa et al. 2010. Here the subgraphs rooted in each vertex of a molecular graph and induced by all the vertices at bounded small distance from the root have been proved to yield information rich features.

For degree: Master
Status: completed May 2012 ( pdf )

Large scale multiple genome alignment via an efficient kernel method

Assigned to: Ammar Qaseem (ALU Freiburg)

In order to make use of the large amount of genomic information that the sequencing experiments are making available, efficient algorithmic procedures are needed. One of the most fundamental type of processing for genomic data is that of genome alignment, whereby regions belonging to several related genomes are put in bi-univocal correspondence. As a result of the alignment procedure, information of biological relevance can be derived, such as the evolutionary conservation rate of given regions. The sequences in these regions are believed to be important and to correspond to functional biological entities like proteins and non-coding RNA. Correct alignments allow, in other terms, the (semi-)automatic discovery of biological objects (either belonging to known classes, or even to yet unknown classes). However, current genomic alignment techniques 1) are suitable for relatively closely related species, and 2) can process a relatively small number of genomes. In order to allow alignments for thousands of genomes, novel efficient techniques are needed. The choice of computational models suitable for this task has to take into consideration several requirements, such as a) efficiency, b) accuracy and c) flexibility.

For degree: Master
Status: completed 10.04.2012 ( pdf )

Intersections of genomic intervals using interval trees

Assigned to: Mesbahuddin Anwari (ALU Freiburg)

Testing to find overlaps between genomic features is an important task in genomics research. We know this feature as intersection. In this project I implement a fast and exible method to find intersections between two sets of genomic intervals by using interval trees. The implementation(unionBed) uses sets of features in BED format as input data and find overlaps between them. Then the unionBed results data is used to analyse three different secondary structure prediction hypotheses for co-transcriptional RNA folding and to compare them to each other.

For degree: Project
Status: completed 15.04.2012 ( pdf )

hIntaRNA - Comparative prediction of sRNA targets in prokaryotes

Assigned to: Patrick R. Wright (ALU Freiburg)

The prediction of targets of bacterial sRNAs is a very challenging task, addressed by several approaches. The experimental testing and verification of sRNA targets is very costly and labour-intensive. Therefore, the reliable algorithmic prediction of putative sRNA targets could vastly reduce the amount of wet lab work. However, due to very short and often imperfect complementarity between the sRNA and its target the prediction is not a trivial task. The IntaRNA algorithm is one approach, which frequently, however, does not yield satisfying results yet and therefore demands improvement.
It has been stated "that it is difficult to make significant target predictions when searching sequences from a single organism, and that targets should be predicted in a comparative analysis of multiple organisms". Eventhough this was stated for eukaryotes, the basic idea of this thesis also holds for bacteria. The task of improving the IntaRNA algorithm's prediction quality utilizes exactly this concept, also incorporating the individual phylogenetic distances between the organisms analyzed. For instance, there is compelling evidence, that the MicA and RybB sRNAs in E. coli and Salmonella each have homologous targets in both organisms, thus indicating a conservation on the regulatory level. Here, the implementation of the idea that overlapping target predictions for distinct organisms yield stronger evidence of correct functional prediction is presented.

For degree: Diploma
Status: completed 14.03.2012 ( pdf )

Secondary structure motif determination in ncRNA via graph kernel based computational models

Assigned to: Kiran Kumar Telekunta (ALU Freiburg)

The function of ncRNA is determined by its nucleotide composition but also and foremost by its structure, i.e. the identity of paired and unpaired nucleotides. Key analytical tools such as folding, alignment and clustering algorithms, rely on energetic considerations to determin the optimal answer to the specific question they are designed for. These algorithms become inaccurate if the underlying assumptions (energy additivity) they are based upon is violated in reality, i.e. when non-linar effects have to be taken into account. To deal with these problems one can formulate various key questions in terms of non-­linar functional dependencies that can be learned from known examples (or parts of examples). Given the importance of the structural element in ncRNA these methods should ideally be able to work in structured domains i.e. should be able to accept in input graph data structures. The methods will belong to the family of kernel machines, since this class of algorithms allows to use heterogeneous features and to accept in input complex instances representation such as sequences or graphs. The aim of the thesis is to develop computational models capable to identify subgraphs within the ncRNA folding graph that are characteristic of biological functions and develop ways to integrate this information into some of the key algorithms for RNA structure analysis.

For degree: Master
Status: completed February 2012 ( pdf )

A Partition Function Variant of RNA Base Pair Maximization in ADP

Assigned to: Kerstin Schmatzer (ALU Freiburg)

The goal of the project is to lay the foundations of computing RNA base pair probabilities, as done by the Mc Caskill algorithm, in the framework of Algebraic Dynamic Programming (ADP). In order to concentrate on the essential aspects of this problem, we simplify the scoring model of the algorithm to a Nussinov-style base pair maximization. The main challenge is to compute the outside part, which has no natural correspondence in the grammar parsing framework underlying ADP.

For degree: Master
Status: completed 23.02.2012 ( pdf )

Generic JSP-based web frontend creation

Assigned to: Dragos-Alexandru Sorescu (ALU Freiburg)

Web frontends of terminal-based bioinformatics tools are important to ease their use for non-computer scientists and to enable ad hoc usage. The project aims at the development of a highly generic web frontend framework generalizing the currently available JSP-based frameworks of the CPSP-web-tools and Freiburg-RNA-tools. Main goals are to simplify the setup of new frontends for arbitrary terminal tools and to develop a robust generic framework. The integration is exemplified by creating a frontend for the recently developed program CARNA.

For degree: Project
Status: completed

Alignmentverbesserungen mit Hilfe von Consensus-Dotplots

Assigned to: Benjamin Schulz (ALU Freiburg)

RNA-alignments are essential for identifying and characterising structured non-coding RNA. RNA-alignments are different to DNA or protein alignments in the fact that they not only align according to sequence similarity, but also take the base-pairing patterns of secondary structures into account. A common procedure to characterise the structure of non-coding RNAs is to predict the consensus structure of elements of the same family. The problem with this, is that any errors in the alignment are reflected directly in the quality of the predicted consensus structure. Therefore, it is of high importance to get the correct alignment of RNA families. The largest database of such family alignments is contained in Rfam. A common error in these alignments is that a small subset has been misaligned with respect to the structure, which results in some stems slightly offset to either the left or right in comparison to the others.
The goal of this thesis is to develop a method to automatically detect and re-align these misaligned stems and to thus deliver a quick method to improve these common errors in the Rfam database. Furthermore, a key part of the work is to understand the state-of-the art in approaches to align RNA sequences and to perform benchmark experiments that compare current tools to the here developed method. It is also important to understand the complexity of measuring the "goodness" of one alignment and to develop and compare such measures.

For degree: Master
Status: completed 27.12.2011 ( pdf )

Local sequence and structure features in long RNA sequences

Assigned to: Hoor K. Al-hasani (ALU Freiburg)

There is much evidence in molecular biology that RNA plays an important role in living cells. Research results in the last decade have shown that protein coding sequences are only the tip of the iceberg w.r.t. genomic functional elements. Up to 90% of the genome is transcribed into RNA for which the function still remains largely unknown.
The structure of an RNA is an important property for its correct function, e.g. the cloverleaf of a tRNA. However, the experimental determination of the structure is still a very challenging task, therefore we try to deduce the structure from the nucleotide sequence, which encodes it. Furthermore, we find evidence that long RNAs have local regions of functionality and that the entire sequence does not always contribute to a particular function. For example cis regulatory elements on mRNA such as SECIS elements and miRNA binding sites.
In this project we want to analyse long RNA sequences in respect to different sequence and structure features. The project aims to identify signatures of natural RNAs and dependencies between RNA sequence and structure. Sequence features comprise the A,U,G and C content as well as di-nucleotide and tri-nucleotide content. In terms of structural features we want to consider accessibilities, base-pair probabilities, accuracies (MEA) and predictions from tools like RNALfold. Given these features, we want to identify dependencies between them and between different sequences.
First, the project involves a graph visualisation for the raw data of single features and different combinations of features. Because of the huge amount of data, we need to be able to focus or zoom into regions of interest. Further, we want to reduce the feature information to only regions of high significance in comparison to a background model. Thus, a suitable background model needs to be defined for each feature. With the simplified view, it should be easier to visually spot correlations between several features at once. After an initial visual inspection, automatic methods shall be developed to analyse real datasets of different RNA classes to identify distinct sequence and/or structure signals. First we would like to concentrate on known cis regulatory elements within the UTRs of mRNAs and finally we would like to apply the automatic analysis developed in this thesis to find unknown signals in long non-coding RNAs.

For degree: Master
Status: completed 08.11.2011 ( pdf )

RNA-Protein interaction prediction with Graph Kernels

Assigned to: Li Zaihng (ALU Freiburg)

Hundreds of RNA-binding proteins (RBPs) are known. The role of these proteins ranges from mRNA splicing, export, stability and translation regulation. Although RBPs form an important class of proteins, their binding preferences are not well characterized. It is believed that RBPs recognize RNA targets based not only on the sequence information but also on the secondary structure context of the binding. It can be argued that knowledge about RNA secondary structures when searching for the binding motif can be of fundamental importance for certain proteins. The aim of the thesis is to develop computational models capable to identify when the binding preferences of a protein is mainly based on sequence information and when additional structure information is needed. The models will belong to the family of kernel machines, since this class of algorithms allows to use heterogeneous features and to accept in input complex instances representation such as sequences or graphs.

For degree: Master
Status: completed October 2011 ( pdf )

Exparna-P

Assigned to: Noha Radwan (GUC Cairo)

The aim of this work is to help with the implementation and evaluation of the novel algorithm Exparna-P. This algorithm computes all exact pattern matchings between two RNA strands for the entire structure ensemble. In order to speed the algorithm up, a new method needs to be implemented which computes the probability that a position is unpaired under a loop. Then the already existing chaining algorithm has to be slightly modified in order to compute the best set of non-overlapping and non-crossing exact pattern matchings for Exparna-P. The third part of this bachelor thesis is the comparison of the performance of the Exparna-P tool compared to the Exparna tool.

For degree: Bachelor
Status: completed 15.09.2011 ( pdf )

Multiple sequence alignment methods of long non-coding RNAs

Assigned to: Suzana Ilinca Tudose (ALU Freiburg)

Long ncRNA is a rapidly advancing field of genetics, with yet only briely studied roles (in gene regulation), organization, conservation or medical implications. It is however expected that they will play a great role in further genetic studies and progress. Due to their (sometimes impressive) length (of up to several hundreds of kb) and other particularities, their sequences are rather difficult to align. However, valid sequence alignments are the essential pre-requisite for most subsequent bioinformatic studies of lncRNAs. Therefore, we analyse, compare and benchmark different alignment sets of vertebrate long ncRNAs, namely the Ensembl EPO alignmets, the Galaxy Multiz/TBA blocks and alignments generated by a self-developed pipeline and identify advantages and drawbacks of sequence alignments of lncRNAs.

For degree: Project
Status: completed 22.08.2011 ( summary.pdf , slides.pdf )

Evaluating contaminations in genomic sequences

Assigned to: Pavankumar Videm (ALU Freiburg)

Despite continued advances in whole genome sequencing techniques and the development of powerful assembly algorithms, newly sequenced genomes still often suffer from contaminations during the sequencing process. The most common sources of contamination are accessory DNAs deliberately attached to the DNA/RNA under investigation, including vectors, adapters, linkers, and PCR primers. However, there are also unintended events, e.g. caused by transposon activity or simply impurities, leading to contaminated genomic sequences. These may then result in missassemblies of genomic sequences, meaningless analyses and potentially erroneous conclusions. However, noone knows to which extent publicly available genomes are contaminated.

To encompass this unsatisfying situation we therefore plan to develop a comparative genomics approach to broadly identify contaminations in available genomic sequences. The project is not only open for bioinformaticians and computer scientists, it is also suitable for students with a background in biology.

For degree: Project
Status: completed 18.08.2011 ( summary.pdf , slides.pdf )

A new heuristic algorithm for IntaRNA for improved RNA-RNA interaction prediction

Assigned to: Mohamed Ezz El-Din El-Shaer (GUC Cairo)

The number of discovered ncRNAs(non-coding RNAs) that regulate target mRNAs by base pairing is growing fast. This demands for identification of the target mRNAs for those ncRNAs. Thus prediction of such interactions between ncRNAs and mRNAs became of great neccesity to help identify targets for known ncRNAs. A few computational algorithms for this purpose were developed to predict such interactions. While some of the algorithms were fast enough for genome-wide searches, they were not so accurate in predicting interactions between long RNAs. This is because they neglected an important factor for interaction formation which is the interacting site accessibility. IntaRNA considers site accessibility while maintaining the same time and space complexities of these fast algorithms. IntaRNA includes two algorithms, one that gives optimal results according to the Turner free energy model, but is time consuming with time complexity O(n2m2). The second algorithm is heuristic with time complexity O(nm) only, but does not give optimal results for all input sequences. In this thesis we present improvements over both algorithms of IntaRNA. First we modified the non-heuristic algorithm to model more accurately how RNAs are actually forming an interaction. It simulates - in the same order - the sequence of events in which interaction formation is thought to happen in real. The new implementation allows to forbid high energy barriers that might be encountered during interaction formation and that are less likely to be overcome. Second we improved the accuracy of the heuristic algorithm of IntaRNA, making it more accurate and reliable for use in biological researches, without significantly increasing its runtime and space requirements.

For degree: Bachelor
Status: completed 04.03.2011 ( pdf )

Development and Implementation of an Alignment Program for Canonical Pseudoknots

Assigned to: Bettina Hübner (ALU Freiburg)

At our lab, a general method to align various restricted classes of pseudoknots has been developed. The alignment scheme has also been implemented, but due to its generality, it is comparably slow and not suitable for many large scale practical applications. This work focuses on developing an efficient implementation of only one specialized instance of this scheme (The R&G pseudoknot class) that can be used in real practical scenarios.

The topic is suitable for people interested in algorithms, datastructures, software development, and C++ programming.

For degree: Bachelor
Status: completed 18.02.2011 ( pdf )

RNA Consensus Interaction Prediction

Assigned to: Cameron Smith (ALU Freiburg)

RNA-RNA interaction is a subject of considerable biological relevance as the binding of ncRNA to mRNA can affect both the transcription and translation of the bound mRNA and hence regulate gene expression. The accuracy and reliability of single sequence RNA structure prediction has been shown to increase significantly when the structure of an aligned set of RNA homologs is computed. As such, it is posited that by augmenting an existing RNA-RNA interaction prediction algorithm, that determines an interaction structure based only on thermodynamics, with a phylogenetic component a structure prediction of improved quality can be obtained. This thesis presents the theory, implementation and evaluation of an algorithm that combines thermodynamic and phylogenetic information to predict a consensus interaction structure on a set of aligned mRNAs and ncRNAs.

For degree: Master
Status: completed 07.01.2011 ( pdf )

Experimentelle und theoretische Untersuchungen zur Echtzeitanalyse Mikroarray-basierter RNA-Amplifikation

Supervisor: Daniel Matizcka / Rolf Backofen (Bioinformatics), Andreas Mader / Dr. Thomas Brandstetter (Biochip group), Prof. Dr. Stefan Rensing (Bioinformatics (Faculty of Biology))
Assigned to: Anselm Hoppmann (ALU Freiburg)

The Competitive Displacement Detection Method (CDDM) is a method for the determination of the concentration of an unlabeled target species within a DNA microarray. This is achieved by introducing fluorescently labeled oligonucleotides into the microarray that will compete with the target species for hybridization to the probes. In combination with the computation of binding kinetics for this competitive displacement reaction, the measurement of bound competitor can be used to deduce the concentration of the target.

The goal of this project is the adaptation of this approach to a NASBA based DNA microarray. Nucleic Acid Sequence Based Amplification (NASBA) is a method for the amplification of RNA sequences. This amplification is done within the microarray, obviating the need for external amplification reactions. With this method the final amount of amplified product is independent of the initial RNA concentration. Therefore the determination of target concentrations can only be done indirectly by considering the progress of the amplification reaction over time. This adaptation will consist of two main steps:

For degree: Diploma
Status: completed 01.2011 ( pdf )

Centroid-based identification of local RNA elements

Assigned to: Essam Abdel Moaty Abdel Hady (GUC Cairo)

In this thesis we try to tackle the problem of identifying local RNA elements in a genomewide scale. We employ a fast sparse algorithm to predict maximum expected accuracy structures based on base-pairing/unpairing probabilities. Moreover, we introduce a new locality definition and present an accuracy function reflecting this locality.
Base-pairing and base-unpairing probabilities can be efficiently computed using RNAplfold included in the Vienna package. Based on these probabilities, we identify structured regions that have high probabilities of containing significant local RNA motifs.
After that, we introduce our new program RNAMotid together with other included features that enables it to scan genome-wide sequences for structured regions. Moreover, we discuss how several modules were integrated together in our program to allow flexibility and optionality of the analysis.
Finally, we evaluate the performance of RNAMotid in identifying local RNA motifs embedded in randomly shuffled context. Before that, we apply an overall parameter training followed by a family-based parameter training. Then we discuss the factors that affect the performance of RNAMotid.

For degree: Bachelor
Status: completed 18.09.2010 ( pdf )

Exploring structural characteristics of mRNA target sites using local folding

Assigned to: Kyanoush Seyed Yahosseini (ALU Freiburg)

Messenger (m)RNA is regulated by many different classes of non-coding (nc)RNAs and proteins to either enhance or inhibit the process of translation into proteins. The complex mechanisms of how the molecules identify their specific target sites remains, largely, unknown or a speculation (e.g for miRNAs).

The aim of this topic is to explore general local structural features of experimentally verified ncRNA and protein target sites on the messenger mRNA, i.e. positional entropy, structural complexity, and accessibility. There already exists evidence of target sites being more accessible, which means the structure has less base pairs in that region and is thus "free" for further interaction. A large scale investigation should give some insight to the influences of these structural signals within and between various classes of target sites (e.g. miRNAs, siRNAs, bacterial sRNAs, etc).

The thesis involves the following tasks:

For degree: Bachelor
Status: completed 13.09.2010 ( pdf )

Folding simulations in side chain lattice protein models

Supervisor: Martin Mann
Assigned to: Mohamed Abou Hamra (GUC Cairo)

Side chain lattice protein models are a reasonable and necessary extension of the widely used backbone lattice protein models. To enable folding simulations a structural neighborhood relation, a so called move set, has to be defined that is utilizes that enable e.g. Monte-Carlo simulations of the folding process.
The thesis presents the K-local move set, a local move set defined generically for lattice protein models. The K-local move set is defined for both backbone and side-chain protein models via constraint satisfaction problems. The use of the constraint-based approach enabled its use for an arbitrary lattice. The K-local move set is then used for a simulation procedure for side-chain protein structures in the face-centered cubic lattice using real protein sequences and structures.

For degree: Bachelor
Status: completed 21.06.2010 ( pdf )

Infering RNA Stem-Loop descriptors from multiple sequence-structure alignments for an indexed-based RNA search method

Supervisor: Steffen Heyne / Sebastian Will / Michael Beckstette / Rolf Backofen
Assigned to: Baher Sabry Anis Salama (GUC Cairo)

RNA can be grouped into certain RNA families according structural and functional similarities. Currently, the Rfam 9.1 database (http://rfam.sanger.ac.uk) contains more than 1300 such families. We have already developed a fast index-based (with affix-trees) search method for RNAs. Here, the query is a descriptor and it consists of a stem-loop structure with possible wildcards at different positions. The more sequence information is given the faster is the underlying index-based search engine. On the other hand, if too much sequence information is given, related, but inexact matching stem-loop structure would not be found. Therefore, the goal of this bachelor thesis is to derive such descriptors from Rfam seed-alignments (or other multiple RNA sequence-structure alignments) too feed them into the search engine. If each necessary single descriptor gives a match within a certain region, one could infer a match of the underlying RNA family. A descriptor can been seen therefore as a necessary local motif of an RNA familiy.

For degree: Bachelor
Status: completed 05.09.2009 ( pdf )

Approximate pattern matching under generalised edit distance and extensions to suffix array library

Supervisor: Michael Beckstette / Sebastian Will
Assigned to: Abdallah El Guindy (GUC Cairo)

The approximate pattern matching problem is the problem of finding all occurences of a certain pattern in a usually much longer text allowing for a fixed error threshold in the matching. The problem has been studied extensively and many very good solutions were found. However, general enough instances of the problem, namely those allowing for generalised error functions, remain with without satisfactory algorithms. This thesis is an attempt to provide such a solution. The new method provided relies on the suffix array data structure to preprocess the text linearly and allow later for fast queries. The new algorithm has the two desirable features of having a fairly simple explanation and implementation and having space and time bounds independent of the size of the alphabet, allowing for arbitrarily large alphabets. Furthermore, the new algorithm handles wildcards quite well while retaining the same time and space worst-case complexities. The algorithms are compared on genuine genetic data from Zebrafish genome and the results are presented. Finally, a parallelized version of the algorithm is presented on CREW-PRAM model of computation. In addition to presenting the new algorithm, several contributions were made to an existing affix array library.

For degree: Bachelor
Status: completed 12.07.2009 ( pdf )

A Library for Index-based Bidirectional Pattern Search with an Application to RNA Structural Motifs

Supervisor: Michael Beckstette / Sebastian Will
Assigned to: Fernando Meyer (ALU Freiburg)

In dieser Masterarbeit präsentieren wir sowohl bekannte, als auch neue Algorithmen zur effzienten Konstruktion und Verwendung von Indexdatenstrukturen. Diese Datenstrukturen haben mannigfaltige Anwendungsmöglichkeiten im Bereich des String-processings. Insbesondere können durch sie Mustersuchen in indexierten Texten beschleunigt werden, wodurch sie eine wichtige Rolle in der Analyse biomolekulare Sequenzen wie z.B. DNA- (Desoxyribonukleinsäure), RNA- (Ribonukleinsäure) und Protein-Sequenzen, spielen.

For degree: Master
Status: completed 20.06.2009

Variations of the Sankoff-Algorithm with a Focus on Heuristics

Supervisor: Sebastian Will
Assigned to: Farah Majid Abdul Hameed (ALU Freiburg)

The combination of the alignment and secondary structure prediction solutions of two RNA sequences can significantly improve the accuracy of the structural predictions. The algorithm which simultaneously solves these problems tends to be computationally expensive like the original form "Sankoff Algorithm" (Sankoff, 1985). Thus, the methods which addressed this problem impose constraints that reduce the computational complexity by restricting the folding and/or alignment and thus make the Sankoff algorithm more practical. In this thesis, reviewing the different Sankoff-style methods in such a way that compares them corresponding to the Sankoff algorithm, through the parallels and differences. As well as, the focus is on the heuristics (i.e. the imposed constraints on the alignments and/or the structures) and comparing between them.

For degree: Master
Status: completed 15.06.2009 ( pdf )

Abstractions for barrier estimations in RNA energy landscapes

Assigned to: Sergiy Bogomolov (ALU Freiburg)

RNAs take part in diverse processes in cells. Energy landscapes can be used to characterize the structural space of an RNA and thus can help us to better understand the processes in which RNAs are involved. The task of estimating energy barriers in RNA landscapes is important in many practical problems such that kinetic RNA folding (Geis et al., 2008) and search for bistable RNA molecules (Flamm et al., 2001). A few approaches has been developed to solve this problem. They need to be improved in two ways: improve time complexity and, at the same time, improve the accuracy of estimations. This master thesis has a task of investigating possible solutions to above-mentioned problem. We apply shape abstraction to the barrier height estimation problem. In the master thesis a number of precise algorithms based on this abstraction have been developed and compared to already existing ones.

For degree: Master
Status: completed 27.05.2009 ( pdf )

Kinetics of RNA-RNA hybridization

Supervisor: Martin Mann / Anke Busch
Assigned to: Daniel Maticzka (ALU Freiburg)

There are two conceptually different approaches to predict probable structures of RNA molecules: thermodynamic and kinetic modeling of RNA folding. While purely thermodynamic approaches (e.g. RNAfold) solely consider thermodynamic properties to determine favourable structures, kinetic approaches (e.g. Kinfold) consider the structural changes over a timeframe in addition to their thermodynamic properties. While thermodynamic RNA structure prediction has been extended to RNA-RNA interactions, there doesn't seem to exist a kinetic modeling approach yet. The goal of my diploma thesis will be the implementation and evaluation of two kinetic folding algorithms for RNA-RNA interactions based on stochastic simulations using a Gillespie algorithm as well as macro states.

For degree: Diploma
Status: completed 20.04.2009 ( pdf )

Signifikanz von RNA-RNA Interaktionen und RNA Sequenz-Struktur Alignments

Assigned to: Benjamin Schulz (ALU Freiburg)

In dieser Arbeit wurden Signifikanzuntersuchungen für die am Lehrstuhl für Bioinformatik der Universität Freiburg entwickelten Programme LocARNA und IntaRNA angestellt. LocARNA bewertet anhand eines Sequenz-Struktur-Alignments die strukturelle, sowie sequenzielle Ähnlichkeit zwischen zwei ncRNA Sequenzen, IntaRNA bewertet mögliche Bindestellen zwischen ncRNA und mRNA, unter Berücksichtigung nicht nur der Sequenz, sondern auch der Sekundärstruktur der Sequenzen.
Es wurde analysiert, wie sich die ausgegebenen Bewertungen dieser zwei Programme in Abhängigkeit von den Eigenschaften Länge, AU gegen GC Anteil und minimaler freier Energie der eingegebenen Sequenzen verändern. Dazu wurden große Mengen an zufälligen Sequenzen erzeugt, die Verteilung bei Sequenzen mit gleichen Eigenschaften untersucht und geprüft, wie sich die Verteilung bei Variation der Länge und des AU zu GC Mengenverhältnisses ändert. Bei LocARNA wurde mit den Daten eine Support Vektor Maschine trainiert, die nun für Sequenzpaare die zu erwartende Verteilung angeben kann. Mit dieser Verteilung als Nullmodell ist es möglich, die P-Werte, und damit die Signifikanz, der von LocARNA ausgegebenen Bewertungen zu bestimmen.
Bei IntaRNA wurde festgestellt, dass die ncRNA Sequenzen einen Einfluss auf die Ausgabe haben, der sich nicht allein durch Länge, AU-Anteil und freier Energie erklären lässt. Hier sind weitere Untersuchungen nötig, bevor Gesetzmäßigkeiten bestimmt werden können mit denen die Signifikanz bewertet werden kann.

For degree: Bachelor
Status: completed 06.04.2009 ( pdf )

Effiziente Algorithmen zum paarweisen Sequenz-Struktur-Alignment unter Beachtung von Pseudoknoten

Supervisor: Sebastian Will / Mathias Möhl
Assigned to: Jörg Bruder (ALU Freiburg)
For degree: Diploma

Sehr viele Probleme, die sich mit Berechnungen von RNA beschäftigen, die Pseudoknoten enthalten, sind NP-hart. Am bekanntesten sind hier die Fälle der Strukturvorhersage und der Berechnung des Alignments. Im Bereich der Strukturvorhersage wurden schon verschiedenste Algorithmen vorgestellt. Diese beinhalten allerdings Einschränkungen was die Art der vorkommenden Pseudoknoten angeht, um effizient arbeiten zu können. Diese hier vorgestellte Arbeit leistet einen Beitrag zu einem neuen Algorithmus, der vom Lehrstuhl für Bioinformatik vorgestellt wurde. Mit seiner Hilfe ist es möglich, effizient Sequenz-Struktur-Alignments von RNA Sequenzen zu berechnen, die Pseudoknoten enthalten. Hierfür wird auf das Prinzip der dynamischen Programmierung zurückgegriffen.

Der Algorithmus besteht dabei im Wesentlichen aus zwei Teilen. Zuerst wird eine der beiden Sequenzen in einen Parsetree zerlegt und anschliessend das Alignment gebildet. Das Alignment profitiert hierbei von Einschränkungen auf bestimmte Klassen von Pseudoknoten auf die selbe Weise wie die jeweilige Strukturvorhersage. Dies hat zur Folge, dass die Komplexität des Alignments nur um einen linearen Faktor in Bezug auf den jeweiligen Vorhersagealgorithmus steigt.

Diese Arbeit beschäftigt sich mit dem ersten Teil, dem Berechnen einer Zerlegung der ersten Sequenz. Hier werden verschiedene Methoden untersucht, wie dies geschehen kann, sowie diese hinsichtlich ihrer Auswirkungen auf das spätere Alignment analysiert.

Status: completed 21.01.2009 ( pdf )

Multiples Sequenz-Struktur-Alignment von RNAs fester Eingabestrukturen mit konsistenzbasierter Erweiterung

Supervisor: Sebastian Will / Rolf Backofen
Assigned to: Joachim Krempel (ALU Freiburg)

For degree: Diploma
Status: completed 21.01.2009 ( pdf )

Partition Function Alignment of RNAs

Supervisor: Sebastian Will / Rolf Backofen
Assigned to: Tejal Joshi (University of Trento, Italy)

ncRNAs are observed to have important roles in transcription, translation and in post-translation activities. Computational detection of ncRNAs requires sophisticated methods which take into account structural conservation apart from sequence information. We present a new pairwise sequence-structure alignment algorithm, LocARNAp with an aim of obtaining more accurate multiple alignments than its ancestor LocARNA by using partition function of alignments and consistency based transformation.

LocARNAp is dynamic programming based sequence-structure alignment algorithm which computes posterior probabilities of edge alignment from the partition function of pairwise alignments. Obtained posterior probabilities are then consistency transformed to include information about other sequences, and thereby making an improvement in multiple alignment computed using mLocARNA.

We compare the multiple alignments generated by LocARNAp to those obtained from LocARNA, LARA, STRAL and FoldAlign using benchmark - BRAliBase 2.1�s datasets and extensively study the effect of each parameter setting on the alignment quality. The analysis of results suggests that our algorithm obtains overall better quality of results compared to its ancestor - LocARNA and other algorithms. While there is a huge scope of further improvements, LocARNAp develops a strong foundation for further research in this direction.

For degree: Master
Status: completed 25.08.2008

Ein Hybdridkinetik Ansatz für RNA Faltungswahrscheinlichkeiten

Supervisor: Rolf Backofen / Sebastian Will / Martin Mann
Assigned to: Hannes Kochniß (FSU Jena)

Es werden in der derzeitigen Forschung zwei wesentliche Ansätze für die RNA Faltungsvorhersage verwendet. Zum einen die direkte Simulation des Faltungsprozesses, bei der über viele Iterationen hinweg stochastisch Wahrscheinlichkeiten für verschiedene Faltungen/Strukturen bestimmt werden. Dies ist die am häufigsten genutzte, aber auch bei weitem aufwendigste Methode. Daher wird oft auf Kinetiken ausgewichen, welche auf einer Vereinfachung der Energielandschaft basieren. Energielandschaften sind hierbei eine diskrete Beschreibung des Strukturraums einer RNA, in dem sich der Faltungsprozess abspielt. Zum einen werden ganze Teile der Energielandschaft zu sogenannten Macrostates zusammengefasst, zum anderen wird die Landschaft oft vereinfachend durch BarrierTree repräsentiert wodurch Adjazenzinformationen zugunsten einer effizienten Berechnung verworfen werden.
In dieser Arbeit wird ein Hybridansatz vorgestellt, welcher die häufig verwendeten Macrostate- und Arrheniuskinetiken miteinander verknüpft. Für den unteren Bereich der Energielandschaft wird die Macrostatekinetik verwendet, während im oberen Bereich durch ein Sampling der Übergangshöhen eine Arrheniuskinetik möglich wird. Diese beiden Kinetiken arbeiten jedoch mit unterschiedlichen Zeitfaktoren, so dass eine Skalierung der jeweiligen Ratenmatrizen nötig ist, um die resultierende Ratenmatrix zu verwenden.
Die Arbeit untersucht Möglichkeiten der Hybridisierung beider Kinetiken, und zeigt grundsätzliche Limitierungen des Kinetikansatzes auf. Zudem wird eine Metrik für den Vergleich von Kinetiken vorgestellt, um optische Unterschiede im Faltungsverhalten zu quantifizieren. Dieses Mass wird schliesslich verwendet, um die Qualität der Hybridkinetik zu bewerten.

For degree: Diploma
Status: completed 13.08.2008 ( pdf )

Sampling von Folding funnels in diskreten Energielandschaften

Supervisor: Martin Mann / Rolf Backofen
Assigned to: Hannes Kochniß (FSU Jena)

Im Rahmen des Projektes soll eine Methode umgesetzt werden, um den folding funnel von Modellmolek�len mit diskreten Energielandschaften zu sch�tzen. Der zu erstellende Ansatz baut direkt auf vorangegangenen Arbeiten auf und erg�nzt diese um neue Datenstrukturen und Methoden.

Die Implementierung soll auf der C++ Programmierbibliothek Energy Landscape Library (ELL) aufbauen und diese ggf. erg�nzen. Zudem soll ein standalone Programm entwickelt werden, mit dem direkt folding funnel Studien erm�glicht werden. F�r einige gegebene, vorhandene Modelle sollen hierzu die gesampelten Daten mit bestehenden exakten Studien verglichen werden, um Qualit�t und Laufzeit des neuen Ansatzes zu bestimmen..

F�r Hintergrunddetails siehe pdf-Version der Projektbeschreibung.

Detailed Description: pdf (de) - (For a description in english please contact the supervisor.)

For degree: Project
Status : completed

Constraint Approach for Protein Structure Prediction in the Side Chain HP Model

Supervisor: Rolf Backofen / Sebastian Will / Martin Mann
Assigned to: Mohamad Rabbath (ALU Freiburg)

Protein structure prediction has been always a very interesting problem to solve, especially in the last 10 years. Many previous methods tried to focus on HP model prediction, however most of those methods have the drawback of giving approximate solutions. Another drawback of most of those works is that they ignore the representation of the side chains of the amino acids.

In this master thesis, we develop an approach of a concrete constraint model that takes the side chains of the amino acids into consideration and gives the exact solutions for a given sequence in the side chain HP model in terms of minimizing the energy, without any approximation.

In this work we also present the obtained results of the predication model and show some important statistics especially about the degeneracy. In addition to that we present some interesting results on generating protein like sequences, although this is a hard task in the model. The protein like sequences can be found in a very low probability in random sequences in the side chain model as we explain in this thesis.

For degree: Master
Status: completed 20.06.2008 ( pdf )

Core Construction in the Cubic Lattice

Supervisor: Sebastian Will / Martin Mann
Assigned to: Mohammad Rabbath

Im Projekt soll ein Constraint-Programm zur Konstruktion kompakter Punktmengen im kubischen Gitter implementiert werden (Kernkonstruktion). Es soll dabei die C++-Bibliothek Gecode zur Anwendung kommen. Das Problem ist ein wesentlicher Baustein der constraint-basierten Proteinstrukturvorhersage in vereinfachten Gittermodellen.

Für das kombinatorische Problem der Kernkonstruktion existieren bereits ausführliche theoretische Vorarbeiten, sowie eine Implementierung in Mozart/Oz. Es geht daher ausdrücklich um eine effiziente Reimplementierung. C++- oder Javakenntnisse, sowie Erfahrung mit Constraintprogrammierung, bzw. die Bereitschaft zur tieferen Einarbeitung in diese Technik sind für das Projekt vorausgesetzt.

Detailed Description: pdf (en)

For degree: Project
Status: completed

Combining the results of different motif discovery programs for de novo prediction of TFBS - A critical approach

Supervisor: Michael Beckstette
Assigned to: Thomas Engleitner

The project tackled the question : Can we trust the results of tools for de novo motif (TFBS) detection? If not, how can we improve the results?


For degree: Project
Status: completed 14.04.2008 ( final talk ppt )

Strukturraumanalyse von Gitterproteinen unter Verwendung von Pull Moves

Supervisor: Martin Mann / Rolf Backofen
Assigned to: Daniel Maticzka

Das Projekt beschäftigt sich mit Faltungsprozessen von Proteinstrukturen. Im besonderen werden vereinfachte Gitterproteine betrachtet, um die Komplexität der wirkenden Kräfte einzuschränken. Das Konzept diskreter Energielandschaften bildet die Grundlage der Studie und liefert das notwendige Framework.

Um Faltungsprozesse zu beschreiben müssen geeignete Umwandlungsoperationen auf Proteinstrukturen definiert werden. Eine Möglichkeit ist die Anwendung sogenannter Pull Moves, die lokale Änderungen der Struktur bewirken.

Im Rahmen des Projektes sollen die Pull Move Operationen in die generische C++ Programmierbibliothek Energy Landscape Library (ELL) integriert werden. Darauf basierend soll anschliessend ein Standalone Programm entwickelt werden, welches für eine gegebene Sequenz Faltungssimulationen ermöglicht. Für einen gegebenen Satz von Testsequenzen soll anschliessend mit Hilfe des Tools eine Klassifikation in 'gut' und 'schlecht' faltende Sequenzen erreicht werden. Hierzu müssen passende Parameter und Kriterien gefunden werden.


For degree: Project
Status: completed 07.04.2008 ( pdf )

Pairwise Comparison of RNA Secondary Structures via Exact Pattern Matches

Supervisor: Rolf Backofen / Sven Siebert
Assigned to: Steffen Heyne (FSU Jena)

In this thesis we have developed two pairwise comparison methods on the basis of exact matching substructures, called exact pattern matches. In a first step, a set of overlapping and crossing substructures for two nested RNA secondary structures is found with the approach of pairwise common substructures from Siebert/Backofen. Our first method deals with the task to identify the best global subset of Non-Crossing exact pattern matches for two given RNAs. In relation to the LAPCS problem, we call this problem the Longest Common Subsequence of Exact RNA Patterns. The developed dynamic programming algorithm needs O(n�m�) time and O(nm) space. Our second approach detects (local) clusters of exact pattern matches. A cluster is a Non-Crossing arrangement of exact pattern matches with a distance constraint between the substructures included in a cluster. The developed clustering strategy to find clusters is fast and flexible enough for different analytical problems. We have tested both methods with two Hepatitis C virus RNAs and two 16s ribosomal RNAs. The results show that both methods are able to identify significant similarities between two RNA secondary structures in a fast way.

For degree: Diploma
Status: completed 07.11.2007 ( pdf )

Identifying Key Regulators in Genetic Networks

Supervisor: Rolf Backofen / Sebastian Will / Martin Mann
Assigned to: Amira Gamal El-Din (German University in Cairo)

Genetic Regulatory Networks are a method of representing the complex assemblages of interconnected genes, proteins and other molecules. Components are represented as nodes, and activations and repressions between them are represented as labeled edges. Within these networks are Key Regulators; these are nodes capable of regulating a sub-network of the network. The task at hand is to present a model for genetic regulatory networks and to implement a means to identify the most suitable key regulator within the network. In addition, cycles representing positive or negative feedback loops increase the complexity of the task. The model is further refined by considering constraints such as choosing a key regulator that regulates a certain sub-network but has no effect on another sub-network.

For degree: Bachelor
Status: completed 30.08.2007 ( pdf )

3D-Structure-Motifs Aware Sequence Structure Alignment of RNAs

Supervisor: Rolf Backofen / Sebastian Will
Assigned to: Mounir Stino (German University in Cairo)

Comparison of RNAs is mainly based on information about the sequence and their secondary structure. The function of RNAs on the other hand is based on their 3D-Structure, which is hard to determine. However, there are wide-spread 3D motifs which can be identified more easily. Such a motif can be defined, due to Eric Westhof, as an ordered assembly of non-watson-crick basepairs within a helix. Current sequence structure alignment methods are not aware of such motifs. However, these motifs can give strong guidance for such alignments. The project's aim is to integrate the knowledge about motifs into the recent tool LocARNA, which is a program for simultaneous folding and alignment of RNAs. The dynamic programming algorithm should be modified to detect the motifs and tested on biological data.

For degree: Bachelor
Status: completed 29.08.2007 ( pdf )

Exploration of biopolymer energy landscapes via random sampling

Supervisor: Rolf Backofen / Sebastian Will
Assigned to: Andreas S. Richter (FSU Jena)

The structures of RNA molecules and proteins, which are both important biopolymers, are commonly assumed to be uniquely determined by their sequences. The structures of these biomolecules are in turn necessary to carry out the molecules' biological functions. Discretized structure models provide a coarse-grained description of the molecular structure, which is necessary to perform computational studies. In this research, RNA molecules were modeled as secondary structures for RNA, and proteins were modeled as self-avoiding walks on a lattice. The structure formation process of biopolymers is crucially determined by the properties and the topology of the underlying energy landscape, in which the folding proceeds. Typical characteristics of the energy landscape, like the number of local optima, the basin distribution as well as the transition states between the optima, can be visualized by barrier trees. Barrier trees provide a reduced representation of energy landscapes, which can be used to study the dynamical behavior of biopolymer folding.
The research described in this thesis aimed to present a generic, problem-independent approach for the generation of barrier trees.

For degree: Diploma
Status: completed 19.07.2007 ( pdf )

Efficient solving of alignment-problems with side conditions using constraint techniques

Supervisor: Rolf Backofen / Sebastian Will
Assigned to: Sebastian Barth (ALU Freiburg)

Many problems and open questions of modern microbiology demand the comparison of RNA, DNA or protein sequences. The computational effort in performing these calculations is high and microbiology urges the development of efficient tools. An important requirement is the capability of these tools to deal with certain constraints. Examples might be a specified number of matches of two compared sequences within certain regions or a consideration of the secondary structure of a proteine. Subject of this thesis is the efficient alignment of sequences, where we focus especially on different constraints on the alignment and how to combine these constraints, even in multiple alignments.

For degree: Diploma
Status: completed 10.05.2007 ( pdf )

Vollständige Aufzählung der optimalen Strukturen von Gitterproteinen durch dynamische Zerlegung des assoziierten Constraint Satisfaction Problems

Supervisor: Rolf Backofen / Sebastian Will
Assigned to: Martin Mann (FSU Jena)

The standard depth first search (DFS) method to solve Constraint Satisfaction Problems (CPSs) shows much redundant work if the CSP contains several unsolved independent partial problems. This is a frequent observation when constraint programming is applied to solve the structure prediction problem in the HP model. Within this thesis, an implementation of a dynamically decomposing search strategy is developed and implemented. The resulting prototypical implementation is applied to the structure prediction problem and yields significant speedups compared to standard DFS. This speedup is neccessary e.g. to allow for high throughput degeneracy calculation of lattice proteins in the HP model.

For degree: Diploma
Status: completed 30.01.2006 ( pdf )

MuLoRa - Ein Ansatz für multiple, lokale RNA-Sequenz-Struktur-Alignments

Supervisor: Rolf Backofen / Sebastian Will
Assigned to: Wolfgang Otto (FSU Jena)

For degree: Diploma
Status: completed 03.09.2005 ( pdf )

Merkmalauswahlverfahren zur Lokalisierung der Bindungsstellen von Transkriptionsfaktoren

Supervisor: Rolf Backofen / Rainer Pudimat
Assigned to: Sven Schütze (FSU Jena)

For degree: Diploma
Status: completed 19.07.2005 ( pdf )

Signifikanz von RNA Sequenz-Struktur-Motiven

Supervisor: Rolf Backofen / Sven Siebert
Assigned to: Marian Thieme (FSU Jena)

For degree: Diploma
Status: completed