diArk - a resource for eukaryotic genome research


Since the publication of the first complete genome sequence of a eukaryote, Saccharomyces cerevisiae [1], the genome sequencing community has produced highly advanced drafts of many other eukaryotic genomes. The past few years have thus seen the rise of a completely new field in biology, comparative genomics [2]. Initial results have shown that whole-genome comparisons are important for improving the annotation of genes and transcripts. It has also been demonstrated that genome sequences are needed not only from organisms spread over all kingdoms of eukaryotic life but also from many closely related organisms [3].
We have developed diArk (digital ark), which provides information on eukaryotic sequencing projects that have resulted either in at least preliminary assemblies of genome data or in a substantial amount of EST or cDNA data [4,5,6]. At the center of the database is extensive species-related information (commonly and alternatively used scientific names, common names, and complete taxonomies) together with detailed information about the respective sequencing projects. Apart from keeping the data up to date, our focus has been on a feature-rich user interface with comprehensive and easy-to-use search capabilities.

[1]A Goffeau, BG Barrell, H Bussey, RW Davis, B Dujon, H Feldmann, F Galibert, JD Hoheisel, C Jacq, M Johnston et al: Life with 6000 genes. Science 1996, 274:563-7.
[2]TT Binnewies, Y Motro, PF Hallin, O Lund, D Dunn, T La, DJ Hampson, M Bellgard, TM Wassenaar, DW Ussery: Ten years of bacterial genome sequencing: comparative-genomics-based discoveries. Funct Integr Genomics 2006, 6:165-185.
[3]JE Galagan, MR Henn, LJ Ma, CA Cuomo, B Birren: Genomics of the fungal kingdom: insights into eukaryotic biology. Genome Res 2005, 15:1620-31.
[4]F Odronitz, M Hellkamp, M Kollmar: diArk – a resource for eukaryotic genome research. BMC Genomics 2007, 8:103.
[5]B Hammesfahr, F Odronitz, M Hellkamp, M Kollmar: diArk 2.0 provides detailed analyses of the ever increasing eukaryotic genome sequencing data. BMC Res Notes 2011, 4:338.
[6]M Kollmar, L Kollmar, B Hammesfahr, D Simm: diArk - the database for eukaryotic genome and transcriptome assemblies in 2014. Nucleic Acids Res 2014, Epub ahead of print.

CyMoBase - a database for cytoskeletal and motor proteins


Annotation of protein sequences of eukaryotic organisms is crucial for understanding their function in the cell. Manual annotation is still by far the most accurate way to predict genes correctly. The classification of protein sequences, their phylogenetic relation, and the assignment of function involve information from various sources. This often leads to a collection of heterogeneous data that is hard to track. Cytoskeletal and motor proteins form large and diverse superfamilies comprising up to several dozen members per organism. Since genome sequence data is accumulating rapidly, it is very important to have a reference database for the nomenclature and phylogenetic relation of these proteins that allows the most accurate assignment of biological function possible. CyMoBase is a protein sequence-centric web application to store, organize, interrelate, and present the heterogeneous data generated during manual genome annotation and comparative genomics [1]. It offers many analysis tools, such as extensive statistics and a BLAST service.

[1]F Odronitz, M Kollmar: Pfarao: A web application for protein family analysis customized for cytoskeletal and motor proteins (CyMoBase). BMC Genomics 2006, 7:300.

Scipio - eukaryotic gene identification


In the post-genome era, sequence data is the entry point for many studies. It is often important to obtain the correct genomic DNA sequences of eukaryotic genes because of the information contained in non-coding regions. For example, intron regions contain important sites for the regulation of gene transcription, such as enhancers, repressors, and silencers [1]. Determining the exon/intron structures of genes is also important in comparative genomic analyses such as the identification of ancient exons [2].
Currently, two programs are available for the retrieval of non-coding sequence. The Java application Retrieval of Regulative Regions (RRE) parses annotation and homology data from NCBI [3]. RRE requires local installation and a local copy of the desired genomes and annotation files. The web application of RRE hosts only a small number of eukaryotic genomes and only annotation data from NCBI. The more recently published non-coding sequences retrieval system (NCSRS) [4] covers 16 genomes with annotation data from both NCBI and Ensembl. In summary, both tools only parse annotation files provided by NCBI and Ensembl for a few organisms.
We have developed Scipio for the retrieval of the genome sequence corresponding to a protein query. The tool does not require any annotation data and is able to identify the gene correctly even if it is spread across several genome contigs and contains mismatches and frameshifts. Because of its post-processing capabilities, Scipio can identify not only the gene in the genome corresponding to the protein query but also the homologous genes in the genomes of closely related organisms.
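
To make the notion of an exon/intron structure concrete, the following minimal Python sketch derives intron positions and intron phases from a list of coding exon coordinates on a single contig. It is an illustration only and does not reproduce Scipio's algorithm; the function name, the coordinate convention (0-based, half-open, forward strand), and the assumption that the first exon starts at the first coding base are all choices made for this example.

```python
def introns_from_exons(exons):
    """Derive introns from coding exons given as (start, end) intervals.

    Assumptions for this sketch: coordinates are 0-based and half-open,
    all exons lie on the forward strand of one contig, and the first
    exon begins with the first base of the coding sequence.
    """
    ordered = sorted(exons)
    introns = []
    coding_length = 0  # number of coding bases seen before the current intron
    for (start, end), (next_start, _next_end) in zip(ordered, ordered[1:]):
        coding_length += end - start
        introns.append({
            "start": end,                 # first intronic base
            "end": next_start,            # first base of the following exon
            "phase": coding_length % 3,   # 0: between codons, 1 or 2: within a codon
        })
    return introns
```

For the hypothetical exons [(0, 10), (50, 61), (100, 120)], the sketch reports two introns, 10 to 50 in phase 1 and 61 to 100 in phase 0.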

[1]L Fedorova, A Fedorov: Introns in gene evolution. Genetica 2003, 118:123-31.
[2]M Irimia, JL Rukov, D Penny, SW Roy: Functional and evolutionary analysis of alternatively spliced genes is consistent with an early eukaryotic origin of alternative splicing. BMC Evol Biol. 2007, 7:188.
[3]F Lazzarato, G Franceschinis, M Botta, F Cordero, RA Calogero: RRE: a tool for the extraction of non-coding regions surrounding annotated genes from genomic datasets. Bioinformatics 2004, 20:2848-2850.
[4]ST Doh, Y Zhang, MH Temple, L Cai: Non-coding sequence retrieval system for comparative genomic analysis of gene regulatory elements. BMC Bioinformatics 2007, 8:94.

Kassiopeia - analysing mutually exclusive exomes


Alternative splicing is an important process in higher eukaryotes that allows several transcripts to be generated from one gene. One type of alternative splicing is mutually exclusive splicing, in which exactly one exon out of a cluster of neighbouring exons is spliced into the mature transcript. Mutations in even one of these exons can lead to human diseases. Recently, we have developed a new algorithm for the prediction of these exons based on the preconditions that the exons of the cluster have similar lengths, sequence homology, and conserved splice sites, and that they are translated in the same reading frame [1].
Kassiopeia is a web application for the generation, storage, and presentation of genome-wide analyses of mutually exclusive exomes [2]. In the prediction pipeline, relaxed parameters were used so that very divergent exons are also identified, at the expense of potentially including false positive predictions. However, Kassiopeia provides many filters to adjust these parameters and narrow down the results at the analysis level. Kassiopeia also offers several search options so that users can analyse the data from the whole-genome scale down to the single-gene case. In order to provide test data we predicted the mutually exclusive exomes of twelve sequenced Drosophila species, of the plant Arabidopsis thaliana, of the nematode Caenorhabditis elegans, and of human.
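
The four preconditions named above can be pictured as a simple pairwise test on candidate exons. The Python sketch below is only an illustration under stated assumptions and is not the published algorithm: the Exon fields, the ungapped identity measure, and the default thresholds (max_length_diff, min_identity) are hypothetical and were chosen for readability.

```python
from dataclasses import dataclass


@dataclass
class Exon:
    length: int     # exon length in nucleotides
    phase: int      # reading-frame phase at the exon start (0, 1, or 2)
    acceptor: str   # intronic dinucleotide upstream of the exon, e.g. "AG"
    donor: str      # intronic dinucleotide downstream of the exon, e.g. "GT"
    peptide: str    # translation of the exon


def ungapped_identity(p, q):
    """Fraction of identical residues over the shorter peptide (no gaps).
    A real implementation would use a proper sequence alignment."""
    n = min(len(p), len(q))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(p, q)) / n


def is_candidate_pair(a, b, max_length_diff=20, min_identity=0.3):
    """Check the four preconditions with hypothetical thresholds."""
    similar_length = abs(a.length - b.length) <= max_length_diff
    # Swapping one exon for the other must not shift the downstream frame.
    same_frame = a.phase == b.phase and a.length % 3 == b.length % 3
    same_splice_sites = a.acceptor == b.acceptor and a.donor == b.donor
    homologous = ungapped_identity(a.peptide, b.peptide) >= min_identity
    return similar_length and same_frame and same_splice_sites and homologous
```

Lowering min_identity or raising max_length_diff corresponds to relaxing the prediction parameters, while tightening them afterwards mimics filtering at the analysis level.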

[1]H Pillmann, K Hatje, F Odronitz, B Hammesfahr, M Kollmar BMC Bioinformatics 2011, 12:270.
[2]K Hatje, M Kollmar BMC Genomics 2014, 15:115.

Waggawagga - coiled-coil and single-alpha-helix domain prediction


Coiled-coils are characterized by contiguous heptad repeats, which can be depicted in the form of a net-diagram. From this representation a score has been developed that enables the discrimination between coiled-coil domains and single α-helices. The software was implemented as a web application and comes with a user-optimized interface [1]. The user can run applications for the sole prediction of coiled-coils and applications for the prediction of oligomerisation states. The query sequence is visualized as a helical wheel-diagram of parallel or anti-parallel homodimers, or parallel homotrimers, and as a heptad-net-diagram. In addition, the SAH-score is calculated for each prediction. Considered together, this information provides an indication of whether the structural motifs have been predicted correctly. The results of the application can be stored, exported to files, and restored for later analysis.
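
As a small illustration of the heptad and helical-wheel representations mentioned above, the following Python sketch assigns heptad positions (a to g) to a sequence and computes the angular position of each residue on a wheel. It is not Waggawagga's code and does not compute the SAH-score. The function names, the hydrophobic residue set, and the default of 100 degrees per residue (the textbook value for an ideal α-helix) are assumptions made for this example.

```python
HEPTAD = "abcdefg"
# Assumed set of hydrophobic residues; other definitions are in use.
HYDROPHOBIC = set("AILMFVWY")


def heptad_register(sequence, start="a"):
    """Pair every residue with its heptad position, beginning at `start`."""
    offset = HEPTAD.index(start)
    return [(res, HEPTAD[(offset + i) % 7]) for i, res in enumerate(sequence)]


def wheel_angles(n_residues, degrees_per_residue=100.0):
    """Angle of each residue on a helical wheel, in degrees from the first residue."""
    return [(i * degrees_per_residue) % 360.0 for i in range(n_residues)]


def core_hydrophobic_fraction(sequence, start="a"):
    """Fraction of hydrophobic residues at the a and d positions, which
    typically form the hydrophobic core of a coiled-coil."""
    core = [res for res, pos in heptad_register(sequence, start) if pos in "ad"]
    if not core:
        return 0.0
    return sum(res in HYDROPHOBIC for res in core) / len(core)
```

With the defaults, wheel_angles(4) gives 0, 100, 200, and 300 degrees, and the hypothetical heptad "LKELEDK" in register a has hydrophobic residues at both a and d.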

[1]D Simm, K Hatje, M Kollmar Bioinformatics 2015, 31:5.
[2]D Simm, K Hatje, M Kollmar PLoS ONE 2017, 12:4.

Bagheera - predicting CUG codon translation in yeasts


The translation of DNA sequence into protein is not the same for all organisms. Some yeasts use an alternative translation of the leucine codon CUG. In the Schizosaccharomyces and the Saccharomyces clades CUG is translated as leucine, while species of the “Candida clade” translate CUG as serine. Species names are not sufficient to assign the correct translation scheme. In particular, some species whose teleomorph or anamorph is named Candida translate CUG as leucine. In the figure, all species with a teleomorph or anamorph named Candida are highlighted in bold.

This webserver is designed to detect the most probable translation scheme of a given species [1].
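
The practical effect of the two translation schemes can be shown with a short Python sketch that translates the same coding sequence once with the standard genetic code and once with the alternative yeast nuclear code, in which CUG (CTG at the DNA level) encodes serine. This is an illustration only and does not reflect how the webserver predicts the scheme; the function and variable names are chosen for this example.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard genetic code (NCBI translation table 1),
# ordered by first, second, and third codon base in T, C, A, G order.
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), STANDARD_AAS)
}

# Alternative yeast nuclear code: identical codon assignments except CTG -> serine.
ALT_YEAST_CODE = dict(STANDARD_CODE, CTG="S")


def translate(cds, code):
    """Translate a coding sequence codon by codon; unknown codons become 'X'."""
    cds = cds.upper().replace("U", "T")
    return "".join(code.get(cds[i:i + 3], "X") for i in range(0, len(cds) - 2, 3))
```

For example, translate("ATGCTGTAA", STANDARD_CODE) yields "ML*", whereas translate("ATGCTGTAA", ALT_YEAST_CODE) yields "MS*".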

[1]S Mühlhausen, M Kollmar BMC Genomics 2014, 15:411.

Peakr - simulating solid-state NMR spectra of proteins


When analyzing solid-state nuclear magnetic resonance (NMR) spectra of proteins, assignment of resonances to nuclei and derivation of restraints for 3D structure calculations are challenging and time-consuming processes. Simulated spectra that have been calculated based on, for example, chemical shift predictions and structural models can be of considerable help. Existing solutions are typically limited in the type of experiment they can consider and difficult to adapt to different settings.
Here, we present Peakr, a software tool for simulating solid-state NMR spectra of proteins [1]. It can generate simulated spectra based on numerous common types of internuclear correlations relevant for assignment and structure elucidation, can compare simulated and experimental spectra, and produces lists and visualizations useful for analyzing measured spectra. Compared with other solutions, it is fast, versatile, and user-friendly.

[1]R Schneider, F Odronitz, B Hammesfahr, M Hellkamp, M Kollmar Bioinformatics 2013, 29:1134-1140.

ShereKhan - calculating molecular exchange rates


The dynamics governing the function of biomolecules are usually described as exchange processes and can be monitored at atomic resolution with nuclear magnetic resonance (NMR) relaxation dispersion data. Here, we present ShereKhan, a new tool for the analysis of CPMG relaxation dispersion profiles [1]. The web interface to ShereKhan provides a user-friendly environment for the analysis.
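
To give an idea of what a CPMG relaxation dispersion profile looks like, the following Python sketch evaluates one widely used model for two-site exchange in the fast-exchange limit (the Luz-Meiboom equation). Which models ShereKhan actually fits is not stated here, so this is an assumption made purely for illustration; the function and parameter names are likewise chosen for this example.

```python
import math


def r2eff_fast_exchange(nu_cpmg, r2_0, phi_ex, k_ex):
    """Effective transverse relaxation rate in the fast-exchange limit.

    nu_cpmg: CPMG frequency in Hz
    r2_0:    relaxation rate without exchange contribution, in 1/s
    phi_ex:  p_a * p_b * (delta omega)^2, in rad^2/s^2
    k_ex:    exchange rate, in 1/s
    """
    x = k_ex / (4.0 * nu_cpmg)
    # Equivalent to 1 - (4 * nu_cpmg / k_ex) * tanh(k_ex / (4 * nu_cpmg)).
    return r2_0 + (phi_ex / k_ex) * (1.0 - math.tanh(x) / x)
```

In this model the exchange contribution approaches phi_ex / k_ex at low CPMG frequencies and vanishes at high frequencies, which is the dispersion that a fit of k_ex relies on.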

[1]A Mazur, B Hammesfahr, C Griesinger, D Lee, M Kollmar Bioinformatics 2013, 29:1819-1820.
MPI for biophysical chemistry