Computational Genomics
Software
(Mostly in reverse chronological order).
Read Cloud Alignment
RFA (Random Field Aligner). A method for aligning barcoded reads generated by a read cloud protocol such as 10X or Moleculo. Reads of the same barcode are aligned jointly to the reference genome. As a result, many of the repeats in the genome can be mapped.
Cancer Cell Lineage Phylogeny Inference
LICHeE: multi-sample cancer phylogeny reconstruction. LICHeE automates the phylogenetic inference of cancer progression from multiple somatic samples. LICHeE uses variant allele frequencies of somatic SNVs to reconstruct multi-sample cell lineage trees and infer the subclonal composition of the samples.
POMEGRANATE: cancer lineage tree simulator. POMEGRANATE simulates tumor progression from normal tissue, producing a branching hierarchy of monoclonal cell populations in accordance with the branched-tree cancer evolution model. Multiple samples can then be collected from the resulting lineage tree(s), each sample consisting of one or several cell populations.
Populations of Genomes
Reveel: population variant calling and genotyping using low-coverage sequencing. Reveel is an ultrafast tool for single nucleotide variant calling and genotyping of large cohorts that have been sequenced at low coverage.
BWBBLE: short read alignment to a population of genomes. A BWT-based aligner that maps short reads to a compressed linear representation of a collection of genomes (the reference 'multi-genome'). Alignment is not biased towards a single reference genome, and is much faster than aligning to each genome individually.
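As a rough illustration of what "BWT-based" means for an aligner such as BWBBLE, the sketch below counts exact occurrences of a read in a single reference string using backward search over a Burrows-Wheeler transform. It is a simplified teaching example, not BWBBLE's implementation: it indexes one sequence rather than a multi-genome, ignores ambiguity codes and mismatches, and builds the suffix array by naive sorting. The function names build_bwt_index and count_occurrences are hypothetical.

```python
from collections import Counter


def build_bwt_index(text):
    """Build a toy FM-index (C table, occurrence table, length) for text."""
    text = text + "$"  # sentinel that sorts before A, C, G, T
    # Naive suffix array: fine for a demo, far too slow for real genomes.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT character is the one preceding each sorted suffix (wraps to "$").
    bwt = "".join(text[i - 1] for i in suffix_array)

    # first_pos[c] = number of characters in text strictly smaller than c.
    counts = Counter(text)
    first_pos = {}
    total = 0
    for c in sorted(counts):
        first_pos[c] = total
        total += counts[c]

    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return first_pos, occ, len(bwt)


def count_occurrences(pattern, index):
    """Count exact matches of pattern by backward search over the BWT."""
    first_pos, occ, n = index
    lo, hi = 0, n  # current suffix-array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in first_pos:
            return 0
        lo = first_pos[ch] + occ[ch][lo]
        hi = first_pos[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_bwt_index("ACGTACGA")
print(count_occurrences("ACG", index))
```

Each step of the search extends the match by one character to the left and narrows the suffix-array interval, so the cost depends on the read length rather than on the size of the indexed text.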
Identity by Descent (IBD) and Relatedness
Parente and Parente2. An ultrafast method for detecting relatedness and finding segments of identity-by-descent (IBD) across a population. Parente2 is accurate in detecting IBD segments with resolution up to 2 cM. See also SpeeDB, a tool employed in the Parente2 pipeline for further speed improvement.
SpeeDB: a filter for IBD detection. SpeeDB significantly increases the efficiency of IBD detection in large-scale unphased genotype data sets by rapidly screening out genomic regions that are unlikely to be IBD. The remaining genomic regions can be passed on to traditional IBD inference methods.
CARROT (ClAssification of Relationships with ROTations) is a tool for relationship inference that leverages linkage disequilibrium to differentiate between rotated relationships, such as (first-, second-, etc.) uncle-niece.
Ancestry Inference
ALLOY: ancestry painting. ALLOY is a fast and accurate method for painting the population ancestries of an individual along their chromosomes. It combines the explicit haplotype modeling of our earlier HAPAA tool with a compressed representation of each population using BEAGLE.
HAPAA (HMM-based Analysis of Polymorphisms in Admixed Ancestries). HAPAA is our earlier HMM-based program that performs ancestry painting using a model in which every haplotype in the model populations is explicitly expressed as a chain in the HMM.
Constrained Element Prediction
GERP++: Genome Evolutionary Rate Profiling. Given a multiple genomic sequence alignment, GERP++ identifies constrained elements by finding statistically significant runs of alignment columns with a substitution "deficit": fewer base pair substitutions than expected given the total tree branch length.
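The notion of a substitution "deficit" can be sketched in a few lines. The example below assumes that per-column expected and observed substitution counts are already available, and it reports maximal runs of consecutive columns whose deficit (expected minus observed) is positive and sums to at least a chosen threshold. This is a deliberately simplified stand-in: the function name deficit_runs, the fixed min_score cutoff, and the input values are hypothetical, and GERP++'s actual element-finding and significance testing are not reproduced here.

```python
from typing import List, Tuple


def deficit_runs(
    expected: List[float], observed: List[float], min_score: float
) -> List[Tuple[int, int, float]]:
    """Return (start, end_exclusive, total_deficit) for each maximal run of
    columns with positive deficit whose summed deficit is >= min_score."""
    if len(expected) != len(observed):
        raise ValueError("expected and observed must have the same length")

    runs = []
    start = None  # index where the current run began, or None
    total = 0.0   # summed deficit of the current run
    for i, (exp, obs) in enumerate(zip(expected, observed)):
        deficit = exp - obs
        if deficit > 0:
            if start is None:
                start, total = i, 0.0
            total += deficit
        else:
            # A column without a deficit closes the current run.
            if start is not None and total >= min_score:
                runs.append((start, i, total))
            start = None
    if start is not None and total >= min_score:
        runs.append((start, len(expected), total))
    return runs


# Hypothetical per-column rates for a six-column alignment.
expected = [2.0, 2.0, 2.0, 2.0, 2.0, 2.0]
observed = [2.5, 0.5, 0.0, 3.0, 1.5, 2.0]
print(deficit_runs(expected, observed, min_score=1.0))
```

In this toy input, columns 1 and 2 form the only run whose total deficit clears the threshold; the single-column run at position 4 is too weak and is dropped.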
RNA Folding
CONTRAfold (CONditional TRAining for RNA secondary structure prediction). CONTRAfold is an RNA secondary structure prediction method based on a conditional log-linear model. It was the first to outperform physics-based methods such as Mfold, and it is still widely used. See also RAF.
RAF (RNA Alignment and Folding). RAF is a program for multiple RNA alignment and folding. Given a set of homologous RNA sequences, RAF computes a multiple alignment and consensus structure prediction using a progressive simultaneous alignment and folding algorithm.
Gene Finding
CONTRAST (CONditionally TRAined search for transcripts). CONTRAST predicts protein-coding genes from a multiple genomic alignment using a combination of discriminative machine learning techniques.
Protein Interaction Networks
Graemlin (General and Robust Alignment of Multiple Interaction Networks). Graemlin is our suite of tools for global and local pairwise and multiple alignment of protein interaction networks.
Sequence Alignment
CONTRAlign (CONditional TRAining for protein sequence alignment). CONTRAlign is a parameter learning framework for pairwise protein sequence alignment based on pair conditional random fields.
ProDA. Software for multiple alignment of protein sequences with repeated and shuffled elements. A fun collaboration between our lab, our visitor Tu Minh Phuong, and Robert Edgar (formerly known as Bob Edgar) of MUSCLE fame.
ProbCons (Probabilistic Consistency-based multiple alignment). ProbCons is a protein multiple sequence alignment tool that introduced probabilistic consistency during multiple alignment and led to substantial improvement compared to previous state-of-the-art methods.
LAGAN and MLAGAN alignment toolkit. Our highly popular pairwise and multiple genome aligners. Even though they were published in 2002, they are still routinely installed and utilized.
Copyright © 2015 Serafim Batzoglou