Pabinger S, et al. (2014). A survey of tools for variant analysis of next-generation genome sequencing data. Brief Bioinformatics 15(2):256-278.

NGS pipeline


Case studies

Lagana A, et al. (2018). Precision medicine for relapsed multiple myeloma on the basis of an integrative multiomics approach. JCO Prec Oncol. Data Suppl,

Lu X-M, et al. (2018). Association of breast and ovarian cancers with predisposition genes identified by large-scale sequencing. JAMA Oncol, doi:10.1001/jamaoncol.2018.2956.

Mestek-Boukhibar L, et al. (2018). Rapid Paediatric Sequencing (RaPS): comprehensive real-life workflow for rapid diagnosis of critically ill children. J Med Genet, doi:10.1136/jmedgenet-2018-105396

Castel SE, et al. (2018). Modified penetrance of coding variants by cis-regulatory variation contributes to disease risk. Nat Genet,

Dixon JR, et al. (2018). Integrative detection and analysis of structural variation in cancer genomes. Nat Genet,

Wood DE, et al. (2018). A machine learning approach for somatic mutation discovery. Sci. Transl. Med. 10, eaar7939 (2018) DOI: 10.1126/scitranslmed.aar7939

Agotron detection

The following is as described in

Hansen TB (2018). Detecting Agotrons in Ago CLIPseq Data. in Vang Ørom UA (ed) miRNA Biogenesis-Methods and Protocols, Chapter 17, 221-232. Springer.

tar -zxvf chromFa.tar.gz
cat *.fa > hg19.fa
samtools faidx hg19.fa
bowtie2-build hg19.fa hg19
# GSE78059
for srr in 008/SRR3177718/SRR3177718 009/SRR3177719/SRR3177719 000/SRR3177720/SRR3177720 001/SRR3177721/SRR3177721 002/SRR3177722/SRR3177722 003/SRR3177723/SRR3177723
trim_galore -A TCAGTCACTTCCAGC -length 18 *.fastq.gz
for i in *_trimmed.fq.gz
do
    echo $i
    bowtie2 -q --local -x hg19 -U $i | samtools sort - > $i.sort.bam
    samtools index $i.sort.bam
done

python | python -g hg19.fa | Rscript annotater.R

Note that it is easier to implement with prefetch as shown below.

Alignment and variant calling tutorial

Note that E.coli_K12_MG1655.fa is no longer available; instead we have to download it directly from NCBI, choose FASTA (text), and save the result to a local file. Its empty lines have to be removed; list them with awk '(length($NF)==0){print NR}' E.coli_K12_MG1655.fa.
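
Listing and removing the empty lines can be sketched as follows. This is a minimal illustration rather than the tutorial's own code: demo.fa and demo.clean.fa are hypothetical stand-ins for E.coli_K12_MG1655.fa, and the toy sequence lives in a temporary directory.

```shell
# Scratch directory so that no real data are touched (assumption).
workdir=$(mktemp -d)

# Toy FASTA with two empty lines, standing in for the NCBI download.
printf '>demo\nACGT\n\nTTGA\n\n' > "$workdir/demo.fa"

# Report the line numbers of the empty lines, as in the awk command above.
awk '(length($NF)==0){print NR}' "$workdir/demo.fa"

# Keep only lines with at least one field; NF is 0 for an empty line.
awk 'NF' "$workdir/demo.fa" > "$workdir/demo.clean.fa"
```

Note that awk 'NF' also drops lines made up solely of spaces or tabs, which is usually what is wanted for a FASTA file.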

fastq-dump generates .fastq files, which need to be compressed with gzip.
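
The compression step might look like the sketch below. It assumes the reads written by fastq-dump carry a .fastq extension; toy_1.fastq is a hypothetical file created in a temporary directory purely for illustration.

```shell
# Scratch directory with one toy read (assumption, for illustration only).
fqdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' > "$fqdir/toy_1.fastq"

# gzip replaces each file with a compressed .gz version.
for f in "$fqdir"/*.fastq
do
    gzip "$f"
done
```

Alternatively, the --gzip option of fastq-dump, used later in these notes, compresses the output as it is written.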


Exome sequencing analysis

IMSGC (2018). Low-frequency and rare-coding variation contributes to multiple sclerosis risk. Cell. DOI:

has associated software,

CNV detection

CN-Learn is a framework to integrate Copy Number Variant (CNV) predictions made by multiple algorithms using exome sequencing datasets.

Pounraja VK, et al. (2019) A machine-learning approach for accurate detection of copy-number variants from exome sequencing. Genome Res.

SNP discovery

The following reference describes several pipelines for SNP discovery.

Morin PA, Foote AD, Hill CM, Simon-Bouhet B, Lang AR, Louis M (2018). SNP Discovery from Single and Multiplex Genome Assemblies of Non-model Organisms, in Head SR, et al. (eds.), Next Generation Sequencing: Methods and Protocols, Chapter 9, 113-144, Springer.

whose scripts are available from

See also and the following references,

Martin J, Schackwitz W, Lipzen A (2018). Genomic Sequence Variation Analysis by Resequencing, in de Vries RP, Tsang A, Grigoriev IV (ed) Fungal Genomics-Methods and Protocols, 2e, Chapter 18, 229-239, Springer.

Raghavachari N, Garcia-Reyero N (eds.) (2018), Gene Expression Analysis-Methods and Protocols, Springer.


Mejia-Guerra MK, et al. (2018). Genome-Wide TSS Identification in Maize. Chapter 14, 239-256, in Yamaguchi N (ed.), Plant Transcription Factors-Methods and Protocols, Springer

Comparison of gene expression pipelines on RNA-seq sequencing data.

GSNAP, MapSplice, RUM, STAR, RNA-seq pipeline

# gsnap
tar xfz gmap-gsnap-2018-07-04.tar.gz
cd gmap-2018-07-04
./configure
make
sudo make install
# mapsplice: the latest version has a compiling issue, so install with conda
sudo `which conda` install mapsplice
# rum
git clone
cd rum
perl Makefile.PL
sudo make install
# STAR
git clone
cd STAR/source
make STAR

See for RNA-seq pipeline.

Mendelian RNA-seq

The relevant installations:

conda create --name mendelian-rnaseq-env
source activate mendelian-rnaseq-env
conda install -c bioconda snakemake
conda install -c bioconda rna-seqc
conda install -c bioconda gatk
conda install -c biobuilds plink
conda install -c bioconda star
conda install -c bioconda picard
conda install -c bioconda bwa
conda install -c anaconda colorama
conda install -c bioconda misopy

sra-toolkit, tophat

These are very straightforward, e.g.,

prefetch -v SRR3534842
fastq-dump --split-files --gzip SRR3534842

The SRR3534842.sra file from prefetch is actually stored at $HOME/ncbi/public/sra, and fastq-dump splits it into SRR3534842_1.fastq.gz and SRR3534842_2.fastq.gz in the current directory. However, this location may not be desirable, since it may create a huge .vdi file with VirtualBox; to get around this we do the following:

cd $HOME
mkdir -p /home/jhz22/D/work/ncbi/public/sra
ln -sf /home/jhz22/D/work/ncbi

where D is actually a shared folder on Windows.

To run tophat, see

tar xvfz test_data.tar.gz
cd test_data
tophat -r 20 test_ref reads_1.fq reads_2.fq



It is a tool for genotyping Variable Number Tandem Repeats (VNTR) from sequence data,


ANGSD is software for analyzing next-generation sequencing data. Installation from GitHub is relatively straightforward,

git clone
cd angsd

but the following change is needed on line 468 of misc/msHOT2glf.c: tmppch as in (tmppch=='\0') should be *tmppch as in (*tmppch=='\0'), as suggested by the compiler.

Ubuntu archive

This includes bamtools, bcftools, bedops, bedtools, blast (ncbi-blast+), bowtie2, fastqc, fastx-toolkit, freebayes, hmmer, hisat2, picard-tools, rsem, sambamba, samtools, seqtk, sra-toolkit, subread, tophat, trinityrnaseq, vcftools, vowpal-wabbit.

Install with sudo apt install.
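
Before installing, it can help to see which tools are already on the PATH. The sketch below is only an illustration: missing_tools is a hypothetical helper, and the five names passed to it are cases where the package name and the command name are assumed to coincide, which does not hold for every package in the list above (e.g., ncbi-blast+).

```shell
# Print the names of the given commands that are not found on PATH.
missing_tools() {
    for tool in "$@"
    do
        command -v "$tool" > /dev/null 2>&1 || echo "$tool"
    done
}

# Package and command names coincide for these examples only (assumption).
missing=$(missing_tools samtools bcftools bedtools bowtie2 fastqc)
if [ -n "$missing" ]
then
    # Print the command rather than run it, so it can be reviewed first.
    echo "sudo apt install" $missing
fi
```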

See also

Besides notes above, this is also possible:


tar jfx bcftools-1.9.tar.bz2
cd bcftools-1.9
./configure --prefix=$HPC_WORK
make install

It is necessary to set the BCFTOOLS_PLUGINS environment variable to enable plugins, so we could create a wrapper script at $HPC_WORK/bin instead,


export BCFTOOLS_PLUGINS=$HPC_WORK/bcftools-1.9/plugins
$HPC_WORK/bcftools-1.9/bcftools "$@"

We can then invoke bcftools +check-ploidy my.vcf.gz. Interestingly, this also saves space!
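
One way to create such a wrapper is sketched below. It is not necessarily how the file was produced originally: the shebang line, the use of exec, and the temporary-directory fallback for HPC_WORK are assumptions added for illustration.

```shell
# HPC_WORK is the prefix used in the notes above; the temporary-directory
# fallback is an assumption so that the sketch can be tried safely.
export HPC_WORK=${HPC_WORK:-$(mktemp -d)}
mkdir -p "$HPC_WORK/bin"

# The quoted 'EOF' keeps $HPC_WORK and "$@" literal, so they are expanded
# when the wrapper runs, not when it is written.
# Caution: this overwrites any existing $HPC_WORK/bin/bcftools.
cat > "$HPC_WORK/bin/bcftools" <<'EOF'
#!/bin/sh
export BCFTOOLS_PLUGINS=$HPC_WORK/bcftools-1.9/plugins
exec "$HPC_WORK/bcftools-1.9/bcftools" "$@"
EOF
chmod +x "$HPC_WORK/bin/bcftools"
```

Since the wrapper reads HPC_WORK at run time, the variable has to be exported in the shell from which bcftools is called.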


The project home is, whereby

cd bowtie2-

The test is then self-contained,

export BT2_HOME=/home/jhz22/D/genetics/bowtie2-

$BT2_HOME/bowtie2-build $BT2_HOME/example/reference/lambda_virus.fa lambda_virus
$BT2_HOME/bowtie2 -x lambda_virus -U $BT2_HOME/example/reads/reads_1.fq -S eg1.sam
$BT2_HOME/bowtie2 -x $BT2_HOME/example/index/lambda_virus -1 $BT2_HOME/example/reads/reads_1.fq -2 $BT2_HOME/example/reads/reads_2.fq -S eg2.sam

samtools view -bS eg2.sam > eg2.bam
samtools sort eg2.bam -o eg2.sorted.bam
samtools mpileup -uf $BT2_HOME/example/reference/lambda_virus.fa eg2.sorted.bam | bcftools view -Ov - > eg2.raw.bcf
bcftools view eg2.raw.bcf

As with samtools, etc., it is also possible to use sudo apt install bowtie2.


Genome-wide copy number from high-throughput sequencing, available from

cutadapt, TrimGalore

A prerequisite is to install cython.

git clone
cd cutadapt
sudo python install
git clone


It is a deep neural network to call genetic variants from next-generation DNA sequencing data,



git clone
cd Exomiser
mvn package

See also


The script divides a large FASTQ file into a set of smaller equally sized files,
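
The idea can be illustrated with coreutils alone. This is a rough sketch and not the script referred to above: it assumes four-line FASTQ records (no wrapped sequence lines) and GNU split for the -d and --additional-suffix options; the file names and the chunk size are hypothetical.

```shell
splitdir=$(mktemp -d)

# Toy input with four reads (16 lines).
for i in 1 2 3 4
do
    printf '@read%s\nACGT\n+\nIIII\n' "$i"
done > "$splitdir/toy.fastq"

# Two reads per chunk; the line count must be a multiple of 4
# so that no record is cut in half.
reads_per_chunk=2
split -l $((reads_per_chunk * 4)) -d --additional-suffix=.fastq \
    "$splitdir/toy.fastq" "$splitdir/chunk_"
```

This writes chunk_00.fastq and chunk_01.fastq. Chunks are equal in number of reads, with the last one possibly smaller when the total is not a multiple of the chunk size.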

fastx_toolkit, RSEM

It is also available from along with, and do away with the notorious automake-1.14 problem associated with sources at

However, line 105 of src/fasta_formatter/fasta_formatter.cpp requires usage() to be followed by exit(0); as suggested in the issue section. Moreover, usage() is a void function so its own exit(0) is unnecessary.

The GitHub pages for RSEM are and It is also recommended that the Bioconductor package EBSeq be installed.



git clone --recursive
sudo make install


The source is available from but it is more convenient to use

ln -s `pwd`/gatk $HOME/bin/gatk
gatk --help
gatk --list

hisat2, sambamba, picard-tools, StringTie

Except for StringTie, these overlap with the apt install list above,

brew tap brewsci/bio
brew tap brewsci/science
brew install hisat2
brew install sambamba
brew install picard-tools
brew install stringtie

It could also be useful to run brew reinstall. See

Raghavachari N, Garcia-Reyero N (eds.) (2018), Gene Expression Analysis-Methods and Protocols, Chapter 15, Springer.

Nevertheless, it may be slower than sudo apt install, e.g., for tophat.


The download can be seeded from, e.g.,

Again, the source code is on GitHub. For developers, igv.js is very appealing.



INserted Sequence Information DEtectoR (INSIDER) analyses whole genome sequencing data and identifies segments of potentially foreign origin by their significant shift in k-mer signatures.

Tay, A.P., Hosking, B., Hosking, C., Bauer, D.C. & Wilson, L.O.W. INSIDER: alignment-free detection of foreign DNA sequences. Computational and Structural Biotechnology Journal 19, 3810-3816 (2021).


The GitHub repository uses a project object model (POM), an XML representation of a Maven project held in a file named pom.xml. We therefore install maven first,

sudo apt install maven

The installation then proceeds as follows,

git clone
cd jannovar
mvn package

Other tasks such as compile, test, etc. are also possible.

It is handy to use a symbolic link, i.e.,

ln -s /home/jhz22/D/genetics/jannovar/jannovar-cli/target/jannovar-cli-0.24.jar $HOME/bin/Jannovar.jar
java -jar $HOME/bin/Jannovar.jar db-list
java -jar $HOME/bin/Jannovar.jar download -d hg19/refseq

We may need to set memory size, e.g.,

java -Xms2G -Xmx4G -jar $HOME/bin/Jannovar.jar


MEthyLation Inference for Single cell Analysis (Melissa) is a Bayesian hierarchical method to quantify spatially-varying methylation profiles across genomic regions from single-cell bisulfite sequencing data (scBS-seq). Melissa clusters individual cells based on local methylation patterns, enabling the discovery of epigenetic diversities and commonalities among individual cells. The clustering also acts as an effective regularisation method for imputation of methylation on unassayed CpG sites, enabling transfer of information between individual cells.

Kapourani C-A, Sanguinetti G (2019). Melissa: Bayesian clustering and imputation of single-cell methylomes, Genome Biology 20:61,


Cmero, M. et al. MINTIE: identifying novel structural and splice variants in transcriptomes using RNA-seq data. Genome Biology 22, 296 (2021).


Jiang D, McPeek MS (2014). Robust Rare Variant Association Testing for Quantitative Traits in Samples with Related Individuals. Genetic Epidemiology 38(1):10-20


The software can be obtained from

After htslib is installed, the canonical instruction is to issue

git clone
cd htslib
sudo make install
cd -
git clone
cd pindel
./INSTALL ../htslib

It is common for the compiler to complain about pindel.cpp, bddata.cpp and genotyping.cpp, because abs() rather than fabs() from the header file cmath has been used. The issue goes away when abs is replaced with fabs; in the case of bddata.cpp, it is also necessary to include the header, i.e.,

#include <cmath>
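
The replacement could be scripted roughly as follows. This is a sketch under stated assumptions, not the procedure actually used: it relies on GNU sed (for \b, -i and the one-line 1i form), and demo.cpp is a hypothetical toy file standing in for the three pindel sources.

```shell
srcdir=$(mktemp -d)
printf 'double d = abs(x - y);\ndouble e = fabs(z);\n' > "$srcdir/demo.cpp"

# \b (GNU sed) anchors at a word boundary, so existing fabs( calls are left
# alone; the original is kept as demo.cpp.bak.
sed -i.bak -E 's/\babs\(/fabs(/g' "$srcdir/demo.cpp"

# Add the header if it is not already included.
grep -q '#include <cmath>' "$srcdir/demo.cpp" ||
    sed -i '1i #include <cmath>' "$srcdir/demo.cpp"
```

A blanket substitution also rewrites any abs() call on integers, so it is worth comparing each file against its .bak copy before rebuilding.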


It is available from and GitHub,

git clone
dir dist


While the source contains ldc2, which is also readily available from the Ubuntu archive, it nevertheless failed to compile, so we proceed with the instructions on GitHub, e.g.,

export PATH=$HOME/ldc2-1.10.0-linux-x86_64/bin:$PATH
export LIBRARY_PATH=$HOME/ldc2-1.10.0-linux-x86_64/lib

for version 1.10.0.


To build from source, we do the following,

git clone
cd htslib
cd -
git clone
cd samtools
autoheader            # Build (this may generate a warning about
                      # AC_CONFIG_SUBDIRS - please ignore it).
autoconf -Wno-syntax  # Generate the configure script
./configure           # Needed for choosing optional functionality
make install

Note that bgzip and tabix are distributed with htslib. It is relatively easy to install from a release,

tar xjf samtools-1.9.tar.bz2
cd samtools-1.9
./configure --prefix=/scratch/jhz22
make install
cd -
cd htslib-1.9
./configure --prefix=/scratch/jhz22
make install

where we install tabix as well.

SnpEff, SnpSift, clinEff

It is straightforward with the compiled version from sourceforge, which also includes clinEff.

unzip snpEff_latest_core
cd snpEff
java -jar snpEff.jar databases
java -jar snpEff.jar download GRCh38.76
tar fvxz test_cases.tgz

This lists all the databases and downloads a particular one. Later, the test files are also downloaded and extracted.

The following steps compile from source instead.

git clone
cd SnpEff
mvn package
mvn install
cd lib
# Antlr
mvn install:install-file \
    -Dfile=antlr-4.5.1-complete.jar \
    -DgroupId=org.antlr \
    -DartifactId=antlr \
    -Dversion=4.5.1 \
    -Dpackaging=jar

# BioJava core
mvn install:install-file \
    -Dfile=biojava3-core-3.0.7.jar \
    -DgroupId=org.biojava \
    -DartifactId=biojava3-core \
    -Dversion=3.0.7 \
    -Dpackaging=jar

# BioJava structure
mvn install:install-file \
    -Dfile=biojava3-structure-3.0.7.jar \
    -DgroupId=org.biojava \
    -DartifactId=biojava3-structure \
    -Dversion=3.0.7 \
    -Dpackaging=jar

cd -

# SnpSift
git clone
cd SnpSift
mvn package
mvn install

which gives target/SnpEff-4.3.jar and target/SnpSift-4.3.jar, respectively.

Note that antlr4 is from GitHub, See also


It is available from


Short for Structural Variant detection in Circulating Tumor DNA, it is available from



RNA-Seq De novo Assembly Using Trinity,


Hosted at, the .jar files are ready to use with

git clone

or from the repository releases.

See for further information.


Assuming that we use zlib 1.2.8 from module zlib/1.2.8, we can do the following,

tar xvfz vcftools-0.1.16.tar.gz
cd vcftools-0.1.16
module load zlib/1.2.8
./configure --prefix=/scratch/jhz22 ZLIB_CFLAGS="-I/usr/local/Cluster-Apps/zlib/1.2.8/include" ZLIB_LIBS="-L/usr/local/Cluster-Apps/zlib/1.2.8/lib -lz"
make install

To use vcf-concat, it is necessary to set the PERL5LIB environment variable, e.g.,

export PERL5LIB=/scratch/jhz22/share/perl5


Allele-specific pipeline for unbiased read mapping and molecular QTL discovery,