NGS

Review

Pabinger S, et al. (2014). A survey of tools for variant analysis of next-generation genome sequencing data. Brief Bioinformatics 15(2):256-278.

NGS pipeline

https://www.hgsc.bcm.edu/software/mercury

Annotation

https://sites.google.com/site/jpopgen/wgsa

Case studies

Lagana A, et al. (2018). Precision medicine for relapsed multiple myeloma on the basis of an integrative multiomics approach. JCO Prec Oncol. Data Suppl, http://ascopubs.org/doi/suppl/10.1200/PO.18.00019

Lu X-M, et al. (2018). Association of breast and ovarian cancers with predisposition genes identified by large-scale sequencing. JAMA Oncol, doi:10.1001/jamaoncol.2018.2956.

Mestek-Boukhibar L, et al. (2018). Rapid Paediatric Sequencing (RaPS): comprehensive real-life workflow for rapid diagnosis of critically ill children. J Med Genet, doi:10.1136/jmedgenet-2018-105396

Castel SE, et al. (2018). Modified penetrance of coding variants by cis-regulatory variation contributes to disease risk.Nat Genet, https://doi.org/10.1038/s41588-018-0192-y.

Dixon JR, et al. (2018). Integrative detection and analysis of structural variation in cancer genomes. Nat Genet, https://www.nature.com/articles/s41588-018-0195-8

Wood DE, et al. (2018). A machine learning approach for somatic mutation discovery. Sci. Transl. Med. 10, eaar7939 (2018) DOI: 10.1126/scitranslmed.aar7939

Agotron detection

The following is according to https://github.com/ncrnalab/agotron_detector as described in

Hansen TB (2018). Detecting Agotrons in Ago CLIPseq Data. in Vang ├śrom UA (ed) miRNA Biogenesis-Methods and Protocols, Chapter 17, 221-232. Springer.

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
tar -zxvf chromFa.tar.gz
cat *.fa > hg19.fa
samtools faidx hg19.fa
bowtie2-build hg19.fa hg19
# GSE78059
for srr in 008/SRR3177718/SRR3177718 009/SRR3177719/SRR3177719 000/SRR3177720/SRR3177720 001/SRR3177721/SRR3177721 002/SRR3177722/SRR3177722 003/SRR3177723/SRR3177723
do
   wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR317/$srr.fastq.gz
done
trim_galore -A TCAGTCACTTCCAGC -length 18 *.fastq.gz
for i in *_trimmed.fq.gz
do
    echo $i
    bowtie2 -q --local -x hg19 -U $i | samtools sort - > $i.sort.bam    
    samtools index $i.sort.bam

done
python UCSC_intron_retriever.py | python analyzer.py -g hg19.fa | Rscript annotater.R

Note that it is easier to implement with prefetch as shown below.

Alignment and variant calling tutorial

See https://github.com/ekg/alignment-and-variant-calling-tutorial. Note that E.coli_K12_MG1655.fa is unavailable any more, instead we have to download it directly from NCBI, https://www.ncbi.nlm.nih.gov/nuccore/556503834, choose FASTA (text), to reach https://www.ncbi.nlm.nih.gov/nuccore/NC_000913.3?report=fasta&log$=seqview&format=text and save to a local file, whose empty lines have to be removed, see them with awk '(length($NF)==0){print NR}' E.coli_K12_MG1655.fa.

The fastq-dump generates .fa files, which need to be compressed with gzip.

bowtie-scaling

https://github.com/BenLangmead/bowtie-scaling

Exome sequencing analysis

IMSGC (2018). Low-frequency and rare-coding variation contributes to multiple sclerosis risk. Cell. DOI:https://doi.org/10.1016/j.cell.2018.09.049

has associate software, https://github.com/cotsapaslab/IMSGCexomechip.

CNV detection

CN-Learn is a framework to integrate Copy Number Variant (CNV) predictions made by multiple algorithms using exome sequencing datasets.

https://github.com/girirajanlab/CN_Learn

Pounraja VK, et al. (2019) A machine-learning approach for accurate detection of copy-number variants from exome sequencing. Genome Res.

SNP discovery

The following reference discribes several pipelines for SNP discovery.

Morin PA, Foote AD, Hill CM, Simon-Bouhet B, Lang AR, Louis M (2018). SNP Discovery from Single and Multiplex Genome Assemblies of Non-model Organisms, in Head SR, et al. (eds.), Next Generation Sequencing: Methods and Protocols, Chapter 9, 113-144, Springer.

whose scripts are available from https://github.com/PAMorin/SNPdiscovery/.

See also https://github.com/sanger-pathogens/snp-sites and the following references,

Martin J, Schackwitz W, Lipzen A (2018). Genomic Sequence Variation Analysis by Resequencing, in de Vries RP, Tsang A, Grigoriev IV (ed) Fungal Genomics-Methods and Protocols, 2e, Chapter 18, 229-239, Springer.

Raghavachari N, Garcia-Reyero N (eds.) (2018), Gene Expression Analysis-Methods and Protocols, Springer.

TSS

Mejia-Guerra MK, et al. (2018). Genome-Wide TSS Identification in Maize. Chapter 14, 239-256, in Yamaguchi N (ed.), Plant Transcription Factors-Methods and Protocols, Springer

Comparison of gene expression pipelines on RNA-seq sequencing data.

http://statapps.ugent.be/tools/AppDGE/

GSNAP, MapSplice, RUM, STAR, RNA-seq pipeline

# gsnap
wget http://research-pub.gene.com/gmap/src/gmap-gsnap-2018-07-04.tar.gz
tar xfz gmap-gsnap-2018-07-04.tar.gz
cd gmap-2018-07-04
./configure
make
sudo make install
# mapsplice, the latest version from http://protocols.netlab.uky.edu/~zeng/MapSplice-v2.2.1.zip has compiling issue
sudo `which conda` install mapsplice
mapsplice.py
# rum
git clone https://github.com/itmat/rum
cd rum
perl Makefile.PL
make
sudo make install
# STAR
git clone https://github.com/alexdobin/STAR
cd STAR/source
make

See https://github.com/sanger-pathogens/Bio-RNASeq for RNA-seq pipeline.

Mendelian RNA-seq

https://github.com/komalsrathi/MendelianRNA-seq

The relevant installations:

conda create --name mendelian-rnaseq-env
source activate mendelian-rnaseq-env
conda install -c bioconda snakemake
conda install -c bioconda rna-seqc
conda install -c bioconda gatk
conda install -c biobuilds plink
conda install -c bioconda star
conda install -c bioconda picard
conda install -c bioconda bwa
conda install -c anaconda colorama
conda install -c bioconda misopy

sra-toolkit, tophat

These are very straightforward, e.g.,

prefetch -v SRR3534842
fastq-dump --split-files --gzip SRR3534842

the SRR3534842.sra from prefetch is actually at $HOME/ncbi/public/sra which is split into SRR3534842_1.fastq.gz, SRR3534842_2.fastq.gz at the current directory. See https://www.biostars.org/p/111040/. However, the location may not desirable since it may create a huge .vdi files with VirtualBox -- to get around we do this

cd $HOME
mkdir -p /home/jhz22/D/work/ncbi/public/sra
ln -sf /home/jhz22/D/work/ncbi

where D is actually a shared folder at Windows.

To run tophat, see https://ccb.jhu.edu/software/tophat/tutorial.shtml

wget https://ccb.jhu.edu/software/tophat/downloads/test_data.tar.gz
tar xvfz test_data.tar.gz
cd test_data
tophat -r 20 test_ref reads_1.fq reads_2.fq

Software

adVNTR

It is a tool for genotyping Variable Number Tandem Repeats (VNTR) from sequence data, https://github.com/mehrdadbakhtiari/adVNTR.

ANGSD

ANGSD is a software for analyzing next generation sequencing data, http://www.popgen.dk/angsd/index.php/ANGSD. It is relatively straightforward with GitHub; after

git clone https://github.com/ANGSD/angsd
cd angsd
make

but the following change is needed on line 468 of misc/msHOT2glf.c: tmppch as in (tmppch=='\0') should be *tmppch as in (*tmppch==''0'), suggested by the compiler.

Ubuntu archive

This include bamtools, bcftools, bedops, bedtools, blast (ncbi-blast+), bowtie2, fastqc, fastx-toolkit, freebayes, hmmer, hisat2, picard-tools, rsem, sambamba, samtools, seqtk, sra-toolkit, subread, tophat, trinityrnaseq, vcftool, vowpal-wabbit.

Install with sudo apt install.

See also https://github.com/lh3/seqtk.

Besides notes above, this is also possible:

bcftools

wget https://github.com/samtools/bcftools/releases/download/1.9/bcftools-1.9.tar.bz2
tar jfx bcftools-1.9.tar.bz2
cd bcftools-1.9
./configure --prefix=$HPC_WORK
make
make install

It is necessary to set the environment variables to enable plugins, so we could generate a version at $HPC_WORK/bin instead,

#!/usr/bin/bash

export BCFTOOLS_PLUGINS=$HPC_WORK/bcftools-1.9/plugins
$HPC_WORK/bcftools-1.9/bcftools "$@"

We invoke bcftools +check_ploidy my.vcf.gz Interestingly, this also save space!

bowtie2

The project home is https://sourceforge.net/projects/bowtie-bio, whereby

wget https://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.3.4.1/bowtie2-2.3.4.1-linux-x86_64.zip
wget https://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.3.4.1/bowtie2-2.3.4.1-source.zip
unzip bowtie2-2.3.4.1-linux-x86_64.zip
cd bowtie2-2.3.4.1-linux-x86_64/

The test is then self-contained,

export BT2_HOME=/home/jhz22/D/genetics/bowtie2-2.3.4.1-linux-x86_64

$BT2_HOME/bowtie2-build $BT2_HOME/example/reference/lambda_virus.fa lambda_virus
$BT2_HOME/bowtie2 -x lambda_virus -U $BT2_HOME/example/reads/reads_1.fq -S eg1.sam
$BT2_HOME/bowtie2 -x $BT2_HOME/example/index/lambda_virus -1 $BT2_HOME/example/reads/reads_1.fq -2 $BT2_HOME/example/reads/reads_2.fq -S eg2.sam

samtools view -bS eg2.sam > eg2.bam
samtools sort eg2.bam -o eg2.sorted.bam
samtools mpileup -uf $BT2_HOME/example/reference/lambda_virus.fa eg2.sorted.bam | bcftools view -Ov - > eg2.raw.bcf
bcftools view eg2.raw.bcf

Like samtools, etc. it is possible to involve sudo apt install bowtie2.

CNVkit

Genome-wide copy number from high-throughput sequencing, available from https://cnvkit.readthedocs.io/en/stable/

cutadapt, TrimGalore

A prerequesite is to install cython.

git clone https://github.com/marcelm/cutadapt
cd cutadapt
sudo python setup.py install
git clone https://github.com/FelixKrueger/TrimGalore

DeepVariant

It is a deep neural network to call genetic variants from next-generation DNA sequencing data, https://github.com/google/deepvariant.

EasyQC

http://www.niot.res.in/EasyQC/

Exomiser

git clone https://github.com/exomiser/Exomiser
cd Exomiser
mvn package

See also https://github.com/exomiser/exomiser-demo.

fastq-splitter

The scripts divides a large FASTQ file into a set of smaller equally sized files, http://kirill-kryukov.com/study/tools/fastq-splitter/.

fastx_toolkit, RSEM

It is also available from https://github.com/agordon/fastx_toolkit along with https://github.com/agordon/libgtextutils, and do away with the notorious automake-1.14 problem associated with sources at http://hannonlab.cshl.edu/fastx_toolkit/download.html.

However, line 105 of src/fasta_formatter/fasta_formatter.cpp requires usage() followed by exit(0); as suggested in the issue section. More oever, usage() is a void function so its own exit(0) is unnecessary.

The GitHub pages for RSEM are https://github.com/deweylab/RSEM and https://deweylab.github.io/RSEM/. It is also recommended that the Bioconductor package EBSeq be installed.

freebayes

Try

git clone --recursive https://github.com/ekg/freebayes
make
sudo make install

GATK

The source is available from https://github.com/broadinstitute/gatk/ but it is more convenient to use https://github.com/broadinstitute/gatk/releases/.

ln -s `pwd`/gatk $HOME/bin/gatk
gatk --help
gatk --list

hisat2, sambamba, picard-tools, StringTie

Except StringTie, this is overlapped with apt install above,

brew tap brewsci/bio
brew tap brewsci/science
brew install hisat2
hisat2-build
brew install sambamba
brew install picard-tools
brew install stringtie

It could be useful with ``brew reinstall```. See

Raghavachari N, Garcia-Reyero N (eds.) (2018), Gene Expression Analysis-Methods and Protocols, https://www.springer.com/us/book/9781493978335, Chapter 15, Springer.

Nevertheless it may be slower, e.g., tophat, compared to sudo apt install.

IGV

The download can be seeded from http://data.broadinstitute.org/igv, e.g., http://data.broadinstitute.org/igv/projects/downloads/2.4/IGV_2.4.10.zip.

Again the source code is from GitHub, https://github.com/igvteam/igv/. For developers, igv.js is very appealing.

INSIDER

Web: https://github.com/aehrc/INSIDER

INserted Sequence Information DEtectoR (INSIDER) analyses whole genome sequencing data and identifies segments of potentially foreign origin by their significant shift in k-mer signatures.

Tay, A.P., Hosking, B., Hosking, C., Bauer, D.C. & Wilson, L.O.W. INSIDER: alignment-free detection of foreign DNA sequences. Computational and Structural Biotechnology Journal 19, 3810-3816 (2021).

Jannovar

From the GitHub repository, it is seen to use project object model (POM), an XML representation of a Maven project held in a file named pom.xml. We therefore install maven first,

sudo apt install maven

The installation then proceeds as follows,

git clone https://github.com/charite/jannovar
cd jannovar
mvn package

Other tasks such as compile, test, etc. are also possible.

It is handy to use symbolic link, i.e.,

ln -s /home/jhz22/D/genetics/jannovar/jannovar-cli/target/jannovar-cli-0.24.jar $HOME/bin/Jannovar.jar
java -jar $HOME/bin/Jannovar.jar db-list
java -jar $HOME/bin/Jannovar.jar download -d hg19/refseq

We may need to set memory size, e.g.,

java -Xms2G -Xmx4G -jar $HOME/bin/Jannovar.jar

Melissa

https://github.com/andreaskapou/Melissa

MEthyLation Inference for Single cell Analysis (Melissa), is a Bayesian hierarchical method to quantify spatially-varying methylation profiles across genomic regions from single-cell bisulfite sequencing data (scBS-seq). Melissa clusters individual cells based on local methylation patterns, enabling the discovery of epigenetic diversities and commonalities among individual cells. The clustering also acts as an effective regularisation method for imputation of methylation on unassayed CpG sites, enabling transfer of information between individual cells.

Kapourani C-A, Sanguinetti G (2019). Melissa: Bayesian clustering and imputation of single-cell methylomes, Genome Biology 20:61, https://doi.org/10.1186/s13059-019-1665-8

MINITE

https://github.com/Oshlack/MINTIE

Cmero, M. et al. MINTIE: identifying novel structural and splice variants in transcriptomes using RNA-seq data. Genome Biology 22, 296 (2021).

MONSTER

http://galton.uchicago.edu/~mcpeek/software/MONSTER/ (http://galton.uchicago.edu/~mcpeek/software/MONSTER/MONSTER_v1.3.tar.gz)

Jiang D, McPeek MS (2014). Robust Rare Variant Association Testing for Quantitative Traits in Samples with Related Individuals. Genetic Epidemiology 38(1):10-20

pindel

The software can be obtained from https://github.com/genome/pindel.

After htslib is installed, the canonical instruction is to issue

git clone https://github.com/samtools/htslib
cd htslib
make
sudo make install
cd -
git clone https://github.com/genome/pindel
cd pindel
./INSTALL ../htslib

It is 'standard' to have complaints about pindel.cpp, bddate.cpp and genotyping.cpp, for abs() rather than fabs() from the header file cmath have been used. The issue goes away when abs is replaced with fabs and in the case of bddata.cpp, it is also necessary to invoke the header, i.e.,

#include <cmath>

rtg-tools

It is available from https://www.realtimegenomics.com/products/rtg-tools and GitHub,

git clone https://github.com/RealTimeGenomics/rtg-tools.git
ant
dir dist

sambamba

While the source contains ldc2, it is readily available with Ubuntu archive nevertheless failed to compile, so we proceed with instructions at the GitHub, e.g.,

export PATH=$HOME/ldc2-1.10.0-linux-x86_64/bin:$PATH
export LIBRARY_PATH=$HOME/ldc2-1.10.0-linux-x86_64/lib

for version 1.10.0.

samtools

To build from source, we do these,

git clone https://github.com/samtools/htslib
cd htslib
make
cd -
git clone https://github.com/samtools/samtools
cd samtools
autoheader            # Build config.h.in (this may generate a warning about
                      # AC_CONFIG_SUBDIRS - please ignore it).
autoconf -Wno-syntax  # Generate the configure script
./configure           # Needed for choosing optional functionality
make
make install

Note bgzip and tabix are distributed with htslib. It is relatively easier to install from release,

wget https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2
tar xjf samtools-1.9.tar.gz
cd samtools-1.9
./configure --prefix=/scratch/jhz22
make
make install
cd -
cd htslib-1.9
./configure --prefix=/scratch/jhz22
make
make install

where we install tabix as well.

SnpEff, SnpSift, clinEff

It is straightforward with the compiled version from sourceforge, which also includes clinEff.

wget http://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip
unzip snpEff_latest_core
cd snpEff
java -jar snpEff.jar databases
java -jar snpEff.jar download GRCh38.76
wget http://sourceforge.net/projects/snpeff/files/databases/test_cases.tgz
tar fvxz test_cases.tgz

lists all the databases and download a particular one. Later, the test files are also downloaded and extracted.

The following steps compile from source instead.

git clone https://github.com/pcingola/SnpEff.git
cd SnpEff
mvn package
mvn install
cd lib
# Antlr
mvn install:install-file \
    -Dfile=antlr-4.5.1-complete.jar \
    -DgroupId=org.antlr \
    -DartifactId=antlr \
    -Dversion=4.5.1 \
    -Dpackaging=jar

# BioJava core
mvn install:install-file \
    -Dfile=biojava3-core-3.0.7.jar \
    -DgroupId=org.biojava \
    -DartifactId=biojava3-core \
    -Dversion=3.0.7 \
    -Dpackaging=jar

# BioJava structure
mvn install:install-file \
    -Dfile=biojava3-structure-3.0.7.jar \
    -DgroupId=org.biojava \
    -DartifactId=biojava3-structure \
    -Dversion=3.0.7 \
    -Dpackaging=jar

cd -

# SnpSift
git clone https://github.com/pcingola/SnpSift.git
cd SnpSift
mvn package
mvn install

which gives target/SnpEff-4.3.jar and target/SnpSift-4.3.jar, respectively.

Note that antlr4 is from GitHub, https://github.com/antlr/antlr4. See also https://github.com/sanger-pathogens/SnpEffWrapper.

subread

It is available from http://subread.sourceforge.net/.

SViCT

Short for Structural Variant detrction in Circulating Tumor DNA and is available from

https://github.com/vpc-ccg/svict

tagdust

http://sourceforge.net/projects/tagdust/

Trinity

RNA-Seq De novo Assembly Using Trinity, https://github.com/trinityrnaseq/trinityrnaseq/wiki.

VarScan

Hosted at https://github.com/dkoboldt/varscan, the .jar files are ready to use with

git clone https://github.com/dkoboldt/varscan

or from the repository releases.

See http://varscan.sourceforge.net/ for further information.

vcftools

Assuming that we use zlib 1.2.8 from module zlib/1.2.8, we can do the following,

wget https://github.com/vcftools/vcftools/releases/download/v0.1.16/vcftools-0.1.16.tar.gz
tar xvfz vcftools-0.1.16.tar.gz
module load zlib/1.2.8
./configure --prefix=/scratch/jhz22 ZLIB_CFLAGS="-I/usr/local/Cluster-Apps/zlib/1.2.8/include" ZLIB_LIBS="-L/usr/local/Cluster-Apps/zlib/1.2.8/lib -lz"
make
make install

To use vcf-concat, it is necessary to set the PERL5LIB environment variables, e.g.,

export PERL5LIB=/scratch/jhz22/share/perl5

WASP

Allele-specific pipeline for unbiased read mapping and molecular QTL discovery, https://github.com/bmvdgeijn/WASP/.