The repository shows how to provide input for PW-pipeline and FM-pipeline (including GCTA), among others; it also shows how to generate SNPid-RSid pairs, which may be required for consortium meta-analysis.
Briefly, the format has the following columns,
Column | Name | Description |
---|---|---|
1 | SNP | RSid |
2 | A1 | Effect allele |
3 | A2 | Other allele |
4 | freqA1 | A1 frequency |
5 | beta | effect estimate |
6 | se | standard error of effect |
7 | P | P-value |
8 | N | sample size |
9* | chr | chromosome |
10* | pos | position |
* see next section.
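For orientation, the following is a minimal sketch of a file in this layout; the file name toy.tsv and the single data row are purely hypothetical.
# write the expected tab-delimited header plus one made-up data row (illustrative values only)
printf 'SNP\tA1\tA2\tfreqA1\tbeta\tse\tP\tN\tchr\tpos\n' > toy.tsv
printf 'rs0000001\tA\tG\t0.25\t0.05\t0.01\t5e-8\t100000\t1\t1000000\n' >> toy.tsv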
Whether to use UCSC or dbSNP is a matter of personal choice; both routes are described below.
Chromosomes and positions can be obtained from https://genome.ucsc.edu/, which is helpful for GWAS summary statistics that use chromosomal positions from a different build or lack them altogether.
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/snp150.txt.gz
gunzip -c snp150.txt.gz | \
awk '{split($2,a,"_");sub(/chr/,"",a[1]);print a[1],$4,$5}' | \
sort -k3,3 > snp150.txt
where it first obtains build 37 positions[^1] and then sorts them by RSid into the file snp150.txt. More flexibly, we can proceed as in TWAS-pipeline on refGene by selecting appropriate columns,
# from the MySQL database
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e 'select * from snp150' > snp150.txt
# into a MySQL database
gunzip -c snp150.txt.gz > snp150.txt
(
wget -qO- http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/snp150.sql
echo "load data local infile 'snp150.txt' into table snp150;"
) > snp150.sql
where snp150.sql is amended with a LOAD DATA statement. The option -A on the MySQL command line skips auto-rehash (completion of table and column names), so the connection starts faster, and -D specifies the database. More generally, it is possible to use MySQL interactively,
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A <<END
show databases;
use hg19;
show tables;
describe snp150;
select * from snp150 LIMIT 1,30;
select chrom, chromStart, chromEnd, strand, name, refNCBI, observed from snp150 WHERE name="rs548419688";
END
which illustrates some useful commands.
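For a quick one-off lookup, the same query can also be issued non-interactively; a sketch, with -N suppressing the column headers,
# look up a single RSid directly from the UCSC public MySQL server
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -N \
      -e 'select chrom, chromEnd, name from snp150 where name="rs548419688"'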
The RSid – SNPid (chromosome:position_allele1_allele2 such that allele1 < allele2) pairing can be achieved as follows,
gunzip -c snp150.txt.gz | \
cut -f2,4,5,10 | \
awk '$1~/^chr[0-9]+$|^chrX$|^chrY$/{if(!index($4,"-")) {split($4,a,"/"); print $1 ":" $2 "_" a[1] "_" a[2], $3}}' | \
sort -k1,1 | \
gzip -f > snp150.snpid_rsid.gz
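A sketch of how this pairing file might be used: given a hypothetical file mystats.txt whose first column holds SNPids in the same chromosome:position_allele1_allele2 form and which is sorted on that column, the matching RSids can be attached with join,
# append the matching RSid to each SNPid-keyed record (mystats.txt is hypothetical, sorted by column 1)
gunzip -c snp150.snpid_rsid.gz | \
join -11 -21 mystats.txt - > mystats.rsid.txt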
The script below downloads dbSNP build 154, replaces the genomic accession numbers (GANs) with familiar chromosome names via cthreepo, https://github.com/vkkodali/cthreepo, and indexes the output with tabix[^2].
module load ceuadmin/bcftools
module load ceuadmin/htslib
source ~/rds/public_databases/software/py38/bin/activate
pip install cthreepo
# hg19
wget https://ftp.ncbi.nih.gov/snp/archive/b154/VCF/GCF_000001405.25 -O snp154_GCF_000001405.25
cthreepo --infile snp154_GCF_000001405.25 --id_from rs --id_to uc --format vcf --mapfile h37 --outfile snp154_hg19.vcf
bgzip -f snp154_hg19.vcf
tabix -p vcf -f snp154_hg19.vcf.gz
# hg38
wget https://ftp.ncbi.nih.gov/snp/archive/b154/VCF/GCF_000001405.38 -O snp154_GCF_000001405.38
cthreepo --infile snp154_GCF_000001405.38 --id_from rs --id_to uc --format vcf --mapfile h38 --outfile snp154_hg38.vcf
bgzip -f snp154_hg38.vcf
tabix -p vcf -f snp154_hg38.vcf.gz
# index snp154_GCF.*
bgzip -f snp154_GCF_000001405.25
tabix -p vcf snp154_GCF_000001405.25.gz
bgzip -f snp154_GCF_000001405.38
tabix -p vcf snp154_GCF_000001405.38.gz
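Once indexed, the converted VCFs can be queried by region with tabix; a sketch, assuming the cthreepo conversion leaves UCSC-style names such as chr1,
# list dbSNP build 154 entries in an arbitrary region of chromosome 1 (build 37 coordinates)
tabix snp154_hg19.vcf.gz chr1:10001-20000 | head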
The GANs refer to the reference sequences for each chromosome, as catalogued in databases such as GenBank and by the Genome Reference Consortium (GRC). The correspondence is tabulated below and can also be seen, for instance, in the snp154_GCF_000001405.* files above.
Chromosome | hg19 (GRCh37) | hg38 (GRCh38) |
---|---|---|
1 | NC_000001.10 | NC_000001.11 |
2 | NC_000002.11 | NC_000002.12 |
3 | NC_000003.11 | NC_000003.12 |
4 | NC_000004.11 | NC_000004.12 |
5 | NC_000005.9 | NC_000005.10 |
6 | NC_000006.11 | NC_000006.12 |
7 | NC_000007.13 | NC_000007.14 |
8 | NC_000008.10 | NC_000008.11 |
9 | NC_000009.11 | NC_000009.12 |
10 | NC_000010.10 | NC_000010.11 |
11 | NC_000011.9 | NC_000011.10 |
12 | NC_000012.11 | NC_000012.12 |
13 | NC_000013.10 | NC_000013.11 |
14 | NC_000014.8 | NC_000014.9 |
15 | NC_000015.9 | NC_000015.10 |
16 | NC_000016.9 | NC_000016.10 |
17 | NC_000017.10 | NC_000017.11 |
18 | NC_000018.9 | NC_000018.10 |
19 | NC_000019.9 | NC_000019.10 |
20 | NC_000020.10 | NC_000020.11 |
21 | NC_000021.8 | NC_000021.9 |
22 | NC_000022.10 | NC_000022.11 |
X | NC_000023.10 | NC_000023.11 |
Y | NC_000024.9 | NC_000024.10 |
The “NC_” prefix is the RefSeq accession prefix for a complete genomic molecule (reference assembly) and is part of the standard naming convention used in genomics databases.
Note that the chromosome number in column 1 corresponds to the middle field of the accession in column 2, delimited by _ and ., and can therefore be extracted with awk -F'[._]' '{print $2+0}'; when the accession appears as a column within a file, slightly different handling is needed (see the sketch below and the bcftools example that follows).
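For instance, a sketch (acc.tsv is a hypothetical tab-delimited file with the accession in column 1),
# a single accession: NC_000010.10 -> 10 (the +0 strips leading zeros)
echo NC_000010.10 | awk -F'[._]' '{print $2+0}'
# the same extraction when the accession sits in column 1 of a file
awk -F'\t' -vOFS='\t' '{split($1,a,"[._]"); $1=a[2]+0; print}' acc.tsv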
The counterpart for RSid – SNPid pairing can be furnished as follows,
bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT[\t%SAMPLE=%GT]\n' snp154_GCF_000001405.25.gz | \
awk '{gsub(/NC_[0]+|\.[0-9]+/,"",$1);print $3,$1":"$2":"$4":"$5}'
It is also handy to append -r NC_000001.10:10001 to the bcftools query to restrict it to a specific region.
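For example, a sketch over an arbitrary region,
# RSid - SNPid pairs restricted to part of chromosome 1 (RefSeq accession NC_000001.10, build 37)
bcftools query -r NC_000001.10:10001-30000 \
  -f '%CHROM\t%POS\t%ID\t%REF\t%ALT[\t%SAMPLE=%GT]\n' snp154_GCF_000001405.25.gz | \
awk '{gsub(/NC_[0]+|\.[0-9]+/,"",$1);print $3,$1":"$2":"$4":"$5}' | \
head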
Besides the standard chromosomes, the reference assembly also contains other categories of sequence[^3],
- Random contigs (e.g., chrY_KI270740v1_random). Unlocalized sequences that are known to originate from specific chromosomes, but whose exact location within the chromosome is not known (e.g., chr9_KI270720v1_random).
- ChrUn (e.g., chrUn_GL000218v1). Unplaced sequences that are known to originate from the human genome, but which cannot be confidently placed on a specific chromosome.
- EBV. Epstein-Barr virus sequence, representing the genome of Human herpesvirus 4 type 1, the cause of infectious mononucleosis. This disease results in fever, sore throat, and enlarged cervical lymph nodes. About 98% of adults have been exposed to the virus by the age of 40 years. Since the virus remains latent in the body after infection, it is very common to find EBV sequences when performing human genome sequencing.
- HLA (e.g., HLA-C*01:08). Human leukocyte antigen (HLA) sequences representing selected alleles of the HLA A, B, C, DQA1, DQB1, and DRB1 loci. The HLA loci encode the major histocompatibility complex (MHC) proteins and are highly variable in the population.
- Decoy sequences (e.g., chrUn_KN707606v1_decoy). A major motivation for including the “decoy” sequences in the reference genome is that if a sample actually contains a genomic segment that is not in the reference assembly, aligners may spend a lot of CPU time trying to find a good match, or worse, may wrongly assign the reads with a low mapping quality to segments with similar sequences in the reference. If the segment matches one of the decoy contigs and the decoy is included in the reference, the segment is quickly assigned to the decoy, preventing the aligner from uselessly searching for other matches. The decoy sequences are thus a pragmatic solution: they contain EBV and human sequences that in effect “catch” reads which would otherwise map with low quality to other regions in the reference, avoiding unnecessary computation and false variant calls related to false mappings.
- Alternate contigs (e.g., chr3_KI270778v1_alt). Alternative sequence paths in regions with complex structural variation, in the form of additional locus sequences.
The examples below are only illustrative, as it may be preferable to use the full GWAS summary statistics in the T2D and proteogenomic cases.
We take the data reported by Locke, et al. (2015) as an example; it requires the build 37 positions from UCSC described above.
# GWAS summary statistics
wget http://portals.broadinstitute.org/collaboration/giant/images/1/15/SNP_gwas_mc_merge_nogc.tbl.uniq.gz
gunzip -c SNP_gwas_mc_merge_nogc.tbl.uniq.gz |
awk 'NR>1' | \
sort -k1,1 | \
join -11 -23 - snp150.txt | \
awk '($9!="X" && $9!="Y" && $9!="Un")' | \
gzip -f > bmi.tsv.gz
where the file containing the GWAS summary statistics is downloaded, its header dropped, the records sorted by RSid and positional information added via a join with snp150.txt, leading to a compressed file named bmi.tsv.gz; nonautosomal SNPs are also filtered out.
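As a quick sanity check, a sketch tallying SNP counts by chromosome in the merged file,
# bmi.tsv.gz is space-delimited after join; column 9 holds the chromosome
gunzip -c bmi.tsv.gz | awk '{print $9}' | sort -n | uniq -c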
The list of 97 SNPs can be extracted as follows,
R --no-save <<END
library(openxlsx)
xlsx <- "https://www.nature.com/nature/journal/v518/n7538/extref/nature14177-s2.xlsx"
snps <- read.xlsx(xlsx, sheet = 4, colNames=FALSE, skipEmptyRows = FALSE, cols = 1, rows = 5:101)
snplist <- sort(as.vector(snps[,1]))
write.table(snplist, file="97.snps", row.names=FALSE, col.names=FALSE, quote=FALSE)
END
The list of SNPs is contained in 97.snps.
As described elsewhere, we are inclined to use a distance-based approach to define independent signals,
(
awk -vOFS="\t" 'BEGIN{print "MarkerName", "A1", "A2", "freq", "Effect", "StdErr", "P.value", "N", "Chrom", "End"}'
zcat bmi.tsv.gz | \
sed 's/ /\t/g' | \
awk '(NR>1 && $7<=2.4e-7)' | \
sort -k9,9n -k10,10n
) > bmi.dat
R --no-save -q < bmi.R > bmi.out
grep -f 97.snps bmi.out | \
wc -l
with bmi.R as follows,
options(echo=FALSE)
bmi <- read.delim("bmi.dat",as.is=TRUE)
require(reshape)
require(gap)
chrs <- with(bmi,unique(Chrom))
for(chr in chrs)
{
# print(chr)
ps <- subset(bmi[c("Chrom","End","MarkerName","Effect","StdErr","P.value")],Chrom==chr)
row.names(ps) <- 1:nrow(ps)
sentinels(ps,chr,1)
}
and the output is bmi.out. One may wonder about the overlap between the two lists; the answer is 69. In this case awk has no problem handling the small P-values, i.e., with
(
awk -vOFS="\t" 'BEGIN{print "MarkerName", "A1", "A2", "freq", "Effect", "StdErr", "P.value", "N", "Chrom", "End"}'
zcat bmi.tsv.gz | \
sed 's/ /\t/g' | \
awk '
function abs(x)
{
if (x < 0) return -x;
else return x;
}
(NR>1 && abs($5/$6)>=5.165342)' | \
sort -k9,9n -k10,10n
) > bmi.dat
R --no-save -q < bmi.R > bmi.out
grep -f 97.snps bmi.out | \
wc -l
we still have 69.
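The cut-off 5.165342 is simply the Z-score corresponding to a two-sided P-value of 2.4e-7, which can be checked as follows (a sketch),
# |beta/se| >= 5.165342 is equivalent to a two-sided P <= 2.4e-7
Rscript -e 'qnorm(2.4e-7/2, lower.tail = FALSE)'
Rscript -e '2 * pnorm(5.165342, lower.tail = FALSE)'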
The T2D data were reported by Scott, et al. (2017),
R -q --no-save <<END
library(openxlsx)
library(dplyr)
xlsx <- "http://diabetes.diabetesjournals.org/highwire/filestream/79037/field_highwire_adjunct_files/1/DB161253SupplementaryData2.xlsx"
# Supplementary Table 3. Results for established, novel and additional distinct signals from the main analysis.
ST3 <- read.xlsx(xlsx, sheet = 3, colNames=TRUE, skipEmptyRows = FALSE, cols = 1:20, rows = 2:130) %>%
rename(P="p-value.in.stage.1") %>% within(
{
beta=log(OR)
L <- as.numeric(substr(CI,1,4))
U <- as.numeric(substr(CI,6,9))
se=abs(log(L)-log(U))/3.92
}) %>% select(
SNP=rsid,
A1=EA,
A2=NEA,
freqA1=EAF,
beta,
se,
P,
N=Sample.size,
chr=Chr,
pos=Position_b37
)
write.table(ST3, file="ST3", row.names=FALSE, col.names=FALSE, quote=FALSE)
# Supplementary Table 4. BMI-unadjusted association analysis model
ST4 <- read.xlsx(xlsx, sheet = 4, colNames=TRUE, skipEmptyRows = FALSE, cols = 1:12, rows = 3:132) %>% rename(
"CI"="CI.95%",
"P"="P-value") %>% within(
{
beta=log(OR)
L <- as.numeric(substr(CI,1,4))
U <- as.numeric(substr(CI,6,9))
se=abs(log(L)-log(U))/3.92
P=2*(1-pnorm(abs(beta/se)))
}) %>% select(
SNP=rsid,
A1=allele1,
A2=allele2,
freqA1=freq1,
beta,
se,
P,
N,
chr,
pos=position_b37
)
write.table(ST4, file="ST4", row.names=FALSE, col.names=FALSE, quote=FALSE)
END
where we generate data based on the paper’s supplementary tables ST3 and ST4; the former is in line with the paper (by specifying _db=depict and p_threshold=0.00001 when calling PW-pipeline).
The plasma protein data were reported by Sun, et al. (2018),
R -q --no-save <<END
require(openxlsx)
xlsx <- "https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-018-0175-2/MediaObjects/41586_2018_175_MOESM4_ESM.xlsx"
# Supplementary Table 4
ST4 <- read.xlsx(xlsx, sheet=4, colNames=TRUE, skipEmptyRows=FALSE, cols=c(6:8,11:13,23:25), rows=6:1986)
names(ST4) <- c("SNP","Chr","Pos","A1","A2","EAF","b","se","p")
write.table(ST4, file="plamsprotein", row.names=FALSE, col.names=FALSE, quote=FALSE)
END
where Supplementary Table 4 is fetched and written to the file plasmaprotein.
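The columns above differ in order from the layout at the top of this page; a sketch rearranging them (sample size is not among the extracted columns, so N is set to NA, and the output name plasmaprotein.tsv is hypothetical),
# reorder SNP Chr Pos A1 A2 EAF b se p into SNP A1 A2 freqA1 beta se P N chr pos
awk -vOFS='\t' '{print $1,$4,$5,$6,$7,$8,$9,"NA",$2,$3}' plasmaprotein > plasmaprotein.tsv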
lz.sh is a script which extracts information on SNPs and their positions from the LocusZoom 1.4 database.
It might also be worthwhile to check the sumstats format defined by ldsc, https://github.com/bulik/ldsc, in particular its munge_sumstats.py utility.
One may prefer to use the following repository to convert GWAS sumstats for downstream processing,
https://github.com/MRCIEU/gwas2vcf
The VCF files generated can be converted to the format described here with the command
bcftools query -f '%ID\t%ALT\t%REF\t%AF\t[%ES]\t[%SE]\t[%LP]\t[%SS]\t%CHROM\t%POS\n' my.vcf.gz | \
awk -vOFS="\t" '{$7=10^-$7};1'> my.tsv
The software suite now has its own liftover facility, as in https://doi.org/10.1093/bioinformatics/btae038.
An earlier repository with a specific focus on downloading sumstats is as follows,
https://github.com/mikegloudemans/gwas-download
The OpenTargets harmoniser is available from here, https://github.com/opentargets/genetics-sumstat-harmoniser.
UCSC FAQ on data file formats, https://genome.ucsc.edu/FAQ/FAQformat.html
GIANT (Genetic Investigation of ANthropometric Traits) data
Locke AE, et al. (2015) Genetic studies of body mass index yield new insights for obesity biology. Nature 518(7538):197-206. doi: 10.1038/nature14177
DIAGRAM (DIAbetes Genetics Replication And Meta-analysis) data
Scott R, et al. (2017) An Expanded Genome-Wide Association Study of Type 2 Diabetes in Europeans. Diabetes 66:2888–2902.
Protein website at Cardiovascular Epidemiology Unit (CEU)
Sun BB, et al. (2018) Genomic atlas of the human plasma proteome. Nature 558: 73-79.
GWAS catalog summary statistics API, https://www.ebi.ac.uk/gwas/summary-statistics/docs/
[^1]: Build 36, https://hgdownload.soe.ucsc.edu/goldenPath/hg18/database/ (snp[126–30].sql, snp[126–30].txt.gz)
[^2]: Alternatively, tabix can also be used as
    tabix -S38 -s1 -b2 -e2 -f snp154_hg19.vcf.gz
    tabix -S38 -s1 -b2 -e2 -f snp154_hg38.vcf.gz
[^3]: Robinson PN, Piro RM, Jager K (2018). Computational Exome and Genome Analysis, CRC.