2LD – two-locus LD calculator
JH Zhao 15/02/2002
Program description
Running 2ld
File listing
Known problems
Acknowledgements
Contact information
References
Program description
2LD is a simple program that calculates linkage disequilibrium (LD) measures between two polymorphic markers. It was developed as a supplement to an Excel calculator by Dr David Collier, adding standard errors and support for multiallelic markers.
Let m and n denote the number of alleles at markers 1 and 2. The observed data can be organized into a contingency table of m(m+1)/2 x n(n+1)/2 cells. However, the table can be parameterised by only m x n possible haplotype frequencies. As illustrated by ASSOCIATE, an ordinary chi-squared statistic has [m(m+1)/2-1][n(n+1)/2-1] degrees of freedom, and this test of independence may indicate dependence other than allelic association. The haplotype frequencies under linkage equilibrium can be obtained from the allele frequencies of each marker, while under LD they are estimated using a gene-counting procedure. Alternatively, haplotype frequency estimates from other sources can be used. In addition, the program gives a chi-squared test of the estimated haplotype frequencies as a global measure of LD, i.e. the Phi coefficient and Cramer’s V. Because this test relies on an asymptotic approximation, Fisher’s exact test may be more appropriate for assessing its significance. Finally, 2ld calculates another popular measure of LD, D’, and its standard error. As yet 2ld does not handle missing genotypes.
The computer programs ASSOCIATE and EH also implement the gene-counting procedure, and are available from http://linkage.rockefeller.edu.
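As an illustration of the gene-counting idea, here is a minimal Python sketch for two markers. It is not the code used by 2ld, ASSOCIATE or EH; the function name `gene_count`, the input layout and the fixed number of iterations are assumptions made for this example. Only double heterozygotes have ambiguous phase, so each iteration splits them between the two possible haplotype pairs in proportion to the current frequency estimates.

```python
from collections import defaultdict

def gene_count(genotypes, iterations=100):
    """Estimate two-locus haplotype frequencies by gene counting (EM).

    genotypes: list of (a1, a2, b1, b2) tuples, two alleles at marker 1
    followed by two alleles at marker 2 (hypothetical layout).
    Returns a dict mapping (allele at marker 1, allele at marker 2)
    to the estimated haplotype frequency.
    """
    n_hap = 2 * len(genotypes)
    p = defaultdict(float)  # allele frequencies at marker 1
    q = defaultdict(float)  # allele frequencies at marker 2
    for a1, a2, b1, b2 in genotypes:
        p[a1] += 1 / n_hap
        p[a2] += 1 / n_hap
        q[b1] += 1 / n_hap
        q[b2] += 1 / n_hap
    # Start from linkage equilibrium: products of allele frequencies.
    h = {(a, b): p[a] * q[b] for a in p for b in q}
    for _ in range(iterations):
        counts = defaultdict(float)
        for a1, a2, b1, b2 in genotypes:
            if a1 != a2 and b1 != b2:
                # Double heterozygote: weigh the two possible phases.
                cis = h[(a1, b1)] * h[(a2, b2)]
                trans = h[(a1, b2)] * h[(a2, b1)]
                w = cis / (cis + trans) if cis + trans > 0 else 0.5
                counts[(a1, b1)] += w
                counts[(a2, b2)] += w
                counts[(a1, b2)] += 1 - w
                counts[(a2, b1)] += 1 - w
            else:
                # Phase is unambiguous when either marker is homozygous.
                counts[(a1, b1)] += 1
                counts[(a2, b2)] += 1
        h = {key: counts[key] / n_hap for key in h}
    return h

# Made-up data: four double homozygotes and one double heterozygote.
genos = [(1, 1, 1, 1)] * 2 + [(2, 2, 2, 2)] * 2 + [(1, 2, 1, 2)]
h = gene_count(genos)
print(h)
```

A production version would stop on a convergence criterion rather than after a fixed number of iterations.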
For more details see Klitz et al. (1995), Long et al. (1995), Weir (1996) and other references listed below. To obtain permutation-based LD measures use PM or FASTEH+ at this site.
Running 2ld
2ld accepts an input file in one of three formats.
Format 1. The raw genotype data in consecutive lines, each line containing five columns as follows.
Column 1. Individual ID
Column 2, 3. Marker 1 genotypes
Column 4, 5. Marker 2 genotypes
File genotype.dat is in this format.
Format 2. Genotype table as required by the ASSOCIATE and EH programs
Line 1. <No of alleles at marker 1> <No of alleles at marker 2>
Line 2- actual two-locus genotype counts
File genotype.tab is in this format. Note that since the two markers have 14 and 2 alleles, the genotype table has 14(14+1)/2 x 2(2+1)/2 = 105 x 3 cells.
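The table size quoted above follows directly from the formulas in the program description. The short Python sketch below (function names are hypothetical, chosen for this example) computes the genotype table dimensions, the degrees of freedom of the ordinary chi-squared test on that table, and the number of haplotype frequencies for comparison.

```python
def genotype_table_shape(m, n):
    """Rows and columns of the two-locus genotype table for m and n alleles."""
    return m * (m + 1) // 2, n * (n + 1) // 2

def genotype_table_df(m, n):
    """Degrees of freedom of the ordinary chi-squared test on that table."""
    rows, cols = genotype_table_shape(m, n)
    return (rows - 1) * (cols - 1)

shape = genotype_table_shape(14, 2)
print(shape)                     # (105, 3), as for genotype.tab
print(genotype_table_df(14, 2))  # 208
print(14 * 2)                    # 28 possible haplotypes
```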
Format 3. observed haplotype frequencies from any other sources.
Line 1. <# of alleles at marker 1> <# of alleles at marker 2> <# of subjects>
Line 2- haplotype frequencies
Files genotype.eh, 2by2.dat and kbyl.dat are in this format. Note that if the haplotype frequency estimates come from EH, the column “w/association” in its output has to be used. Since 2ld was designed for diploid individuals, the number of haplotypes is equal to twice the number of subjects.
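To make the Format 3 layout concrete, the following Python sketch reads such a file from a string. It is only an illustration: the function name `read_format3` is hypothetical, and it assumes whitespace-separated values with the frequencies listed row by row, marker 1 alleles as rows, which should be checked against the distributed example files. The example values are the 21-subject data used in the SAS check under Known problems.

```python
def read_format3(text):
    """Parse Format 3 text into (haplotype frequency table, number of haplotypes)."""
    tokens = text.split()
    m, n, subjects = (int(t) for t in tokens[:3])
    freqs = [float(t) for t in tokens[3:]]
    if len(freqs) != m * n:
        raise ValueError(f"expected {m * n} frequencies, found {len(freqs)}")
    # ASSUMPTION: frequencies are in row-major order, marker 1 alleles as rows.
    table = [freqs[i * n:(i + 1) * n] for i in range(m)]
    # Diploid individuals: two haplotypes per subject.
    return table, 2 * subjects

example = """2 3 21
0.095234 0.000005 0.047618
0.238099 0.571423 0.047621
"""
table, n_haplotypes = read_format3(example)
print(n_haplotypes)  # 42
```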
The syntax of 2ld is as follows,
2ld <data file>
Following the DOS convention, the “data file” within angled brackets is required; without it the program only runs its internal example. The screen output can be redirected to an ASCII file by using “> output file name”. For a case-control study, three separate input files can be created: for cases only, for controls only, and for both combined. To run 2ld on 2by2.dat the command is
2ld 2by2.dat
to output on computer screen,
or
2ld 2by2.dat >2by2.out
to output to file 2by2.out
File listing
2ld.c is written in ANSI C and so should be portable to Unix/Linux. 2ld.exe was built from 2ld.c with the Symantec C/C++ 7.2 compiler using the command “sc -mn 2ld.c”.
Known problems
Q
2ld.exe always gives internal example if clicking 2ld.exe from Win9x, when I press
A
Since 2ld is a DOS-based program, Windows users have to open an MS-DOS Prompt first. For example, under Win9x click Start -> Program -> MS-DOS Prompt, change to the appropriate directory and issue the 2ld command described above. I put 2ld.exe in the c:\iop\2ld directory, so my commands would be
cd c:\iop\2ld
2ld 2by2.dat >2by2.out
edit 2by2.out
Alternatively I can add the 2ld directory to the search path with the DOS command
set path=c:\iop\2ld;%path%
then I can use 2ld anywhere under DOS.
Q
Why do my degrees of freedom differ from those of EH?
A
2LD calculates the degrees of freedom from the alleles actually present in the data, rather than from the user-specified numbers of alleles.
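The difference can be illustrated with a small Python sketch. How 2LD decides that an allele is present is not documented here, so the rule below (an allele is present if its marginal frequency is greater than zero) and the function name are assumptions for illustration only.

```python
def df_from_alleles_present(table):
    """(m'-1)(n'-1), counting only alleles with non-zero marginal frequency."""
    rows_present = sum(1 for row in table if sum(row) > 0)
    cols_present = sum(1 for col in zip(*table) if sum(col) > 0)
    return (rows_present - 1) * (cols_present - 1)

# Made-up 3 x 3 haplotype table in which allele 3 of marker 1 never occurs.
table = [
    [0.20, 0.10, 0.10],
    [0.30, 0.20, 0.10],
    [0.00, 0.00, 0.00],
]
declared_df = (3 - 1) * (3 - 1)           # from the user-specified allele numbers
present_df = df_from_alleles_present(table)
print(declared_df, present_df)            # 4 2
```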
Q
I get two D’s, one negative and one positive, for my biallelic data, why?
A
When two biallelic markers are involved, let p_1, p_2 be the allele frequencies at locus 1, and q_1, q_2 be the allele frequencies at locus 2. The first D’ is given under “Disequilibria, expectations and variances” as D’_{11}, i.e., D’ for haplotype 11, while the second D’ is the overall measure, obtained as follows,
D’ = p_1 q_1 |D’_{11}| + p_1 q_2 |D’_{12}| + p_2 q_1 |D’_{21}| + p_2 q_2 |D’_{22}|
For two biallelic markers the two values have the same magnitude but may differ in sign.
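The relationship between the two values can be checked numerically. The Python sketch below uses the standard Lewontin normalisation of D_{ij} by its maximum attainable magnitude, which is assumed here rather than taken from the 2ld source; the function name and the haplotype frequencies are made up for illustration.

```python
def d_prime_table(h):
    """Return (per-haplotype D' values, overall D') for a haplotype frequency table."""
    p = [sum(row) for row in h]        # allele frequencies at locus 1
    q = [sum(col) for col in zip(*h)]  # allele frequencies at locus 2
    d_prime = []
    overall = 0.0
    for i, row in enumerate(h):
        d_row = []
        for j, h_ij in enumerate(row):
            d = h_ij - p[i] * q[j]
            if d < 0:
                d_max = min(p[i] * q[j], (1 - p[i]) * (1 - q[j]))
            else:
                d_max = min(p[i] * (1 - q[j]), (1 - p[i]) * q[j])
            dp = d / d_max if d_max > 0 else 0.0
            d_row.append(dp)
            # Overall D' is the weighted sum of absolute values.
            overall += p[i] * q[j] * abs(dp)
        d_prime.append(d_row)
    return d_prime, overall

# Made-up frequencies in which haplotype 11 is rarer than expected.
h = [[0.1, 0.4],
     [0.3, 0.2]]
d_prime, overall = d_prime_table(h)
print(d_prime[0][0], overall)  # about -0.5 and 0.5
```

Here D’_{11} is negative while the overall D’ is positive, with the same magnitude.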
Q
How can I make more sense of the global chi-square?
A
The global chi-square is based on the estimated haplotype frequency table and serves as a direct test of global disequilibrium. A simple check can be done with standard packages such as SAS. For example, the following program computes the chi-square and Fisher’s exact tests for a biallelic marker and a three-allele marker from 21 diploid individuals (data kindly provided by Dr Zhicheng Lin).
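/* Haplotype frequencies for a biallelic marker (a) and a three-allele marker (b) */
data abc;
input a b w;
/* Convert frequencies to counts: 21 diploid individuals = 42 haplotypes */
w=round(w*42);
cards;
1 1 0.095234
1 2 0.000005
1 3 0.047618
2 1 0.238099
2 2 0.571423
2 3 0.047621
;
/* Chi-square and Fisher's exact test on the weighted a x b table */
proc freq;
weight w;
table a*b/chisq exact;
run;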
Unfortunately, no degrees of freedom are left for a test comparing this chi-square with that from EH, since both chi-squares have (m-1)(n-1) degrees of freedom.
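For readers without SAS, the chi-square part of this check can be sketched in plain Python. This is an illustrative equivalent, not output from 2ld; it uses the same haplotype frequencies scaled to 42 haplotypes, and also derives the Phi coefficient and Cramer’s V from the chi-square. Fisher’s exact test is not covered here.

```python
import math

def pearson_chi2(counts):
    """Pearson chi-square, degrees of freedom and total for a table of counts."""
    row_tot = [sum(row) for row in counts]
    col_tot = [sum(col) for col in zip(*counts)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(counts):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return chi2, df, n

freqs = [[0.095234, 0.000005, 0.047618],
         [0.238099, 0.571423, 0.047621]]
# 21 diploid individuals give 42 haplotypes.
counts = [[round(f * 42) for f in row] for row in freqs]
chi2, df, n = pearson_chi2(counts)
phi = math.sqrt(chi2 / n)
cramers_v = math.sqrt(chi2 / (n * min(len(counts) - 1, len(counts[0]) - 1)))
print(counts, chi2, df, phi, cramers_v)
```

With these data the counts are 4, 0, 2 and 10, 24, 2, giving a chi-square of 10.5 on 2 degrees of freedom.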
Q
I have 5 loci and 6 populations in my data set. Is it possible to use one working file and specify which population and locus pair to analyse?
A (this is now part of LDSHELL)
I had thought it would be a lot of work to do everything in one go, so I did not implement it. I have now spent more than a day integrating my program fragments, and the result is attached. As usual you will need to unzip it first. I wrote it in haste, so please let me know if there is any problem with it.
Basically you can organize your subjects as follows,
id popid m11 m12 m21 m22 m31 m32 m41 m42 m51 m52
where id is any subject id, popid is the population the subject belongs to, and the m’s are the marker genotypes/alleles.
Now put ldshell.exe in the same directory as 2LD and type
ldshell yourfilename 5
then check pop##.out for results of each population (## is the population label).
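The reorganisation involved can be sketched in Python as follows. This is not the LDSHELL implementation; the function name, the 1-based locus numbering and the sample rows are made up for illustration. It groups subjects by popid and keeps one locus pair per subject, producing lines in the five-column layout of Format 1.

```python
from collections import defaultdict

def split_by_population(lines, locus_a, locus_b):
    """Group subjects by popid, keeping one locus pair in Format 1 layout.

    lines: rows of 'id popid m11 m12 m21 m22 ...'; loci are numbered from 1.
    Returns a dict mapping popid to a list of 'id a1 a2 b1 b2' lines.
    """
    per_pop = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        subject, pop, alleles = fields[0], fields[1], fields[2:]
        pair = (alleles[2 * (locus_a - 1):2 * locus_a]
                + alleles[2 * (locus_b - 1):2 * locus_b])
        per_pop[pop].append(" ".join([subject] + pair))
    return dict(per_pop)

# Made-up rows: three subjects, two populations (0 and 1), three loci.
rows = [
    "s1 0 1 2 3 3 1 1",
    "s2 1 2 2 1 3 2 1",
    "s3 0 1 1 2 3 1 2",
]
by_pop = split_by_population(rows, 1, 3)
for pop, subjects in sorted(by_pop.items()):
    print(pop, subjects)
```

Each population's list could then be written to its own file and passed to 2ld as Format 1 input.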
Also included is a file called hla.dat, which is in the format described above. Since it has control and case information labelled as 0 and 1, the output files are pop0.out and pop1.out, which are generated using the command
ldshell hla.dat 3
To see the D’ coefficients, grep.exe is useful (under Unix/Linux both unzip and grep are built in):
grep "coefficient =" pop?.out
POP0.OUT: D’ coefficient = 0.970663, SD = 0.0090 (Var = 0.000081)
POP1.OUT: D’ coefficient = 0.824925, SD = 0.0265 (Var = 0.000701)
It is a good idea to check a few results at random to make sure they are correct.
Acknowledgements
A list of people who helped to make the program/documentation more useful.
David Collier sphadco@iop.kcl.ac.uk
Zhicheng Lin ZLIN@intra.nida.nih.gov
Mustafa Neamatallah m_neamatallah@hotmail.com
Carlos Zapata bfcazaba@usc.es
Barry Chioza spbcaja@iop.kcl.ac.uk
Maria Arranz m.arranz@iop.kcl.ac.uk
Tan Hui Hui Jenny medp9193@nus.edu.sg
Contact information
Please let me know of any problems or comments by e-mailing j.zhao@iop.kcl.ac.uk or by post to
Jing Hua Zhao
Section of Genetic Epidemiology & Biostatistics
Division of Psychological Medicine
Institute of Psychiatry
De Crespigny Park
London SE5 8AF
UK
References
Abramowitz M and Stegun IA (1968) Handbook of mathematical functions. New York
Bishop YMM, Fienberg SE, Holland PW (1975) Discrete Multivariate Analysis – Theory and Practice, The MIT press
Cramer H (1946) Mathematical Methods of Statistics. Princeton Univ. Press
Klitz W, Stephens JC, Grote M, Carrington M (1995) Discordant patterns of linkage disequilibrium of the peptide transporter loci within the HLA class II region. Am. J. Hum. Genet. 57:1436-1444
Long JC, Williams RC, Urbanek M (1995) An E-M algorithm and testing strategy for multiple-locus haplotypes. Am. J. Hum. Genet. 56:799-810
Weir BS (1979) Inferences about linkage disequilibrium. Biometrics 35:235-254
Weir BS (1996) Genetic Data Analysis II. Sinauer
Xie X, Ott J (1993) Testing linkage disequilibrium between a disease gene and marker loci. Am. J. Hum. Genet. 53:1107
Zapata C, Alvarez G, Carollo C (1997) Approximate variance of the standardized measure of gametic disequilibrium D’. Am. J. Hum. Genet. 61:771-774
Zapata C, Carollo C, Rodriguez S (2001) Sampling variance and distribution of the D’ measure of overall gametic disequilibrium between multiallelic loci. Ann. Hum. Genet. 65:395-406