HaploBlockFinder is distributed under the GNU GPL license. It can be
freely distributed and modified (I hope to be informed in this case),
but
WITHOUT ANY WARRANTY.
INSTALLATION
The codes were written with ANSI C, and has been successfully compiled
under several operating systems, including Tru64 Unix, Solaris, Linux
and Windows.
Here are two examples:
(1) Tru64 Unix with cc:
cc HaploBlockCore.c HaploBlockFinder.c
HaploBlockReport.c -o haploBlockFinder -lm -fast
(2) Other Unix with cc:
cc HaploBlockCore.c HaploBlockFinder.c
HaploBlockReport.c -o haploBlockFinder -lm -O3
(3) Unix with gcc:
gcc HaploBlockCore.c HaploBlockFinder.c
HaploBlockReport.c -o haploBlockFinder -lm -O3
The package also includes three perl scripts for format conversion and
diagram
generation, and another one to run the program on a web server. In
order to
use the drawBlocks.pl and the showLDmatrix.pl to draw a graphical
diagram,
you need Perl and two perl modules, GD and Getopt::Long. In windows,
you
may download ActivePerl from http://www.activestate.com/ , install Perl
and
install the GD and Getopt::Long using the ppm3 program distributed with
ActivePerl.
For Unix system, you need to download the GD and Getopt::Long module
from
http://www.cpan.org , or install it using the cpan utility. In some
Linux
system, the default Perl might have some problems, in which case the
Perl
need to be recompiled with the latest source code. For the detailed
usage
of these perl script, read the documentation imbeded in the perl
scripts
using the perldoc utility.
HOW TO USE IT
To run haploBlockFinder, a text file contains haplotype data is
required
as input. Haplotypes should be in a FASTA like format as below:
>haplotype1
<---- name of haplotype, less than 50 characters
AGTCGTGTGTCTGTGCTGTG
<-----haplotype
Haplotypes can be represented as base code (A,C,G,T,N,I [insert],D
[deletion])
or numerical code (0, 1, 2, 3...9; 0 represents ambiguous call in this
case).
An example haplotype files "sample_haplotypes_fasta.txt" is included.
An
optional file contains locus ID and locus position can be also
submitted
to the program, so that some statistics, like average block size can be
easily
calculated from the report files. This optional file should have only
two
column: The first one is the locus ID, and the second one is the locus
position
(bp). If this position file is not provided, the program assumes a
default
density of 1 SNP/Kb. There should be NO HEADER line in the file. See
"locusInfo.txt"
for example.
A note on haplotype format:Starting from Version
0.4,
haplotype
data in a format, such as the one generated by Merlin (see
sample_haplotypes_Merlin.txt),
is accepted. Just run the wrapper program hbfWrapper.pl instead of
the
haploBlockFinder
(see Example #5). From Version 0.6, the web-based
haploBlockFinder
supports the haplotype file generated by Arlequin and PHASE.(If you
wish to use the downloaded
program on your local machine, there are two subroutines in the
haploBlockFinder.cgi,
called Arlequin2Fasta and Phase2Fasta, for reformatting, which can be
easily
pulled out and used as
a standalone program).
Arguments accepted by this program are:
-A: block identification algorithm
1 Minimal |D'| range: A block is
defined as a region in which pair-wise |D’| values between
all SNPs pairs exceed a certain threshold
2 Chromosomal coverage: A block is
defined as a region in which certain percentage of chromosomes is
represented by less than 4 common haplotypes
3 Four-gamete test: A block is
defined as a region in which no within block recombination is found
based on the four-gamete test
4
Minimal d^2 range: A block is defined as a region in which
pair-wise
d^2 values between all SNPs pairs exceed a certain threshold
5
Minimal LLR range: A block is defined as a region in which
pair-wise
LLR values between all SNPs pairs exceed a certain threshold
6 Minimal r^2 range: A block is
defined as a region in which pair-wise r^2 values between
all SNPs pairs exceed a certain threshold
-D: threshold value of LD measurements, default
value
is 0.8 for |D'|, 0.3 for d^2 and 1.0 for LLR. (Optional)
-C: threshold value of chromosome coverage, default
value is 0.8. (Optional)
-Q: threshold value for the quality filter, default
value is 0.5. The quality filter. (Optional)
masks loci that has
ambiguous/total ratio exceeding the given threshold, since
these loci tends to cause
adverse effect on block partitioning.
-I: file that contains haplotypes
-L: file that contains locus name and position
(Optional)
-O: directory for report files. (Optional)
-H: higher bound of minor allele frequency for the allele
frequency filter (Optional)
-B: lower bound of minor allele frequency for the
allele frequency filter (Optional)
-M: LD matrix is generated if set as 1. (Optional,
required for showLDmatrix.pl)
-T: Identify htSNPs if set as 1. (Optional, could be
slow if the number of SNPs in a block is large)
-P: portion of chromosomes that can be distinguished
with htSNPs, default value is 0.8 (Optional)
Note on haplotype block finding based on chromosome coverage:
This program excludes ambiguous haplotypes (haplotypes with more than
one "N") for the calculation of chromosome coverage. Besides, blocks
that have over 50% of chromosomes with ambiguous haplotypes will not be
evaluated.
Example 1:
[kzhang@cgi kzhang]$./haploBlockFinder -A1 -D0.9
-Isample_haplotypes_fasta.txt
-LlocusInfo.txt -Orpt
To find blocks with minimal LD range using 0.9 as threshold, haplotype
file is
"sample_haplotypes_fasta.txt", and the locus informatio file is
"locusInfo.txt",
the report files
will be in the ./rpt directory. Note that this directory should be
created first, otherwise
the program won't run correctly.
Example 2:
[kzhang@cgi kzhang]$./haploBlockFinder -A2 -C0.7
-Isample_haplotypes_fasta.txt
To find blocks with chromosome coverage threshold as 0.7, haplotype
file is
"sample_haplotypes_fasta.txt". The report files will be in the current
directory.
Example 3:
[kzhang@cgi kzhang]$./haploBlockFinder -A3 -Q0.6
-Isample_haplotypes_fasta.txt
To find blocks with four-gamete test, haplotype file is
"sample_haplotypes_fasta.txt". Loci that have
over 60% of ambiguous calls are masked.
Example 4:
[kzhang@cgi kzhang]$./haploBlockFinder -A1 -D0.8
-Isample_haplotypes_fasta.txt
-LlocusInfo.txt -B0.15 -M1 -T1
To find blocks with minimal LD range using 0.8 as threshold, haplotype
file is "sample_haplotypes_fasta.txt", and the locus informatio file is
"locusInfo.txt",
common SNPs with minor allele frequency higher than 15% are selected,
LD matrix
is generated, and htSNPs are identified.
Example 5:
[kzhang@cgi kzhang]$./hbfWrapper.pl -A4 -D0.5
-Isample_haplotypes_Merlin.txt
-M1
To find blocks with minimal d^2 range using 0.5 as threshold, haplotype
file is in a format generated by Merlin, and the locus informatio file
is "locusInfo.txt", LD matrix is generated.
HOW TO READ REPORTS
The program will generate report in text files (tab delimited), and
two graphic files:
(1)haploblock_blocks.txt: report the features and
statistics for each blocks.
Details for each column:
(a)Block ID: Internal block ID
(b)Start Locus: The first locus
of a block
(c)End Locus: The last locus of
a block
(d)#loci: The number of loci
within the block
(e)Start Pos: The starting
chromosomal position (bp) of the block, represented by the position of
the first loci
(f)End Pos: The ending chromosomal
position (bp) of the block, represented by the position of the last loci
(g)Length: Length of the block
(h)minimal D': The minimum pair-wise D' value
between
any pair of loci with the block
(i)minimal d^2: The minimum pair-wise d^2 value
between
any pair of loci with the block
(j)minimal r^2: The minimum pair-wise r^2 value
between any pair of loci with the block
(k)minimal LLR: The minimum pair-wise LLR (log
likelihood ratio) between any pair of loci with the block
(l)Ho/He: The ratio between
observed heterozygosity and expected heterozygosity. This measurement
can represent
how diversity is
reduced within a haplotype block. Please note that in this application
heterozygosity is calculated based on haplotype frequency NOT the
average of heterozygosity
of
all loci within the block. Here is an example of the calculation:
Suppose we
have a three-locus block with three observed haplotypes:
AAA 65
TTT 29
TAA 2
Then the
haplotype frequencies are:
p(AAA) = 65/96
= 0.677
p(TTT) = 29/96
= 0.302
p(TAA) = 2/96
= 0.021
Ho =
1-p(AAA)^2 - p(TTT)^2 - p(TAA)^2 = 0.5495
The
measurement He is calculated assuming the three loci is in complete
equilibrium.
In this case,
p(AAA) =
p1(A)*p2(A)*p3(A) = 0.677*0.698*0.698 = 0.3298
p(AAT) =
p1(A)*p2(A)*p3(T) = 0.677*0.698*0.302 = 0.1427
p(ATA) =
p1(A)*p2(T)*p3(A) = 0.677*0.302*0.698 = 0.1427
p(ATT) =
p1(A)*p2(T)*p3(T) = 0.677*0.302*0.302 = 0.0617
p(TAA) =
p1(T)*p2(A)*p3(A) = 0.323*0.698*0.698 = 0.1574
p(TAT) =
p1(T)*p2(A)*p3(T) = 0.323*0.698*0.302 = 0.0681
p(TTA) =
p1(T)*p2(T)*p3(A) = 0.323*0.302*0.698 = 0.0681
p(TTT) =
p1(T)*p2(T)*p3(T) = 0.323*0.302*0.302 = 0.0295
He =
1-p(AAA)^2-p(AAT)^2-p(ATA)^2-p(ATT)^2-p(TAA)^2-p(TAT)^2-p(TTA)^2-p(TTT)^2
= 0.8118
Finally, we
have,
Ho/He = 0.5495/0.8118 = 0.6769
(l) # HtSNPs: Minimal number of
tagging
SNPs that can distinguish each haplotypes a block. Please refer to the
log at 11-02-2002 regarding the detailed information of htSNP
selection!
Note that in order to calculate the average block
size, just load this file into a spreadsheet program,
add a new column which is
defined as column(f)-column(e)+1, and take an average of the new
column.
Locus information file is
required in this case so that the program knows chromosomal position for
each locus.
(2)haploblock_haplotypes.txt: report all haplotype for
each block. The haplotypes are presented in two format: (i)haplotype
with
original coding; (ii) haplotypes with internal numerical coding, where
0
represents ambiguous call, 1 represents the allele with higher
frequency,
and 2 represents the rare allele. The number of chromosomes represents
by
each haplotype is also reported.
(3)haploblock_map.txt: report the position of blocks.
(4)haploblock_chromosomes.txt: A report summurizes all
haplotypes
in the input file, and show these haplotypes in block, as well as using
HtSNPs
only.
(5)blocks.png: a graphical diagram of haplotype blocks.
Up to four common haplotypes are shown, and the colors represent their
frequencies..
(6)LD_matrix.png: a graphical diagram to compare LD
pattern with haplotype blocks. Please note that if the position of SNPs
are not
submitted to the haploBlockFinder program, this Perl script assumes all
SNPs are evenly spaced.
In this diagram, the set of
haplotype
blocks identified by haploBlockFinder are shown on the top side and the
left
side. Short little white lines on the top and left represents the
position
of SNPs. The top right part of the matrix represent the Log Likelihood
Ratio
(LLR) between observed haplotype frequencies and expected haplotype
frequencies under the assumption of equilibrium. The bottom left part
of the matrix
represent the pair-wise |D'| values.
LLR indicates the
significant
level of association between two loci. It is calculated as below:
LLR =
n_AB*log10(P_AB/P_A*P_B) + n_Ab*log10(P_Ab/P_A*P_b) +
n_aB*log10(P_aB/P_a*P_B) + n_ab*log10(P_ab/P_a*P_b)
(7)log.txt: If you have problem in using the program on
our
website, read this file first!
History:
04-22-2004 The version
0.7 was released. There are two major changes in V0.7 compared with
V0.6c:
(1) One
additional LD statistics r^2 was included for block searching.
(2) The
showLDMatrix.pl program has been rewritten, such that any combination
of two out of the four LD statistics (|D'|, d^2, r^2, LLR) can be
displayed in a single diagram.\
Besides, the
report file haploblock_blocks.txt has one additional column for r^2
values.
04-16-2004 A bug that
leads to incorrect color coding in the LD diagram for all LLR values
was reported by Dr. Roger Vallejo (rvallejo@psu.edu). It has been fixed.
11-24-2003 A bug that
leads to "segmentation fault" with the -T option is not used was
identified and fixed.
05-19-2003 The version
0.6 was released, with a few changes:
(1)
Two measurements for linkage disequilibrium (d^2 and LLR) were included
for block searching, because there has been quite a number of reports
showing that |D'| might not be a good measurement for association
study. LLR and d^2 were incorporated into this program so that it
is very convenient to compare blocks defined by different LD
measurements. However, I have not seen any publication using them for
block definitions. So use them cautiously.
(2)
Inferred haplotypes generated by the PHASE program is accepted by
haploBlockFinder in this version (suggested by Dr. Cyrus Zabetian
<zabetian@u.washington.edu>). Files generated by other haplotype
inference program will be supported later (if I receive any request).
(3)
The V0.6 program generated another report file
(haploblock_chromosomes.txt) so that interpretation of results could be
a little bit easier. (suggested by Dr. Kostya Krutovskii
<kkrutovs@dendrome.ucdavis.edu>)
12-16-2002 A bug was
identified, which could removed the last singleton block (a block with
one SNP only). The problem was fixed.
11-02-2002 (1)The major
difference between version 0.4 and version 0.5 is the selection of
htSNPs. The goal of selecting htSNPs is to find a minimal set of SNPs
that can uniquely distinguish common haplotypes. In most cases, there
are more than one sets of htSNPs that meet the requirement. In version
0.4, the program simple report the first one it gets. In version 0.5,
two considerations were taken into account in order to select the best
one out of all candidate sets. Firstly, although several sets of htSNPs
may meet the same criterion (distinguishing common haplotypes), the
power for association study can be different among these sets. To find
the one with highest power, candidate sets are evaluated based on r^2,
which represents absolute level of linkage disequilibrium. For each
candidate set, the minimal pair-wise r^2 between all pair of tagged
SNPs and untagged SNP are calculated, then we choose the set with
highest min(r^2). However, even with this criteria, chances are that
there are still more than one good candidates. Therefore, we introduced
a third criteria, referred to a position score. The idea behind this is
pretty straight forward. HtSNPs will be used to map variant sites
associated with diseases. In the process of identifying htSNPs, we only
studied the SNP markers we typed. Is there any chance of losing the
power for those untyped markers? That is possible! An intuitive
solution to reduce such chances (no theoretical proof yet) would be to
select htSNPs that evenly spaced in the genome. The position score
proposed here is a composite index to measure the distribution of
htSNPs. In the ideal case, all htSNPs are evenly spaced across the
whole block. Therefore, the average distance between adjacent markers
would be L/(n-1), where L refers to the length of block, and n refers
to the number of htSNPs. We also denote the distance between adjacent
markers as X, then the positional score is |Mean(X) - L/(n-1))| +
STD(X). A lower score indicates a distribution closer to the ideal
case. In summary, we proposed three criteria for htSNPs (1)minimum
number of SNP; (2)max(min(r^2)); (3)min(position score). The three
criteria are ranked as their order. Criterion with lower rank will be
used only when there are multiple candidates that meet the criteria
with higher rank.
(2)A
bug that could cause problem in handling ambiguous base call was fixed.
10-21-2002 A few changes
were made on the subroutine for the htSNP identification, so that it
can not only find htSNPs that can distinguish all haplotypes, but can
also identify those distinguishing a certain percentage of haplotypes.
10-18-2002 hbfWrapper.pl
is released in the V0.4. It is basically a perl script to convert
haplotype files generated by Merlin to the fasta format required by the
program haploBlockFinder. Although the fasta format we initially
supported is very efficient to handle, the other format popular in the
genetic community, such as the one generated by Merlin, has a major
advantage in that it can handle in/del loci with more than one bases.
With this perl script, format conversion can be done automatically.
10-12-2002 Identification
of tagging SNPs for haplotype blocks has been implemented. Please note
that there is no clear conclusion regarding what htSNPs should be
chosen so far! Therefore, this program identifies a minimum set of SNP
that can distinguish ALL haplotypes in a block.
10-10-2002 Two type of
variant sites, insertion and deletion are supported.
10-06-2002 The 0.3
version is released. There are two major changes:
(1)
An allele frequency filter is implemented, so that analysis can be
focused on SNPs in some range of the allele frequency spectrum.
(2)
showLDmatrix.pl is released. It can draw a matrix-like diagram in order
to compare the haplotype blocks by this program and the pair-wise LD
patterns. Two pair-wise LD measurements are shown, D' and LLR. LLR is
the log likelihood ratio between observed haplotypes and expected
haplotypes assuming linkage equilibrium.
09-21-2002 A bug is found
in drawBlocks.pl, which results in miscalculation in canvas height in
compact mode. The problem is fixed.
09-11-2002 Another bug
(when there are many loci, the color allocation may run out of space
and all haplotypes in the rear part have the same color) in
drawBlocks.pl is found and fixed.
08-30-2002 Two bugs (one
related to the HSV color scheme and the other related to the compact
display mode) were fixed. A new option is added to display more than 4
haplotypes.
06-25-2002 drawBlocks.pl
is released.
06-14-2002 The first
version of this program is released.
PROBLEMS AND COMMENTS
Please address any comment or problem to:
kzhang@genetics.med.harvard.edu