This tutorial demonstrates a series of commands for analyzing a GWAS data set with Encore. In particular, it shows how to:
- Prune a data set using Linkage Disequilibrium
- Filter a data set using linear modeling and Evaporative Cooling techniques
- Construct a reGAIN matrix to analyze interactions between SNPs
- Create a .sif file to be visually analyzed in Cytoscape
- Determine the most relevant SNPs to the phenotype from the data set using SNPrank
Encore enables users to take a SNP data set and apply a variety of statistical and computational methods to better understand the contributions of SNPs, genes, and pathways to the onset of a particular phenotype. ReGAIN and SNPrank, novel components of Encore, are used to build a SNP-SNP interaction matrix and determine the most relevant SNPs to the particular phenotype, respectively. Additionally, since file types are consistent throughout the software packages, Encore implements many PLINK and EC library commands and procedures.
Once Encore has been installed, the command
encore --help
displays the following usage screen:
Encore - a tool for analysis of GWAS and other biological data.
Usage: encore -i snpdata.ped [mode] -o output-prefix:
  -i [ --input-file ] arg         Input GWAS file (.bed or .ped) or GAIN/reGAIN matrix (tab- or comma-separated)
  -o [ --output-prefix ] arg      Prefix to use for all output files
  -n [ --numeric ] arg            Numeric file for quantitative data (uses PLINK covariate file format)
  -d [ --data-summary ]           Simply print input file stats (for PLINK .ped/.bed files) *mode*
  -s [ --snprank ]                Perform SNPrank analysis *mode*
  --gamma arg (=0.85)             SNPrank algorithm damping factor
  -g [ --gain ]                   Calculate GAIN *mode*
  -r [ --regain ]                 Calculate regression GAIN *mode*
  --compress-matrices             Write binary (compressed) reGAIN matrices
  --sif-threshold arg (=0.05)     Numerical cutoff for SIF file (generated by reGAIN) interaction scores
  --fdr-prune                     FDR prune reGAIN interaction terms
  --fdr arg (=0.5)                FDR value for BH method applied to reGAIN
  -e [ --ec ]                     Perform Evaporative Cooling (EC) analysis *mode*
  --ec-algorithm arg              EC ML algorithm (all|rf|rj)
  --ec-snp-metric arg             EC SNP metric (gm|am)
  --ec-num-target arg             EC target number of attributes to keep
  --extract arg                   Extract list of SNPs from specified file
  --remove arg                    Remove list of individuals from specified file
  --keep arg                      Keep list of individuals from specified file
  --exclude arg                   Exclude list of SNPs
  --prune                         Remove individuals with missing phenotypes
  --covar arg                     Include covariate file in analysis
  --pheno arg                     Include alternate phenotype file in analysis
  --make-bed                      Make .bed, .fam and .bim *mode*
  --ci arg (=0.95)                Confidence interval for CMH odds ratios
  --assoc                         Case/control, QTL association *mode*
  --linear                        Test for quantitative traits and multiple covariates *mode*
  --logistic                      Test for disease traits and multiple covariates *mode*
  --model                         Cochran-Armitage and full-model C/C association *mode*
  --model-trend                   Use CA-trend test from model *mode*
  --model-gen                     Use genotypic test from model *mode*
  --model-dom                     Use dominant test from model *mode*
  --model-rec                     Use recessive test from model *mode*
  --freq                          Allele frequencies *mode*
  --counts                        Modifies --freq to report actual allele counts
  --missing                       Missing rates (per individual, per SNP) *mode*
  --missing-genotype arg (="0")   Missing genotype code
  --maf arg (=0.0)                Minor allele frequency
  --geno arg (=1)                 Maximum per-SNP missing
  --mind arg (=1)                 Maximum per-person missing
  --hwe arg (=0.001)              Hardy-Weinberg disequilibrium p-value (exact)
  --hwe2 arg (=0.001)             Hardy-Weinberg disequilibrium p-value (asymptotic)
  --1                             0/1 unaffected/affected coding
  --filter-founders               Include only founders
  --map3                          Specify 3-column MAP file format
  --no-sex                        PED file does not contain column 5 (sex)
  --allow-no-sex                  Do not set ambiguously-sexed individuals missing
  --no-parents                    PED file does not contain columns 3,4 (parents)
  --no-fid                        PED file does not contain columns 1 (family ID)
  --r                             Pairwise SNPxSNP LD (r) *mode*
  --r2                            Pairwise SNPxSNP LD (r^2) *mode*
  -l [ --ld-prune ]               Linkage disequilibrium (LD) pruning *mode*
  -h [ --help ]                   display this help screen
This help page displays the various command-line parameters supported by Encore. As can be seen in the text, several commonly used arguments have convenient short versions, e.g.
-i
for
--input-file
. The rest of the tutorial illustrates the usage of several of these parameters on a sample data set. To follow the example, create a directory that contains three files, a .bed, a .fam, and a .bim, all sharing the same prefix. This tutorial assumes the file prefix
testing
for the test data files.
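Before running any of the commands that follow, it can help to confirm that all three files of the binary file set are in place. The following is a minimal shell sketch, not part of Encore: the function name check_fileset is hypothetical, and the empty files created in a scratch directory only stand in for real data.

```shell
# check_fileset PREFIX: report whether PREFIX.bed, PREFIX.bim and PREFIX.fam all exist.
# (Hypothetical helper for illustration; not part of Encore.)
check_fileset() {
    prefix=$1
    missing=0
    for ext in bed bim fam; do
        if [ ! -f "$prefix.$ext" ]; then
            echo "missing: $prefix.$ext"
            missing=$((missing + 1))
        fi
    done
    # Exit status is the number of missing files (0 means the set is complete).
    return "$missing"
}

# Demo in a scratch directory with empty placeholder files.
scratch=$(mktemp -d)
touch "$scratch/testing.bed" "$scratch/testing.bim" "$scratch/testing.fam"
if check_fileset "$scratch/testing"; then
    fileset_status="complete"
else
    fileset_status="incomplete"
fi
echo "file set is $fileset_status"
```

With real data, the same function could be called as check_fileset testing from the directory that holds the files.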
One of the features that can be applied to a raw data set is Linkage Disequilibrium (LD) pruning. This feature uses the original PLINK library routine and can be used to generate a smaller, less redundant subset of SNPs from the original data set. The command-line execution is straightforward:
encore -i testing.bed --ld-prune -o LDlist
In the command above, the --ld-prune argument performs the LD-pruning procedure. While the
-i
(
--input-file
) and
-o
(
--output-prefix
) are fairly self-explanatory, a noteworthy feature is that the input file is specified with its file extension. While other libraries (such as PLINK) do not always require the extension, including it lets Encore determine the file type from the file name. We find that separate flags for plaintext and binary files (
--file
and
--bfile
in PLINK) can be confusing and frustrating, and a single
--input-file
flag can intelligently handle several types of files.
The output of this data pruning is two files:
LDlist.prune.in
and
LDlist.prune.out
. The first file is simply a list of SNPs that passed the LD prune while the
.out
file is a list of all the SNPs that were excluded during the pruning.
These files prove useful for reconstructing one’s data set. Arguments borrowed from PLINK can be used for this task by either extracting or excluding a given SNP list. The example command shown below uses --extract with the SNPs that passed the LD prune (in
LDlist.prune.in
), but an analogous command could easily be written using
--exclude LDlist.prune.out
, which removes the SNPs that failed the LD pruning. Though these two parameters function differently, they produce an identical data set.
encore -i testing.bed --extract LDlist.prune.in --make-bed -o LDdata
While the possibility exists that an LD prune could remove a SNP that would be useful for further analysis, this method remains beneficial in beginning an analysis.
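The equivalence between extracting the SNPs that passed the prune and excluding the SNPs that failed it can be illustrated on plain SNP lists, without running Encore. In this sketch the SNP names and file contents are made up, and grep only mimics the keep/drop logic on a list of IDs; it is not how Encore or PLINK filter genotype data.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Toy stand-ins: all SNP IDs in the data set, and the two LD-prune lists.
printf '%s\n' rs1 rs2 rs3 rs4 rs5 > all_snps.txt
printf '%s\n' rs1 rs3 rs5 > LDlist.prune.in
printf '%s\n' rs2 rs4 > LDlist.prune.out

# "Extract" keeps only IDs that appear in the .prune.in list.
grep -Fxf LDlist.prune.in all_snps.txt > extracted.txt

# "Exclude" drops every ID that appears in the .prune.out list.
grep -Fvxf LDlist.prune.out all_snps.txt > excluded.txt

# Both routes leave the same SNPs when the two lists partition the data set.
if cmp -s extracted.txt excluded.txt; then
    echo "identical SNP sets"
else
    echo "SNP sets differ"
fi
```

The two results match only because every SNP appears in exactly one of the two lists, which is what the LD prune output provides.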
After pruning a data set, applying a filter reduces the number of SNPs further, which speeds up the other features in this package. One method of filtering relies on a linear pairwise calculation, which can optionally accept a covariate file (an example of using a covariate file in a filter is shown later in this tutorial). For example, one may use a filter on the data to remove extraneous SNPs from the LD-pruned data set. This filter can be executed with the following command:
encore -i LDdata.bed --linear -o lfilteredLDdata
Another parameter,
--pheno
, allows for a file such as
testing-altpheno.txt
to be used in the command if the phenotype stored with the binary file set (in the
.fam
file) differs from the one desired. This parameter can be easily integrated into the method of filtering shown above, as well as with many other commands. Modifying the previous command to account for the alternate phenotype file would yield:
encore -i LDdata.bed --linear --pheno testing-altpheno.txt -o lfilteredLDdata2
A key feature of this method of filtering is the ability to incorporate covariate files in the filtering. For example, the file
testing.cov
is a numeric covariate file that could be included in the filtering process. The modification of the original filtering command would yield:
encore -i LDdata.bed --covar testing.cov --linear -o lfilteredLDdata3
which produces a filtered file that can be used in subsequent steps.
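Because covariate files follow the PLINK covariate format (family ID and individual ID in the first two columns, followed by the covariates), a quick check that every covariate row refers to an individual in the .fam file can catch mismatches before filtering. The following is a sketch on invented toy data and assumes a covariate file without a header line.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Toy .fam file: family ID, individual ID, father, mother, sex, phenotype.
cat > testing.fam <<'EOF'
F1 I1 0 0 1 2
F2 I2 0 0 2 1
F3 I3 0 0 1 2
EOF

# Toy covariate file: family ID, individual ID, one covariate (no header).
cat > testing.cov <<'EOF'
F1 I1 34.5
F2 I2 51.0
F9 I9 47.2
EOF

# First pass (NR == FNR) records every FID/IID pair from the .fam file;
# second pass prints covariate rows whose pair was never recorded.
awk '{ key = $1 FS $2 }
     NR == FNR { seen[key] = 1; next }
     !(key in seen) { print $1, $2 }' testing.fam testing.cov > unmatched.txt

echo "covariate rows without a matching individual:"
cat unmatched.txt
```

The same check applies to an alternate phenotype file, which also starts with the family and individual ID columns.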
A different filtering mechanism unique to Encore is the Evaporative Cooling method, or EC. While EC offers a more sophisticated filtering algorithm, it does not currently allow covariate data to be considered in the filtering process. EC can be run on the previously generated LD-pruned data set with the following command:
encore -i LDdata.bed --ec -o ec-out
This command produces
ec-out.ec
which contains two columns: one with the list of SNPs and a corresponding column that lists the EC score. This file can be easily modified to produce a top SNP list to be used in subsequent methods. However, EC offers an additional parameter
--ec-num-target
that can filter the results to the top X SNPs, where X is the argument to the parameter. For instance, if one wanted the top 1,000 SNPs from the
LDdata
data set, the following command would produce a file similar to
ec-out.ec
but would only include the top 1,000 SNPs and their corresponding scores:
encore -i LDdata.bed --ec --ec-num-target 1000
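When a full EC result file already exists, the top SNPs can also be cut out by hand with standard tools. The sketch below runs on an invented toy file and rests on two assumptions not confirmed by this tutorial: that the score is in the first column and the SNP name in the second, and that a higher score means a more relevant SNP.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Toy stand-in for ec-out.ec; assumed layout: score <tab> SNP name.
printf '0.12\trs1\n0.87\trs2\n0.45\trs3\n0.91\trs4\n0.05\trs5\n' > ec-out.ec

top_n=3

# Sort numerically on the score column, highest first, and keep the top rows.
# LC_ALL=C makes sort read "." as the decimal point regardless of locale.
LC_ALL=C sort -k1,1nr ec-out.ec | head -n "$top_n" > ec-top.ec

cat ec-top.ec
```

If the columns turn out to be in the opposite order, the sort key would need to change from -k1,1nr to -k2,2nr.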
The next step in the process involves the creation of the GAIN (Genetic Association Interaction Network). In particular, regression GAIN, or reGAIN, creates a matrix that measures the interaction between each pair of SNPs through a linear or logistic regression computation. Though measures can be taken to reduce the size of the resulting matrix, computing the interactions among more than 10,000 SNPs is not advised, as the computational and memory requirements can be taxing. To create a reGAIN matrix on LDdata using the top 1,000 SNPs generated from the EC filter, the following command is used:
encore -i LDdata.bed --regain --extract ec_topsnps.txt -o ec-top1k
, where
ec_topsnps.txt
is the second column from the
ec-out.ec
file. (This text file can be easily created using the command
awk '{print $2}' ec-out.ec > ec_topsnps.txt
)
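Before passing a SNP list to --extract, it can be worth confirming that every listed SNP actually occurs in the data set. In a PLINK .bim file the SNP identifier is the second column, so the check reduces to a two-file awk pass. The sketch below uses invented toy files in a scratch directory.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Toy .bim file: chromosome, SNP ID, genetic distance, position, allele 1, allele 2.
cat > LDdata.bim <<'EOF'
1 rs1 0 1000 A G
1 rs2 0 2000 C T
2 rs3 0 1500 G A
EOF

# Toy SNP list, one ID per line, in the form passed to --extract.
printf '%s\n' rs1 rs3 rs7 > ec_topsnps.txt

# Record every SNP ID from the .bim file (column 2), then print the list
# entries that are absent from it.
awk 'NR == FNR { in_bim[$2] = 1; next }
     !($1 in in_bim) { print $1 }' LDdata.bim ec_topsnps.txt > not_in_data.txt

echo "SNPs in the list but not in LDdata.bim:"
cat not_in_data.txt
```

An empty not_in_data.txt means that every SNP in the list is present in the data set.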
This command will produce several output files, notably
ec-top1k.regain
and
ec-top1k.sif
. While the
.regain
file will be re-used in examples below, the
.sif
file can be imported as a network file in Cytoscape. Cytoscape creates a visual representation of the data generated from the reGAIN analysis, enabling the user to visualize the pairwise interactions between SNPs in a more natural setting.
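A quick summary of the network size can be useful before loading the file into Cytoscape. The sketch below assumes the common three-column SIF layout (source node, edge label, target node) and uses an invented toy file; the exact contents of the edge-label column written by Encore are an assumption here.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Toy SIF file; assumed layout: source node, edge label, target node.
printf 'rs1\t0.31\trs2\nrs1\t0.08\trs3\nrs2\t0.12\trs4\n' > ec-top1k.sif

# Every line describes one edge.
n_edges=$(awk 'END { print NR }' ec-top1k.sif)

# Collect the distinct node names from the first and third columns.
n_nodes=$(awk '{ print $1; print $3 }' ec-top1k.sif | sort -u | awk 'END { print NR }')

echo "edges: $n_edges, nodes: $n_nodes"
```

A very large edge count suggests raising the cutoff with the --sif-threshold option listed in the usage screen before importing the network.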
To further reduce the size of the reGAIN matrix output file, the command argument
--compress-matrices
can be employed, producing the output file
ec-top1k.regain.gz
from the command:
encore -i LDdata.bed --regain --compress-matrices --extract ec_topsnps.txt -o ec-top1k
The binary or compressed reGAIN matrix file is essentially the same file format as the plaintext version, but uses roughly one-third of the disk space.
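If the compressed matrix is an ordinary gzip file, as the .gz suffix suggests (an assumption, not something stated in this tutorial), its content can be inspected or compared with standard tools. The sketch below demonstrates the round trip on an invented toy matrix rather than on real Encore output.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Toy plaintext matrix standing in for ec-top1k.regain.
printf 'rs1\trs2\trs3\n0.5\t0.1\t0.2\n0.1\t0.7\t0.3\n0.2\t0.3\t0.4\n' > ec-top1k.regain

# Compress a copy, keeping the original for comparison.
gzip -c ec-top1k.regain > ec-top1k.regain.gz

# Decompress to standard output and compare byte-for-byte with the original.
if gzip -dc ec-top1k.regain.gz | cmp -s - ec-top1k.regain; then
    roundtrip="same"
else
    roundtrip="different"
fi
echo "decompressed content is $roundtrip"
```

On real output, gzip -dc ec-top1k.regain.gz | head would show the first rows of the matrix, provided the gzip assumption holds.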
Once the reGAIN matrix has been created, the next step in this progression is to run SNPrank, another novel feature of Encore. SNPrank uses the
.regain
file to determine which SNPs are most relevant to the particular phenotype. The following command takes the compressed reGAIN matrix computed in the last example and creates a list ranking the most significant SNPs:
encore -i ec-top1k.regain.gz --snprank -o ranked-top1k
This command produces the file
ranked-top1k.snprank
, which displays the top SNPs associated with the phenotype.
By mapping these SNP names to the genes that contain them, one can quickly determine the genes associated with a particular phenotype in a data-driven analysis. This can be accomplished using our convenient snp2gene web service. Simply copy and paste the SNPs to map (one per line), hit search, and the results will be displayed. This service is best used for a handful of SNPs at a time, 100 or fewer.
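To prepare a list for pasting into the web service, the SNP names of the best-ranked entries can be pulled out of the SNPrank output. The sketch below works on an invented toy file and assumes a layout this tutorial does not spell out: one header line, the SNP name in the first column, and rows already ordered best first.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Toy stand-in for ranked-top1k.snprank; assumed layout: a header line,
# then SNP name and score, best-ranked first.
printf 'SNP\tSNPrank\nrs4\t0.40\nrs2\t0.35\nrs3\t0.25\n' > ranked-top1k.snprank

max_snps=100

# Skip the header, print only the SNP names, and stop after max_snps rows.
awk -v max="$max_snps" 'NR > 1 && NR <= max + 1 { print $1 }' \
    ranked-top1k.snprank > snps_for_snp2gene.txt

cat snps_for_snp2gene.txt
```

The resulting file holds one SNP per line, capped at the 100 entries suggested above.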