Encore Tutorial

This tutorial will demonstrate a variety of commands to execute in order to analyze a GWAS data set using Encore. In particular, this tutorial will aim to demonstrate how to:

  • Prune a data set using Linkage Disequilibrium
  • Filter a data set using linear modeling and Evaporative Cooling techniques
  • Construct a reGAIN matrix to analyze interactions between SNPs
  • Create a .sif file to be visually analyzed in Cytoscape
  • Determine the most relevant SNPs to the phenotype from the data set using SNPrank

Encore enables users to take a SNP data set and apply a variety of statistical and computational methods to better understand the contributions of SNPs, genes, and pathways to the onset of a particular phenotype. ReGAIN and SNPrank, novel components of Encore, are used to build a SNP-SNP interaction matrix and determine the most relevant SNPs to the particular phenotype, respectively. Additionally, since file types are consistent throughout the software packages, Encore implements many PLINK and EC library commands and procedures.

Once Encore has been installed, the command
encore --help
displays the following usage screen:

Encore - a tool for analysis of GWAS and other biological data.
Usage:  encore -i snpdata.ped [mode] -o output-prefix:

 -i [ --input-file ] arg       Input GWAS file (.bed or .ped) or GAIN/reGAIN
                               matrix (tab- or comma-separated)
 -o [ --output-prefix ] arg    Prefix to use for all output files
 -n [ --numeric ] arg          Numeric file for quantitative data (uses PLINK
                               covariate file format)
 -d [ --data-summary ]         Simply print input file stats (for PLINK
                               .ped/.bed files) *mode*
 -s [ --snprank ]              Perform SNPrank analysis *mode*
 --gamma arg (=0.85)           SNPrank algorithm damping factor
 -g [ --gain ]                 Calculate GAIN *mode*
 -r [ --regain ]               Calculate regression GAIN *mode*
 --compress-matrices           Write binary (compressed) reGAIN matrices
 --sif-threshold arg (=0.05)   Numerical cutoff for SIF file (generated by
                               reGAIN) interaction scores
 --fdr-prune                   FDR prune reGAIN interaction terms
 --fdr arg (=0.5)              FDR value for BH method applied to reGAIN
 -e [ --ec ]                   Perform Evaporative Cooling (EC) analysis
                               *mode*
 --ec-algorithm arg            EC ML algorithm (all|rf|rj)
 --ec-snp-metric arg           EC SNP metric (gm|am)
 --ec-num-target arg           EC target number of attributes to keep
 --extract arg                 Extract list of SNPs from specified file
 --remove arg                  Remove list of individuals from specified file
 --keep arg                    Keep list of individuals from specified file
 --exclude arg                 Exclude list of SNPs
 --prune                       Remove individuals with missing phenotypes
 --covar arg                   Include covariate file in analysis
 --pheno arg                   Include alternate phenotype file in analysis
 --make-bed                    Make .bed, .fam and .bim *mode*
 --ci arg (=0.95)              Confidence interval for CMH odds ratios
 --assoc                       Case/control, QTL association *mode*
 --linear                      Test for quantitative traits and multiple
                               covariates *mode*
 --logistic                    Test for disease traits and multiple covariates
                               *mode*
 --model                       Cochran-Armitage and full-model C/C association
                               *mode*
 --model-trend                 Use CA-trend test from model *mode*
 --model-gen                   Use genotypic test from model *mode*
 --model-dom                   Use dominant test from model *mode*
 --model-rec                   Use recessive test from model *mode*
 --freq                        Allele frequencies *mode*
 --counts                      Modifies --freq to report actual allele counts
 --missing                     Missing rates (per individual, per SNP) *mode*
 --missing-genotype arg (="0") Missing genotype code
 --maf arg (=0.0)              Minor allele frequency
 --geno arg (=1)               Maximum per-SNP missing
 --mind arg (=1)               Maximum per-person missing
 --hwe arg (=0.001)            Hardy-Weinberg disequilibrium p-value (exact)
 --hwe2 arg (=0.001)           Hardy-Weinberg disequilibrium p-value
                               (asymptotic)
 --1                           0/1 unaffected/affected coding
 --filter-founders             Include only founders
 --map3                        Specify 3-column MAP file format
 --no-sex                      PED file does not contain column 5 (sex)
 --allow-no-sex                Do not set ambiguously-sexed individuals
                               missing
 --no-parents                  PED file does not contain columns 3,4 (parents)
 --no-fid                      PED file does not contain columns 1 (family ID)
 --r                           Pairwise SNPxSNP LD (r) *mode*
 --r2                          Pairwise SNPxSNP LD (r^2) *mode*
 -l [ --ld-prune ]             Linkage disequilibrium (LD) pruning *mode*
 -h [ --help ]                 display this help screen

This help page displays the various command-line parameters supported by Encore. As can be seen in the text, several commonly used arguments have convenient short versions, e.g.
-i
for
--input-file
. The rest of the tutorial will illustrate the usage of several of these parameters on a sample data set. To convey the example, this tutorial assumes the creation of a directory that contains three files: a .bed, .fam, and .bim, all of which have the same prefix. This tutorial will assume the file prefix
testing
for the test data files.
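Before running any of the commands below, it can help to confirm that all three files of the binary fileset are present under the same prefix. The following is a minimal shell sketch of such a check. The empty placeholder files and the scratch directory are created only so that the snippet runs on its own; they are not part of Encore and should be skipped when the real data is in place.

```shell
# Work in a scratch directory so the placeholder files do not touch real data.
cd "$(mktemp -d)"
prefix=testing

# Placeholder files standing in for the real data set (illustration only).
touch "$prefix.bed" "$prefix.bim" "$prefix.fam"

# Report any file of the binary fileset that is missing.
missing=0
for ext in bed bim fam; do
  if [ ! -f "$prefix.$ext" ]; then
    echo "missing: $prefix.$ext"
    missing=$((missing + 1))
  fi
done
echo "files missing: $missing"
```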

A natural first step on a raw data set is Linkage Disequilibrium (LD) pruning. This feature uses the original PLINK library implementation and reduces the data set to a subset of SNPs that are not in strong LD with one another. The command-line execution is straightforward:

encore -i testing.bed --ld-prune -o LDlist

In the command above, the --ld-prune argument performs the LD-pruning procedure. While the
-i
(
--input-file
) and
-o
(
--output-prefix
) are fairly self-explanatory, a noteworthy feature is that the input file is specified with its file extension. While other libraries (such as PLINK) do not always require the extension, its inclusion in our code allows for greater specificity. We feel that separate flags for plaintext and binary files (
--file
and
--bfile
in PLINK) can be confusing and frustrating, and a single
--input-file
flag can intelligently handle several types of files.

The output of this data pruning is two files:
LDlist.prune.in
and
LDlist.prune.out
. The first file is simply a list of SNPs that passed the LD prune while the
.out
file is a list of all the SNPs that were excluded during the pruning.

These files prove useful for reconstructing one’s data set. Arguments borrowed from PLINK can be used for this task by either extracting or excluding a certain SNP list. The example command shown below uses --extract with the SNPs that passed the LD prune (in
LDlist.prune.in
), but an analogous command could easily be written using
--exclude LDlist.prune.out
, which removes the SNPs that failed the LD pruning. Though these two parameters work in opposite ways, they produce an identical data set.

encore -i testing.bed --extract LDlist.prune.in --make-bed -o LDdata

While the possibility exists that an LD prune could remove a SNP that would be useful for further analysis, this method remains beneficial in beginning an analysis.
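The equivalence of --extract and --exclude described above can be checked with standard shell tools before committing to either form. The sketch below is an illustration only: the five-SNP .bim file and the two prune lists are hypothetical stand-ins, and the intermediate file names (kept_by_extract.txt, kept_by_exclude.txt) are made up for this example.

```shell
# Work in a scratch directory so the stand-in files do not touch real data.
cd "$(mktemp -d)"

# Hypothetical stand-ins: a five-SNP .bim file and the two LD-prune lists.
# .bim columns: chromosome, SNP ID, genetic distance, position, allele 1, allele 2.
printf '1\trs1\t0\t100\tA\tG\n1\trs2\t0\t200\tC\tT\n1\trs3\t0\t300\tA\tC\n1\trs4\t0\t400\tG\tT\n1\trs5\t0\t500\tA\tT\n' > testing.bim
printf 'rs1\nrs3\nrs4\n' > LDlist.prune.in
printf 'rs2\nrs5\n' > LDlist.prune.out

# SNPs kept by --extract: exactly the IDs in the .prune.in list.
sort LDlist.prune.in > kept_by_extract.txt

# SNPs kept by --exclude: every ID in the .bim file that is not in .prune.out.
awk 'NR == FNR { drop[$1]; next } !($2 in drop) { print $2 }' \
    LDlist.prune.out testing.bim | sort > kept_by_exclude.txt

# cmp exits with status 0 when the two lists are identical.
if cmp -s kept_by_extract.txt kept_by_exclude.txt; then
  echo "identical SNP sets"
else
  echo "SNP sets differ"
fi
```

If the two lists differ on real data, the prune files probably came from a different input file than the one being filtered.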

After pruning a data set, the application of a filter optimizes the execution of other features in this package. One method of filtering relies on a linear pairwise calculation, which can optionally accept a covariate file (an example of using a covariate file in a filter will be shown later in this tutorial). For example, one may use a filter on the data to remove extraneous SNPs from the LD-pruned data set. This filter can be executed with the following command:

encore -i LDdata.bed --linear -o lfilteredLDdata

Another parameter,
--pheno
, allows for a file such as
testing-altpheno.txt
to be used in the command if the phenotype specified in the
.fam
file that accompanies the .bed input is different from the one desired. This parameter can be easily integrated into the method of filtering shown above, as well as with many other commands. Modifying the previous command to account for the alternate phenotype file would yield:

encore -i LDdata.bed --linear --pheno testing-altpheno.txt -o lfilteredLDdata2

A key feature of this method of filtering is the ability to incorporate covariate files in the filtering. For example, the file
testing.cov
is a numeric covariate file that could be included in the filtering process. The modification of the original filtering command would yield:

encore -i LDdata.bed --covar testing.cov --linear -o lfilteredLDdata3

which produces a filtered file that can be used in subsequent steps.
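Because the covariate file is matched to individuals by family and individual ID, it may be worth checking that every individual in the data set has a covariate row. The sketch below assumes the PLINK covariate layout (family ID, individual ID, then one column per covariate, no header row). The sample files and the output name no_covariates.txt are hypothetical and exist only to make the snippet runnable.

```shell
# Work in a scratch directory so the stand-in files do not touch real data.
cd "$(mktemp -d)"

# Hypothetical stand-ins. .fam columns: FID IID father mother sex phenotype.
printf 'F1 I1 0 0 1 2\nF2 I2 0 0 2 1\nF3 I3 0 0 1 2\n' > LDdata.fam
# Covariate file: FID, IID, then one covariate (age); individual I3 is absent.
printf 'F1 I1 34\nF2 I2 51\n' > testing.cov

# Print every individual in the .fam file that has no covariate row.
awk 'NR == FNR { seen[$1 FS $2]; next } !(($1 FS $2) in seen) { print $1, $2 }' \
    testing.cov LDdata.fam > no_covariates.txt

cat no_covariates.txt
```

With the sample data above, the only individual reported is F3 I3.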

A different filtering mechanism unique to Encore is the Evaporative Cooling method, or EC. While EC offers a more sophisticated filtering algorithm, it does not currently allow covariate data to be considered in the filtering process. EC can be run on the previously generated LD-pruned data set with the following command:

encore -i LDdata.bed --ec -o ec-out

This command produces
ec-out.ec
which contains two columns: one with the list of SNPs and a corresponding column that lists the EC score. This file can be easily modified to produce a top SNP list to be used in subsequent methods. However, EC offers an additional parameter
--ec-num-target
that can filter the results to the top X SNPs, where X is the argument to the parameter. For instance, if one wanted the top 1,000 SNPs from the
LDdata
data set, the following command would produce a file similar to
ec-out.ec
but would only include the top 1,000 SNPs and their corresponding scores.

encore -i LDdata.bed --ec --ec-num-target 1000
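If the full ec-out.ec file is already on disk, a top-N SNP list can also be derived from it with standard shell tools instead of rerunning EC. The sketch below rests on two assumptions that should be verified against a real output file: that the score is in the first column and the SNP ID in the second (as the awk command used later in this tutorial implies), and that higher scores indicate more relevant SNPs. The four-line sample file is hypothetical.

```shell
# Work in a scratch directory so the stand-in file does not touch real data.
cd "$(mktemp -d)"

# Hypothetical EC output: score in column 1, SNP ID in column 2 (assumed layout).
printf '0.12\trs1\n0.87\trs2\n0.45\trs3\n0.91\trs4\n' > ec-out.ec

top=2   # number of SNPs to keep; 1000 in the running example

# Sort by score, highest first, keep the first $top rows, and write the SNP IDs.
sort -k1,1nr ec-out.ec | head -n "$top" | awk '{ print $2 }' > ec_topsnps.txt

cat ec_topsnps.txt
```

Note that sort -n does not understand scientific notation; if the scores are written that way, GNU sort's -g option is one alternative.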

The next step in the process involves the creation of the GAIN (Genetic Association Interaction Network). In particular, regression GAIN, or reGAIN, creates a matrix that measures the interaction of any two particular SNPs through a linear or logistic regression computation. Though measures can be taken to reduce the size of the resulting matrix, computing the interactions between more than 10,000 SNPs is not advised, as the computational and memory requirements can be taxing. To create a reGAIN matrix on LDdata using the top 1,000 SNPs generated from the EC filter, the following command is used:

encore -i LDdata.bed --regain --extract ec_topsnps.txt -o ec-top1k

, where
ec_topsnps.txt
is the second column from the
ec-out.ec
file. (This text file can be easily created using the command
awk '{print $2}' ec-out.ec > ec_topsnps.txt
)
This command will produce several output files, notably
ec-top1k.regain
and
ec-top1k.sif
. While the
.regain
file will be re-used in examples below, the
.sif
file can be imported as a network file in Cytoscape. Cytoscape creates a visual representation of the data generated from the reGAIN analysis, enabling the user to visualize the pairwise interactions between SNPs in a more natural setting.
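Before importing a large network into Cytoscape, a quick count of nodes and edges can show whether the --sif-threshold cutoff left a manageable graph. The sketch below relies only on the general three-column SIF layout (source node, interaction label, target node). That Encore writes the interaction score as the middle column is an assumption here, and the sample file and the output name sif_summary.txt are hypothetical.

```shell
# Work in a scratch directory so the stand-in file does not touch real data.
cd "$(mktemp -d)"

# Hypothetical SIF content: source SNP, interaction label (assumed to be the
# reGAIN score), target SNP.
printf 'rs1\t0.31\trs2\nrs1\t0.08\trs3\nrs2\t0.22\trs4\n' > ec-top1k.sif

# Count distinct nodes (columns 1 and 3) and edges (one per line).
awk '{ nodes[$1]; nodes[$3]; edges++ }
     END { n = 0; for (k in nodes) n++; print n " nodes, " edges " edges" }' \
    ec-top1k.sif > sif_summary.txt

cat sif_summary.txt
```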

To further reduce the size of the reGAIN matrix output file, the command argument
--compress-matrices
can be employed, producing the output file
ec-top1k.regain.gz
from the command:

encore -i LDdata.bed --regain --compress-matrices --extract ec_topsnps.txt -o ec-top1k

The binary or compressed reGAIN matrix file is essentially the same file format as the plaintext version, but uses roughly one-third of the disk space.
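A compressed matrix can be inspected without unpacking it to disk. The sketch below assumes that the .gz file is ordinary gzip output and that the matrix starts with a header row of SNP IDs followed by one tab-separated row per SNP. Both points are assumptions to check against a real file, and the three-SNP matrix and the output name regain_shape.txt are hypothetical.

```shell
# Work in a scratch directory so the stand-in file does not touch real data.
cd "$(mktemp -d)"

# Hypothetical 3-SNP reGAIN matrix: a header row of SNP IDs, then one row per
# SNP (assumed layout), tab-separated and compressed with gzip.
printf 'rs1\trs2\trs3\n1.2\t0.3\t0.1\n0.3\t2.0\t0.4\n0.1\t0.4\t0.9\n' | gzip > ec-top1k.regain.gz

# Decompress to standard output and compare the header width with the row count.
gzip -dc ec-top1k.regain.gz |
  awk 'NR == 1 { cols = NF } END { print cols " SNPs, " (NR - 1) " rows" }' > regain_shape.txt

cat regain_shape.txt
```

For a square interaction matrix the two numbers should match.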

Once the reGAIN matrix has been created, the next step in this progression is to run SNPrank, another novel feature of Encore. SNPrank uses the
.regain
file to determine which SNPs are most relevant to the particular phenotype. The following command takes the compressed reGAIN matrix computed in the last example and creates a list ranking the most significant SNPs:

encore -i ec-top1k.regain.gz --snprank -o ranked-top1k

This command produces the file
ranked-top1k.snprank
, which displays the top SNPs associated with the phenotype.

By mapping these SNPs to the genes that contain them, one can quickly determine the genes associated with a particular phenotype in a data-driven analysis. This can be accomplished using our convenient snp2gene web service. Simply copy and paste the SNPs to map (one per line), hit search, and the results will be displayed. This service is best used for a handful of SNPs at a time, 100 or fewer.
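When more than about 100 SNPs need to be mapped, the list can be cut into batches that are pasted into the web service one at a time. The sketch below uses the standard split utility. The input file ranked_snps.txt (one SNP ID per line) and the batch_ prefix are hypothetical names chosen for this example.

```shell
# Work in a scratch directory so the stand-in files do not touch real data.
cd "$(mktemp -d)"

# Hypothetical list of 250 SNP IDs, one per line.
awk 'BEGIN { for (i = 1; i <= 250; i++) print "rs" i }' > ranked_snps.txt

# Write consecutive batches of at most 100 lines: batch_aa, batch_ab, batch_ac.
split -l 100 ranked_snps.txt batch_

# Show the number of SNPs in each batch.
wc -l batch_*
```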