ReliefSeq

ReliefSeq is a feature (attribute) selection and ranking algorithm,
written in C++, that handles various types of genetic features,
including combinations of feature data types and endpoints (phenotypes/classes).
For example, SNPs and gene expression features can be analyzed with
discrete or continuous phenotypes (classes). The full list of command line
options is displayed by typing ‘reliefseq’ or ‘reliefseq --help’ at the
command prompt.

Installation

ReliefSeq can be installed from the Insilico ReliefSeq GitHub repository.
Instructions for installation are provided there.

ReliefSeq on Digital Gene Expression (DGE) RNA-Seq Data Sets

ReliefSeq will rank the features (genes) in an RNA-Seq digital gene
expression data set. The features in these data sets are counts, so
ReliefSeq treats them as “numeric” attributes. ReliefSeq can handle numeric
attributes in a general way with the ‘-n’ command line option, but that
route requires a separate phenotype file specified with the ‘-a’
(alternate phenotype file) command line option, and the two files require
ID matching (see ReliefSeq on Gene Expression Data Sets below). To avoid
this requirement, use a special DGE format CSV file that contains the
phenotypes and numeric attributes together. The file header holds the
case-control phenotypes (1/0), one per sample. Each remaining line holds a
gene name followed by its counts. The following shows the first few lines
of an example file:

,0,0,0,0,0,1,1,1,1
7A,24,16,57,30,17,27,36,25,31
A1BG,319,180,288,112,109,233,143,251,169
A1CF,3,1,2,2,0,2,2,0,3
A26C1A,0,1,2,0,0,0,2,0,0
A26C1B,0,0,1,2,3,2,1,2,0
A26C3,3,5,3,5,3,17,3,5,7
A2BP1,1,0,0,4,0,1,1,0,1
A2M,1865,1250,3061,2070,1164,2337,1142,3209,1914
A2ML1,3,1,1,5,0,4,0,2,2
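
The layout shown above can be checked before running ReliefSeq with a
short script. The following is a minimal Python sketch, not part of
ReliefSeq; the function name read_dge_counts is hypothetical, and it
assumes integer counts and integer 1/0 phenotypes as in the example.

```python
import csv
import io


def read_dge_counts(handle):
    """Parse a DGE counts CSV: phenotype header row, then one row per gene."""
    rows = csv.reader(handle)
    header = next(rows)
    # The first header cell is empty because it sits above the gene labels.
    phenotypes = [int(value) for value in header[1:]]
    counts = {}
    for row in rows:
        if not row:
            continue  # tolerate blank lines
        gene = row[0]
        values = [int(value) for value in row[1:]]
        if len(values) != len(phenotypes):
            raise ValueError(
                f"{gene}: expected {len(phenotypes)} counts, got {len(values)}"
            )
        counts[gene] = values
    return phenotypes, counts


example = io.StringIO(
    ",0,0,0,0,0,1,1,1,1\n"
    "7A,24,16,57,30,17,27,36,25,31\n"
    "A1BG,319,180,288,112,109,233,143,251,169\n"
)
phenotypes, counts = read_dge_counts(example)
print(phenotypes)
print(counts["A1BG"])
```

A row whose number of counts differs from the number of phenotype labels
raises an error, which catches a missing leading ‘,’ in the header early.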

Note that the first line begins with a leading ‘,’ to skip the gene label
column, so each phenotype label lines up with the sample column below it.
To analyze a file of this format, invoke ReliefSeq with a command line
of the form:

$ reliefseq --dge-counts-data example.csv -o example

This command produces a tab-delimited file named example.reliefseq
that contains a list of genes ranked by Relief-F score. For example:

0.280785	NDUFA
0.280724	MRPS
0.275989	UBE2G
0.250673	P4HB
0.250568	BRUNOL
0.247726	PSMA
0.245764	PHTF
0.242897	ZFP36L2
0.232065	LBH
0.213614	STAT5B
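
For downstream work, the ranked output can be loaded with a few lines of
Python. This is a minimal sketch, not part of ReliefSeq; the function name
read_scores is hypothetical, and it assumes the two-column, tab-delimited
layout (score, then attribute name) shown above.

```python
import io


def read_scores(handle):
    """Return (score, name) pairs from a two-column, tab-delimited file."""
    ranked = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        score_text, name = line.split("\t", 1)
        ranked.append((float(score_text), name))
    return ranked


example = io.StringIO("0.280785\tNDUFA\n0.280724\tMRPS\n0.275989\tUBE2G\n")
ranked = read_scores(example)
top_two = [name for _, name in ranked[:2]]
print(top_two)
```

To read a real result, replace the in-memory example with
open("example.reliefseq").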

We have found that ReliefSeq for RNA-Seq data works best when the
number of nearest neighbors, k, is optimized for each attribute in the
data set. ReliefSeq provides the following command line options for
controlling this optimization:

-k [ --k-nearest-neighbors ] arg (=10)
                                      set k nearest neighbors (0=optimize k)
--kopt-begin arg (=1)                 optimize k starting with kopt-begin
--kopt-end arg (=1)                   optimize k ending with kopt-end
--kopt-step arg (=1)                  optimize k incrementing with kopt-step
--write-best-k                        optimize k, write best k's
--write-each-k-scores                 optimize k, write best scores for each 

For instance, the following command line runs the full range of nearest
neighbors for a data set having 24 cases and 24 controls (maximum of
23 nearest neighbors):

$ reliefseq --dge-counts-data example.csv -k 0 --kopt-begin 1 --kopt-end 23 -o example

For a data set containing over 16,000 genes, this analysis takes about
10 minutes. The resulting scores file ranks the genes by the score each
obtained at its optimum k. The other options, --write-each-k-scores and
--write-best-k, write the scores files for each k tried and a file that
lists the best k found for each gene, respectively.
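
The upper end of the k range can be derived from the phenotype labels. The
sketch below is illustrative Python, not part of ReliefSeq. It assumes the
maximum k is the size of the smallest class minus one, which reproduces
the 24 cases/24 controls example above (23); whether ReliefSeq applies
the same rule to unbalanced data is an assumption. The function names are
hypothetical; the flags are the ones listed above.

```python
from collections import Counter


def max_k(phenotypes):
    """Largest k so that every instance still has k same-class neighbors."""
    class_sizes = Counter(phenotypes)
    # Assumption: an instance cannot be its own neighbor, hence the -1.
    return min(class_sizes.values()) - 1


def kopt_arguments(phenotypes):
    """Build the k-optimization arguments for the full range of k."""
    return ["-k", "0", "--kopt-begin", "1", "--kopt-end", str(max_k(phenotypes))]


phenotypes = [0] * 24 + [1] * 24
print(max_k(phenotypes))
print(" ".join(kopt_arguments(phenotypes)))
```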

ReliefSeq on SNP Data Sets

Related Command Line Options

-s [ --snp-data ] arg                 read SNP attributes from genotype 
                                      filename: txt, ARFF, plink (map/ped, 
                                      binary, raw)
--snp-file-type arg                   Ignore file extension and use type: 
                                      textwhitesp, wekaarff, plinkped, 
                                      plinkbed, plinkraw, dge, birdseed
--snp-metric arg (=gm)                metric for determining the difference 
                                      between subjects (gm|am|nca|nca6)
-B [ --snp-metric-nn ] arg (=gm)      metric for determining the difference 
                                      between subjects (gm|am|nca|nca6|km)
-W [ --snp-metric-weights ] arg (=gm) metric for determining the difference 
                                      between SNPs (gm|am|nca|nca6)

The most basic SNP analysis is to specify a SNP/GWAS data file:

$ reliefseq --snp-data data_file.ext

‘ext’ is used to determine the format of the SNP file. The following
‘ext’ values are recognized by ReliefSeq:

File Extension   Details
txt              tab-delimited header followed by data, class column
                 designated “Class” in the header line (originally the
                 only supported format)
map/ped          PLINK map/ped file; either map or ped is recognized
bed/bim/fam      PLINK binary encoded map/ped; any of bed, bim or fam is
                 recognized
raw              PLINK RAW file from --recodeA PLINK operation (similar
                 to txt format)
arff             Weka attribute relation file format (using nominals
                 encoded {0,1,2})

ReliefSeq writes progress messages to the console (stdout) as the algorithm
runs. The resulting ranked attributes are stored in the file
reliefseq_default.reliefseq. The name of the output file can be changed
with the command line option --out-files-prefix, in which case the prefix is
used to produce output filenames of the form out-files-prefix.reliefseq. The
reliefseq program reports the exact name used to the console. The output
scores file is a two-column, tab-delimited text file of sorted scores and
attribute names.

SNP-only, continuous phenotype (discrete-continuous) Analysis

$ reliefseq --snp-data data_file.ext

If the phenotype in a SNP file is found to be continuous, the regression
ReliefF (RReliefF) algorithm is invoked. The phenotype type is determined
from the phenotypes in the file, or an alternate phenotype file can be
supplied with the --alternate-pheno-file option to override the phenotypes
in the data_file. If the phenotype is “1” or “2” in the case of PLINK files,
or “0” or “1” in the case of txt and ARFF files, the phenotype is considered
case-control. Otherwise, the phenotype is assumed to be continuous. The same
is true of the alternate phenotype file. The alternate phenotype file is a
three-column, tab-delimited text file. This is the same as PLINK’s phenotype
file format and has the following required columns:

FID         family ID
IID         individual ID
PHENOTYPE   value

NOTE: The phenotype file is assumed to have NO HEADER (in contrast to PLINK,
where a header is optional). ADDITIONAL NOTE: See “A Note about IDs” below
for important details about ID matching. The third column of values in the
phenotype file replaces the phenotypes read from the SNP file.
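
The phenotype rules above can be sketched in Python as follows. This is an
illustration, not ReliefSeq’s code: the function names and the plink_coding
parameter are hypothetical, values are compared as text (so “1.0” would
count as continuous here), and missing phenotypes are assumed to have been
removed already.

```python
import io


def read_alternate_pheno(handle):
    """Read FID, IID, PHENOTYPE triples from a headerless, tab-delimited file."""
    records = []
    for line_number, line in enumerate(handle, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {line_number}: expected 3 tab-delimited columns")
        records.append(tuple(fields))
    return records


def phenotype_type(values, plink_coding):
    """Classify phenotype values as 'case-control' or 'continuous'."""
    # PLINK files code case-control as 1/2; txt and ARFF files use 0/1.
    allowed = {"1", "2"} if plink_coding else {"0", "1"}
    if all(value in allowed for value in values):
        return "case-control"
    return "continuous"


example = io.StringIO("fam1\t00000001\t1\nfam1\t00000002\t2\n")
records = read_alternate_pheno(example)
values = [phenotype for _, _, phenotype in records]
print(phenotype_type(values, plink_coding=True))
print(phenotype_type(["0.73", "1.20"], plink_coding=True))
```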

ReliefSeq on Gene Expression (or Other Numeric) Data Sets

Related Command Line Options

-n [ --numeric-data ] arg             read continuous attributes from 
                                      PLINK-style covar file
-N [ --numeric-metric ] arg (=manhattan)
                                      metric for determining the difference 
                                      between numeric attributes 
                                      (manhattan=|euclidean)
-a [ --alternate-pheno-file ] arg     specifies an alternative 
                                      phenotype/class label file; one value 
                                      per line

$ reliefseq -n data_file.dat --alternate-pheno-file discrete_class.pheno

With this combination, numeric attributes such as gene expression or other
continuous genetic measurements can be used. Note that the alternate
phenotype file option is required for numeric-only attributes. The
--numeric-metric command line option specifies the metric used for the
distance between instances. The numeric variables are supplied in the PLINK
covariate file format, although they are not treated as covariates.
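
For reference, the two metric names correspond to the following textbook
distance definitions. This Python sketch shows only the standard formulas;
any scaling or normalization ReliefSeq applies to attribute values before
computing distances is not shown and is not assumed here.

```python
import math


def manhattan(a, b):
    """Sum of absolute attribute differences between two instances."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Square root of the summed squared attribute differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


first = [0.0, 0.0]
second = [3.0, 4.0]
print(manhattan(first, second))
print(euclidean(first, second))
```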

A Note about IDs

As with the phenotype files described above, the ID fields are important
and can be used to filter the data set in various ways through ID matching.
The IID field (second column) must match the IID field in the PLINK format
SNP files or be an eight-character, zero-padded sequence beginning with
‘00000001’ and incrementing by one for each line/instance in the file (for
txt and RAW files). (This encoding ensures a strict ordering of instances.
The ordering affects how ties are broken in the nearest neighbor algorithm,
which in turn matters when validating the algorithm against results from the
Weka machine learning system.) The numeric and phenotype files’ IDs are read
and intersected to find common IDs. Then, if present, the SNP data set is
read, keeping only the IDs that matched with the numeric and phenotype files.
Finally, if any phenotypes are missing from an alternate phenotype file,
these instances are removed from algorithmic consideration. In this way a SNP
and/or numeric file can be used with several different phenotype files with
different individuals. Always read the console output/log carefully to make
sure the number of instances in the final analysis meets expectations.
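
The ID conventions above can be illustrated with a short Python sketch. It
is not ReliefSeq’s code, and the function names are hypothetical. It
generates the eight-character, zero-padded sequence expected for txt and
RAW files and shows how intersecting IDs filters the instances.

```python
def sequential_ids(count):
    """Eight-character, zero-padded IDs starting at 00000001."""
    return [f"{index:08d}" for index in range(1, count + 1)]


def common_ids(numeric_ids, phenotype_ids):
    """IDs present in both files, kept in the numeric file's order."""
    phenotype_set = set(phenotype_ids)
    return [iid for iid in numeric_ids if iid in phenotype_set]


numeric_ids = sequential_ids(4)
phenotype_ids = ["00000002", "00000004", "00000009"]
kept = common_ids(numeric_ids, phenotype_ids)
print(numeric_ids)
print(kept)
```

Here only two of the four numeric instances survive the intersection, which
is the kind of reduction the console output should be checked for.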

ReliefSeq on Integrated Data Sets

Combining both discrete and continuous attributes is referred to as
“integrated analysis”. In the case of integrated attributes, both types of
distance measures are used and can be overridden with the command line
options --snp-metric and --numeric-metric. ReliefF is used when a discrete
class is read from the SNP file. RReliefF is used if continuous phenotypes
are detected (as described above).

A Note on Missing Values

Missing values are handled for all types of data sets supported. Each data set
reader considers the missing encoding(s) for its particular file format.
The following table summarizes the missing genotype values recognized by each
reader.

TXT         9 or ? or empty string
ARFF        ?
PLINK RAW   NA
PLINK PED   ‘0 0’
PLINK BED   bit string ‘10’
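
For the text-based formats, the table above amounts to a small lookup. The
following Python sketch is illustrative only; the dictionary keys and the
function name are hypothetical labels, and the binary BED encoding is
omitted because it is not a text token.

```python
# Missing genotype tokens per text-based reader, taken from the table above.
MISSING_GENOTYPES = {
    "txt": {"9", "?", ""},
    "arff": {"?"},
    "raw": {"NA"},
    "ped": {"0 0"},
}


def is_missing_genotype(file_format, token):
    """True if the token is a missing genotype for the given reader."""
    return token.strip() in MISSING_GENOTYPES[file_format]


print(is_missing_genotype("txt", "9"))
print(is_missing_genotype("raw", "NA"))
print(is_missing_genotype("arff", "2"))
```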

Missing SNP values in ReliefF are handled by algorithms as described in
section “2.2. RELIEFF – EXTENSION” in the paper “Theoretical and Empirical
Analysis of ReliefF and RReliefF”, Machine Learning Journal (2003) 53:23-69.

For continuous values, the normalized difference is used, as in the Weka
machine learning system.

Missing phenotypes cause the file reader to skip the individual/instance with
a warning message, which reduces the number of instances reported in the
program output. For TXT and ARFF files, a phenotype encoded as -9 is treated
as missing. For PLINK formats, missing phenotypes are 0 or -9 for SNPs and
-9 for continuous phenotypes.

A Note on Weighting by Distance in ReliefF

When computing nearest neighbors, the influence of distance between an
instance and its nearest neighbors is considered equal by default; that is,
the distances are used only to rank the neighbors. In both SNPs and continuous
attributes, the influence of each ranked neighbor can be taken into
consideration by applying a weighting factor to each neighbor’s distance.
This is particularly important in the case of regression ReliefF, since it
uses the distance between instances as a way of making a kind of
hits-and-misses analogy to the standard ReliefF algorithm. For more details
see section “2.3. RRELIEFF – IN REGRESSION” in the paper “Theoretical and
Empirical Analysis of ReliefF and RReliefF”, Machine Learning Journal
(2003) 53:23-69. See “Overriding ReliefF Default Algorithm Parameters” above
for the command line options for using this feature.
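
As an illustration of the idea, the cited paper weights each neighbor by
its rank with an exponentially decreasing factor, exp(-(rank/sigma)^2),
normalized so the weights sum to one. The Python sketch below follows that
form. Whether ReliefSeq uses exactly this form, and what value of sigma it
uses, are assumptions here; the function names are hypothetical.

```python
import math


def equal_weights(k):
    """Default behavior: every one of the k neighbors counts the same."""
    return [1.0 / k] * k


def rank_weights(k, sigma):
    """Weights that decay with neighbor rank (rank 1 is the nearest)."""
    raw = [math.exp(-((rank / sigma) ** 2)) for rank in range(1, k + 1)]
    total = sum(raw)
    return [value / total for value in raw]


print(equal_weights(4))
print(rank_weights(4, sigma=2.0))
```

A smaller sigma concentrates the influence on the nearest neighbors, while a
large sigma approaches the equal-weight default.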

A Note on Multiclass Phenotypes

Multiclass phenotypes are implemented in the ReliefF C++ class; however, the
PLINK data set readers restrict phenotypes to case-control, since PLINK does
not support multiclass phenotypes and this feature has not been needed.
Multiclass is supported in TXT and ARFF formats, though results are
unpredictable if the class column is not coded as integers (the first character
of whatever is read as the class column is converted to an integer, effectively
limiting classes to ten levels: 0-9).
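
The consequence of the first-character conversion can be seen in a short
Python sketch. It mimics the behavior described above and is not the
C++ implementation; the function name is hypothetical, and how the real
reader treats a non-digit first character is not covered.

```python
def class_from_label(label):
    """Convert only the first character of the class label to an integer."""
    return int(label[0])


labels = ["0", "1", "2", "10", "12"]
classes = [class_from_label(label) for label in labels]
print(classes)
```

Labels “10” and “12” both collapse onto class 1, so class labels should be
recoded to single digits 0-9 before analysis.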