GAIN/SNPrank analysis pipeline

Introduction

GAIN and SNPrank help identify the single nucleotide polymorphisms
(SNPs) most relevant to a given phenotype. Used together with PLINK (a
third-party tool), they form an analysis pipeline that ranks SNPs by
their relevance to a specified phenotype.

Assumptions

The input is assumed to be SNP genotype data in CSV format with a
single phenotype column. Both GAIN and SNPrank assume that the input
has first been pre-processed with PLINK, a free, open-source
command-line tool for whole genome association analysis.

Dependencies

1. PLINK binaries for Mac, Linux, and Windows can be downloaded at
http://pngu.mgh.harvard.edu/~purcell/plink/download.shtml

2. Python is required for the GAIN and SNPrank tools. Version 2.6.5 has
been tested, but 2.7.x should also work. Python is available for
several platforms at:
http://python.org/download

3. csv2plink.py is a Python script that converts an input CSV file to
corresponding PLINK .map and .ped files. This script is hosted on
GitHub at
http://github.com/insilico/converters

4. SNPrank requires the NumPy numerical computation library for linear
algebra operations. SNPrank was tested with NumPy 1.3.0. Information on
installing NumPy is here:
http://numpy.scipy.org

5. For optional GPU support, SNPrank requires the CUDA drivers
available from NVIDIA at
http://developer.nvidia.com/object/cuda_download.html

6. Additionally, a Python CUDA matrix library called CUDAMat is used
for the linear algebra operations on the GPU. CUDAMat is hosted on
Google Code and can be downloaded at http://code.google.com/p/cudamat

Downloading the tools

pygain (Python implementation of GAIN) can be downloaded from GitHub at
http://github.com/insilico/pygain
Click the Downloads button on the right for the latest tagged release
(0.1.0 as of this writing).

pysnprank (Python implementation of SNPrank) is also available on
GitHub at
http://github.com/insilico/pysnprank
Click the Downloads button on the right for the latest tagged release
(0.1.0 as of this writing).
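
The instructions below invoke plink, csv2plink.py, gain.py, and
snprank.py by name, so each one needs to be reachable on the PATH. The
following is a minimal sketch of a check for that, not part of the
tools themselves. The function name find_missing_tools is hypothetical,
and the sketch assumes Python 3.3 or later (for shutil.which), which is
newer than the Python 2.6.5 the tools were tested with, so it is meant
to be run as a separate helper.

```python
import shutil

# Commands used in the Instructions section; assumed to be on the PATH.
PIPELINE_TOOLS = ["plink", "csv2plink.py", "gain.py", "snprank.py"]


def find_missing_tools(tools=PIPELINE_TOOLS):
    """Return the names in `tools` that cannot be found on the PATH."""
    # shutil.which returns None when no matching executable is found.
    return [name for name in tools if shutil.which(name) is None]


missing = find_missing_tools()
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("All pipeline tools were found on PATH.")
```

An empty result only shows that the commands can be located. It does
not confirm their versions or that NumPy is installed for the Python
interpreter the tools run under.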

Instructions

1. Convert the input CSV to PLINK .map and .ped files using the
csv2plink.py utility

$ csv2plink.py -c 1 -p 1 sample-data.csv sample-data

2. Use PLINK to recode the .map and .ped files as a .raw file

$ plink --file sample-data --recodeA --map3 --out sample-data

If PLINK reports errors during the recode, try adding options that
tell it which columns are absent from the .ped file and how missing
genotypes are coded:

$ plink --file sample-data --recodeA --map3 --no-sex --no-fid --no-parents --missing-genotype ? --out sample-data
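
Before passing the .raw file to GAIN, it can help to confirm that the
recode produced what was expected. The sketch below assumes the usual
layout of a PLINK additive .raw file: a whitespace-delimited header
beginning with FID IID PAT MAT SEX PHENOTYPE, followed by one column
per SNP. The function name summarize_raw and the sample rows are
hypothetical and are used only for illustration.

```python
import io

# Leading columns PLINK writes before the per-SNP columns in a .raw file.
RAW_FIXED_COLUMNS = ["FID", "IID", "PAT", "MAT", "SEX", "PHENOTYPE"]


def summarize_raw(handle):
    """Return (sample count, list of SNP column names) for a .raw file."""
    header = handle.readline().split()
    if header[:6] != RAW_FIXED_COLUMNS:
        raise ValueError("unexpected .raw header: %r" % (header[:6],))
    snps = header[6:]
    samples = 0
    # Data rows start on line 2 of the file.
    for line_number, line in enumerate(handle, 2):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != len(header):
            raise ValueError("line %d has %d fields, expected %d"
                             % (line_number, len(fields), len(header)))
        samples += 1
    return samples, snps


# Made-up example data; a real file would be opened with open(path).
example = io.StringIO(
    u"FID IID PAT MAT SEX PHENOTYPE rs1_A rs2_G\n"
    u"1 1 0 0 0 1 0 1\n"
    u"2 2 0 0 0 2 2 NA\n"
)
count, snp_names = summarize_raw(example)
print("%d samples, %d SNPs" % (count, len(snp_names)))
```

For a real run, replace the example with
open("sample-data.raw") and compare the counts against the original
CSV.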

3. Once the PLINK command successfully generates the .raw file, run
GAIN on the PLINK .raw data

$ gain.py -i sample-data.raw -o sample-data.gain

4. Run SNPrank on the GAIN matrix data to output a ranked list of
SNPs

$ snprank.py -i sample-data.gain -o sample-data-rankings.txt
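
The four steps above can also be chained from a small Python script.
This is a sketch rather than part of the distributed tools: the
function names build_commands and run_pipeline are hypothetical, the
command-line options are copied from the steps above, and the scripts
are assumed to be executable and on the PATH of a Unix-like system.

```python
import subprocess


def build_commands(csv_path, prefix):
    """Return the four pipeline commands, in order, as argument lists."""
    return [
        # Step 1: CSV to PLINK .map and .ped
        ["csv2plink.py", "-c", "1", "-p", "1", csv_path, prefix],
        # Step 2: recode .map and .ped as a .raw file
        ["plink", "--file", prefix, "--recodeA", "--map3", "--out", prefix],
        # Step 3: build the GAIN matrix from the .raw file
        ["gain.py", "-i", prefix + ".raw", "-o", prefix + ".gain"],
        # Step 4: rank SNPs from the GAIN matrix
        ["snprank.py", "-i", prefix + ".gain",
         "-o", prefix + "-rankings.txt"],
    ]


def run_pipeline(csv_path, prefix):
    """Run each step in turn, stopping at the first one that fails."""
    for command in build_commands(csv_path, prefix):
        print("Running: " + " ".join(command))
        # check_call raises CalledProcessError on a non-zero exit status.
        subprocess.check_call(command)
```

Calling run_pipeline("sample-data.csv", "sample-data") would run the
same commands as steps 1 through 4. Because check_call raises an
exception when a command fails, later steps do not run against
missing or incomplete output.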