* FILE:    README.txt
* AUTHOR:  Bill Howells
* DATE:    12/8/2014
* PURPOSE: document output of sas datasets to csv for NICSNP drupal download page

1. input files:
   data.sas7bdat:    NICSNP full marker data
   data_cg.sas7bdat: NICSNP candidate gene data

2. output files:
   data_cellid.csv:    NICSNP full marker data, id variable renamed to cell_id, same values
   data_cg_cellid.csv: NICSNP candidate gene data, id variable renamed to cell_id, same values
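
   A minimal sketch of the header rename, in Python rather than the SAS that was
   actually used. The function name rename_first_column and the toy rows are
   hypothetical; only the ind_id to cell_id rename comes from this README. It
   changes the first header cell and passes every data value through untouched.

```python
import csv
import io

def rename_first_column(src, dst, old="ind_id", new="cell_id"):
    """Copy CSV rows from src to dst, renaming only the first header cell."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader)
    if header[0].lower() != old:
        raise ValueError(f"expected first column {old!r}, found {header[0]!r}")
    header[0] = new
    writer.writerow(header)
    writer.writerows(reader)  # data values are copied unchanged

# Toy input: the ids, the afd00001 column and the sex column are made up.
src = io.StringIO("ind_id,afd00001,sex\n1001,A/T,1\n1002,,2\n")
dst = io.StringIO()
rename_first_column(src, dst)
print(dst.getvalue())
```

   For real files, open the source and destination with newline="" as the csv
   module documentation recommends.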

3. NICSNP candidate gene dataset data_cg.sas7bdat was output to csv with the SAS export wizard

4. because it is so wide, the full marker dataset data.sas7bdat was output with
   a custom SAS export program named ~/nida/billh/nicotine/NICSNP/programs/
5. the only difference between the original files, e.g. data.sas7bdat, and the new
   files, e.g. data_cellid.csv, is that the first column was renamed from ind_id to
   cell_id for compatibility with the distribution file's column names.  confusingly,
   the values under ind_id in the original files were in fact cell line ids, identical
   to the values under cell_id in the distribution file.  the rename makes it possible
   to merge the marker data with the distribution file data using a simple join
   such as . . .

   select * from data_cellid a left join nicotine_dist8_0 b
     on a.cell_id=b.cell_id
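
   The join above can be tried on toy data with Python's built-in sqlite3
   module. This is only an illustration: the table contents, the cell ids C1-C3,
   and the columns afd00001 and sex are made-up stand-ins, and the real merge
   would be run against the actual files.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Hypothetical miniature versions of the two tables named in the join.
con.execute("create table data_cellid (cell_id text, afd00001 text)")
con.execute("create table nicotine_dist8_0 (cell_id text, sex integer)")
con.executemany(
    "insert into data_cellid values (?, ?)",
    [("C1", "A/T"), ("C2", "T/T"), ("C3", None)],
)
con.executemany(
    "insert into nicotine_dist8_0 values (?, ?)",
    [("C1", 1), ("C2", 2)],
)

# Left join keeps every marker row, even when no distribution row matches.
rows = con.execute(
    "select a.cell_id, a.afd00001, b.sex "
    "from data_cellid a left join nicotine_dist8_0 b "
    "on a.cell_id = b.cell_id "
    "order by a.cell_id"
).fetchall()
print(rows)
```

   Because it is a left join, a cell_id present only in the marker data (C3 in
   the toy data) is kept, with the distribution columns left empty.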

6. the files were then archived with the zip program and transferred securely
   to the drupal virtual machine

7. original documentation follows.

Scott Saccone


data.sas7bdat: 1 observation per individual.

The variables afdNNNNN are the genotypes, where NNNNN is the internal
Perlegen ID, stored in the variable 'snp_id' in the dataset
'markers.sas7bdat' described below.

 * The genotypes were converted from "A T" format to "A/T", and I made
   sure these were reported alphabetically, so it would read "A/T"
   instead of "T/A".

 * If an "N" occurred in the genotype, this variable was set to missing.

 * The remaining variables are phenotypic and were derived from the
   dataset perlegen_all_v4.sas7bdat which was downloaded from the WU
   NICSNP web site.
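
The genotype recoding described in the first two bullets above can be
sketched as follows. This is an illustration in Python, not the code that
produced the dataset. The function name recode_genotype is hypothetical, and
treating anything other than exactly two alleles as missing is an added
assumption.

```python
def recode_genotype(raw):
    """Convert "A T" to "A/T" with alleles in alphabetical order.

    Returns None (missing) if either allele is "N". Assumption: any value
    that does not split into exactly two alleles is also treated as missing.
    """
    alleles = raw.split()
    if len(alleles) != 2 or "N" in alleles:
        return None
    return "/".join(sorted(alleles))

examples = ["A T", "T A", "G G", "A N", "N N"]
recoded = [recode_genotype(g) for g in examples]
print(recoded)
```

Sorting the two alleles is what guarantees that "T A" and "A T" both come out
as "A/T".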


markers.sas7bdat: 1 observation per SNP.

 * All of the variables except for final_cg_snp_list and
   final_sel_cand_genes were taken from a file we received from
   Perlegen. These variables are documented in NIDA_IG_snp_info.doc.

 * The variable final_cg_snp_list is binary (0/1) and indicates if the
   SNP is on the final list of SNPs selected for candidate genes. This
   information represents an update to the information found in the
   variable sel_cand_genes. The variable final_sel_cand_genes
   indicates which genes the SNP was selected for, except when the
   value is "NIDA_CHOSEN" which indicates the SNP was selected for
   other reasons.


data_cg.sas7bdat: data only for the SNPs selected for candidate genes, according
to the variable 'final_cg_snp_list' in the markers dataset.
The format of this dataset is the same as data.sas7bdat.
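
A rough sketch of how such a subset could be derived from the markers
dataset, written in Python for illustration only. It assumes that the
genotype variable name is the prefix "afd" followed by snp_id exactly as
stored, and that every variable starting with "afd" is a genotype; neither
assumption is confirmed here. The helper names and toy values are
hypothetical.

```python
def candidate_gene_columns(markers, prefix="afd"):
    """Genotype variable names for SNPs flagged final_cg_snp_list == 1."""
    return [
        prefix + str(m["snp_id"])
        for m in markers
        if int(m["final_cg_snp_list"]) == 1
    ]

def subset_row(row, keep, prefix="afd"):
    """Keep all non-genotype variables plus the selected genotype variables."""
    keep = set(keep)
    return {k: v for k, v in row.items() if not k.startswith(prefix) or k in keep}

# Toy stand-ins for rows of the markers and full marker datasets.
markers = [
    {"snp_id": "00001", "final_cg_snp_list": 1},
    {"snp_id": "00002", "final_cg_snp_list": 0},
    {"snp_id": "00003", "final_cg_snp_list": 1},
]
row = {"cell_id": "C1", "afd00001": "A/T", "afd00002": "C/G",
       "afd00003": None, "sex": 1}

keep = candidate_gene_columns(markers)
cg_row = subset_row(row, keep)
print(keep)
print(cg_row)
```

The id and phenotype variables pass through unchanged, which matches the
statement that the candidate gene dataset has the same format as the full one.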