***********************************************************************************;
* FILE: README.txt
* AUTHOR: Bill Howells
* DATE: 12/8/2014
* PURPOSE: document output of sas datasets to csv for NICSNP drupal download page
***********************************************************************************;
NOTES:

1. input files:
data.sas7bdat: NICSNP full marker data
data_cg.sas7bdat: NICSNP candidate gene data

2. output files:
data_cellid.csv: NICSNP full marker data, id variable renamed to cell_id, same values
data_cg_cellid.csv: NICSNP candidate gene data, id variable renamed to cell_id, same values

3. NICSNP candidate gene dataset data_cg.sas7bdat was output to csv with the SAS export wizard

4. because it is very wide, the full marker dataset data.sas7bdat was output with
a custom SAS export program named ~/nida/billh/nicotine/NICSNP/programs/
NICSNP_drupal_output_marker_cell_id.sas

5. the only difference between the original files, e.g. data.sas7bdat, and the new
files, e.g. data_cellid.csv, is that the first column was renamed from ind_id to
cell_id for compatibility with the distribution file's column names. confusingly, the
values under ind_id in the original files were in fact cell line ids, identical
to the values under the cell_id column in the distribution file. the rename makes
it possible to merge the marker data with the distribution file data using a
simple join such as . . .

select * from data_cellid a left join nicotine_dist8_0 b
on a.cell_id=b.cell_id
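
The rename and the join above can also be sketched outside SAS. The following
is a minimal illustration in Python (standard library only), not the program
that was actually used. The toy rows, the column afd00001, the 'status' column
and the helper name rename_first_column are all made up for the example; only
the table names and the ind_id/cell_id columns come from the notes above.

```python
import csv
import io
import sqlite3


def rename_first_column(csv_text, old="ind_id", new="cell_id"):
    """Rename the first header column; all data values are left untouched."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if rows[0][0] != old:
        raise ValueError(f"expected first column {old!r}, found {rows[0][0]!r}")
    rows[0][0] = new
    return rows


# Made-up toy data; the real marker file is far wider.
marker_csv = "ind_id,afd00001\nC1,A/T\nC2,\nC3,T/T\n"
dist_rows = [("C1", "case"), ("C2", "control")]  # hypothetical 'status' column

marker_rows = rename_first_column(marker_csv)

# Load both toy tables into an in-memory database and run the same left join.
con = sqlite3.connect(":memory:")
con.execute("create table data_cellid (cell_id text, afd00001 text)")
con.executemany("insert into data_cellid values (?, ?)", marker_rows[1:])
con.execute("create table nicotine_dist8_0 (cell_id text, status text)")
con.executemany("insert into nicotine_dist8_0 values (?, ?)", dist_rows)

joined = con.execute(
    "select a.cell_id, a.afd00001, b.status "
    "from data_cellid a left join nicotine_dist8_0 b "
    "on a.cell_id = b.cell_id order by a.cell_id"
).fetchall()
con.close()
```

Because this is a left join, every row of the marker data is kept; a cell_id
with no match in the distribution table (C3 in the toy data) gets a missing
value for the distribution columns.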

6. the files were then archived with the zip program and transferred securely
to the drupal virtual machine

7. original documentation follows.

Scott Saccone
3/16/06

data.sas7bdat
-------------

1 observation per individual.

The variables afdNNNNN are the genotypes, where NNNNN is the internal
Perlegen ID, stored as the variable 'snp_id' in the dataset
'markers.sas7bdat' described below.

* The genotypes were converted from "A T" format to "A/T", and I made
sure these were reported alphabetically, so it would read "A/T"
instead of "T/A".

* If an "N" occurred in the genotype, the variable was set to missing
("").

* The remaining variables are phenotypic and were derived from the
dataset perlegen_all_v4.sas7bdat which was downloaded from the WU
NICSNP web site.
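
The genotype recoding described in the first two bullets above can be
sketched as follows. This is a hedged illustration in Python, not the original
conversion code; the function name normalize_genotype is hypothetical, and
treating any value that does not split into exactly two alleles as missing is
an assumption that the notes do not state.

```python
def normalize_genotype(raw):
    """Convert an 'A T' style genotype to 'A/T' with alleles in alphabetical order.

    Returns "" (missing) if either allele is 'N'.
    Assumption: values that do not contain exactly two alleles are also
    returned as missing.
    """
    alleles = raw.split()
    if len(alleles) != 2 or "N" in alleles:
        return ""
    return "/".join(sorted(alleles))


examples = [normalize_genotype(g) for g in ["A T", "T A", "C C", "N T"]]
```

Sorting the two alleles is what guarantees that "T A" and "A T" are both
reported as "A/T".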

markers.sas7bdat
----------------

1 observation per SNP.

* All of the variables except for final_cg_snp_list and
final_sel_cand_genes were taken from a file we received from
Perlegen. These variables are documented in NIDA_IG_snp_info.doc.

* The variable final_cg_snp_list is binary (0/1) and indicates if the
SNP is on the final list of SNPs selected for candidate genes. This
information represents an update to the information found in the
variable sel_cand_genes. The variable final_sel_cand_genes
indicates which genes the SNP was selected for, except when the
value is "NIDA_CHOSEN" which indicates the SNP was selected for
other reasons.

data_cg.sas7bdat
----------------

Data for SNPs selected for candidate genes only according
to the variable 'final_cg_snp_list' in the markers dataset.
The format of this dataset is the same as data.sas7bdat.
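
As an illustration of how the candidate gene subset relates to the other two
datasets, the column selection can be sketched as below. This is only a sketch
under assumptions, not the program that built data_cg.sas7bdat: it assumes
each genotype variable is named "afd" followed by the snp_id value exactly as
stored, and that no phenotypic variable name starts with "afd". The function
name, the toy markers and the 'sex' column are hypothetical.

```python
def candidate_gene_columns(markers, data_columns, prefix="afd"):
    """Keep non-genotype columns plus genotype columns flagged in markers.

    markers: list of dicts with keys 'snp_id' and 'final_cg_snp_list' (0/1).
    data_columns: column names of the full marker dataset, in order.
    """
    selected = {
        prefix + str(m["snp_id"])
        for m in markers
        if m["final_cg_snp_list"] == 1
    }
    # Assumption: any column starting with the prefix is a genotype variable.
    return [c for c in data_columns if not c.startswith(prefix) or c in selected]


# Made-up toy inputs.
toy_markers = [
    {"snp_id": "00001", "final_cg_snp_list": 1},
    {"snp_id": "00002", "final_cg_snp_list": 0},
    {"snp_id": "00003", "final_cg_snp_list": 1},
]
toy_columns = ["ind_id", "afd00001", "afd00002", "afd00003", "sex"]

kept = candidate_gene_columns(toy_markers, toy_columns)
```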