***********************************************************************************;
* FILE:    README.txt
* AUTHOR:  Bill Howells
* DATE:    12/8/2014
* PURPOSE: document output of SAS datasets to CSV for NICSNP Drupal download page
***********************************************************************************;

NOTES:

1. Input files:
     data.sas7bdat:     NICSNP full marker data
     data_cg.sas7bdat:  NICSNP candidate gene data

2. Output files:
     data_cellid.csv:     NICSNP full marker data, id variable renamed to cell_id, same values
     data_cg_cellid.csv:  NICSNP candidate gene data, id variable renamed to cell_id, same values

3. The NICSNP candidate gene dataset data_cg.sas7bdat was output to CSV with
   the SAS export wizard.

4. Because it is very wide, the full marker dataset data.sas7bdat was output
   with a custom SAS export program:
     ~/nida/billh/nicotine/NICSNP/programs/NICSNP_drupal_output_marker_cell_id.sas

5. The only difference between the original files, e.g. data.sas7bdat, and the
   new files, e.g. data_cellid.csv, is that the first column was renamed from
   ind_id to cell_id for compatibility with the distribution file's column
   names. Confusingly, the values under ind_id in the original files were in
   fact cell line ids, identical to the values under cell_id in the
   distribution file. The rename makes it possible to merge the marker data
   with the distribution file data using a simple join such as:

     select *
     from data_cellid a
     left join nicotine_dist8_0 b
       on a.cell_id = b.cell_id

6. The files were then archived with the zip program and transferred securely
   to the Drupal virtual machine.

7. Original documentation follows.


Scott Saccone
3/16/06

data.sas7bdat
-------------

1 observation per individual. The variables afdNNNNN are the genotypes, where
NNNNN is the internal Perlegen ID, stored in the variable 'snp_id' in the
dataset 'markers.sas7bdat' described below.
* The genotypes were converted from "A T" format to "A/T", and I made sure
  these were reported alphabetically, so it would read "A/T" instead of "T/A".

* If an "N" occurred in the genotype this variable was set to missing ("").

* The remaining variables are phenotypic and were derived from the dataset
  perlegen_all_v4.sas7bdat which was downloaded from the WU NICSNP web site.

markers.sas7bdat
----------------

1 observation per SNP.

* All of the variables except for final_cg_snp_list and final_sel_cand_genes
  were taken from a file we received from Perlegen. These variables are
  documented in NIDA_IG_snp_info.doc.

* The variable final_cg_snp_list is binary (0/1) and indicates if the SNP is
  on the final list of SNPs selected for candidate genes. This information
  represents an update to the information found in the variable
  sel_cand_genes. The variable final_sel_cand_genes indicates which genes the
  SNP was selected for, except when the value is "NIDA_CHOSEN" which indicates
  the SNP was selected for other reasons.

data_cg.sas7bdat
----------------

Data for SNPs selected for candidate genes only according to the variable
'final_cg_snp_list' in the markers dataset. The format of this dataset is the
same as data.sas7bdat.
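
Because data_cg.sas7bdat shares the format of data.sas7bdat, the genotype
conventions described above for the afdNNNNN variables apply to both files.
The Python sketch below illustrates those conventions only; it is not the
original SAS conversion code. The function name normalize_genotype is
hypothetical, and the sketch assumes that a raw genotype is two allele codes
separated by whitespace. Treating any other shape of input as missing is
also an assumption.

```python
def normalize_genotype(raw):
    """Illustrative only: convert a raw genotype such as "T A" to "A/T".

    Assumes two whitespace-separated allele codes (an assumption, not
    taken from the original conversion program).
    """
    alleles = raw.split()
    # Assumption: anything that is not exactly two alleles is treated as missing.
    if len(alleles) != 2:
        return ""
    # An "N" allele makes the whole genotype missing ("").
    if "N" in alleles:
        return ""
    # Report alleles alphabetically so "T A" and "A T" both become "A/T".
    return "/".join(sorted(alleles))


if __name__ == "__main__":
    for raw in ["A T", "T A", "G G", "N A"]:
        print(repr(raw), "->", repr(normalize_genotype(raw)))
```

A check of this kind may be useful when reading data_cellid.csv or
data_cg_cellid.csv outside SAS, for example to confirm that no genotype
column contains a value such as "T/A".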