Epilogue

With careful planning and implementation, we were able to furnish a timely analysis of our large project. Our primary interest here is not a detailed account of the hardware and timings for the project but the approach we implemented. We have focused on single-point analysis; the problem can become more complex when multipoint analysis is involved.

Our approach builds on our considerable experience with computer systems over the years. Given that the SAS system is widely available, we expect our work to be welcome. We also noted some caveats associated with SAS. While the system was designed to handle large data, the implementation of its procedures is heterogeneous. For instance, we found that, in general, PROC SQL works better than the DATA step and PROC PRINT and can perform more sophisticated data management tasks. We do not necessarily need to segment the data in order to use PROC MEANS and PROC FREQ. On the other hand, the SAS/GENETICS procedures often run into memory problems, but fortunately SAS offers many alternative ways to do the same task. As the usage of disk space is quite heavy, it would be useful if the SAS/GENETICS procedures could read phenotype data separately. Over the years the SAS language has been enriched yet has remained relatively stable, and its powerful macro facility is an additional advantage over many other software systems. We have yet to extend our work to other available types of analysis, e.g., principal component analysis and the partial least squares method for structured association, cluster analysis for the study of relatedness and outliers, and covariance structure modelling for pathway analysis, to name just a few.

The software we developed is largely applicable to users with moderate resources, on both Linux and Windows systems. We have also started, and will continue, our experiments with other software systems, including Stata (http://www.stata.com), R (http://www.r-project.org), standalone programs such as SNPGWA (http://www.phs.wfubmc.edu/web/public_bios/sec_gene/downloads.cfm), and commercial software such as BC/SNPmax (http://www.bcplatforms.com). In general, grid computing using clusters is now the state-of-the-art alternative to supercomputers; it is well documented and will be part of our next experiment. SAS, Stata, and R all have facilities to support multiple processors. They are also generic and not specific to genetic data. An additional feature common to these systems is support for open database connectivity, which allows for synchronous access to databases. A comparison of these systems, however, would be more appropriate in a separate contribution.