• Analysis of the pattern and trend of human genomic variations in the form of single nucleotide polymorphisms (SNPs) and small insertions and deletions (INDELs)

      Chundi, Vinay Kumar; Department of Biological Sciences
      Single nucleotide polymorphisms (SNPs) and small insertions/deletions (INDELs) are the most common genetic variations in the human genome, and they have been shown to be associated with phenotypic variation, including genetic disease. Based on data in a recent version of the NCBI dbSNP database (Build 150), there are 305,651,992 SNPs and 19,177,943 INDELs; together, as small sequence variants, they cover approximately 11% of the human reference genome sequence. In this study, we first aimed to examine the characteristics of SNPs and INDELs by location and variation type. We then identified the ancestral alleles for these variants and examined the patterns of variation from the ancestral state. Our results show that small variants occur at an average density of 104 SNPs/kb and 6.5 INDELs/kb, for a total of ~11% of the genome. Chromosomes 16 and 21 are the least and most conserved autosomes, respectively, while the sex chromosomes have a much lower density of SNPs and INDELs: more than 30% lower on the X chromosome and more than 85% lower on the Y chromosome. By gene context, SNPs are biased towards genic regions and INDELs towards intergenic regions. Within genic regions, INDELs are biased towards protein-coding genes and introns, whereas SNPs are biased towards non-coding genes. Within coding regions, SNPs and INDELs are biased towards missense and frameshift variations, respectively. Some of these biases reflect variation data sources that preferentially target genic regions, while the bias towards intron regions is due to selection pressure. Furthermore, genes with the highest level of variation are enriched in functions related to environmental sensing and immune responses, while those with the least variation are associated with critical processes such as mRNA splicing and processing.
Through a comparative genomics approach, we determined the ancestral state for most of these variants, and our results indicate that ~0.79% of the genome has been subject to SNP and INDEL variation since the last common human ancestor. Our study represents the first comprehensive data analysis of human variation in SNPs and INDELs and the determination of their ancestral state, providing useful resources for human genetics studies and new insights into human evolution.
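
The density figures quoted above can be related to the raw dbSNP counts with simple arithmetic. The following is a minimal sketch, not the method used in the study: the genome length `ASSUMED_GENOME_BP` is a hypothetical round figure chosen for illustration (the abstract does not state the denominator), and the fraction calculation assumes each variant occupies a single reference position, which understates the span of multi-base INDELs.

```python
# Variant counts as reported for dbSNP Build 150 in the abstract.
SNP_COUNT = 305_651_992
INDEL_COUNT = 19_177_943

# ASSUMPTION: an illustrative genome length in base pairs; the abstract
# does not give the denominator actually used in the study.
ASSUMED_GENOME_BP = 2_940_000_000


def density_per_kb(count: int, genome_bp: int) -> float:
    """Return the average number of variants per 1,000 bp."""
    return count / genome_bp * 1000


snp_density = density_per_kb(SNP_COUNT, ASSUMED_GENOME_BP)
indel_density = density_per_kb(INDEL_COUNT, ASSUMED_GENOME_BP)

# ASSUMPTION: one reference position per variant.
variant_fraction = (SNP_COUNT + INDEL_COUNT) / ASSUMED_GENOME_BP

print(f"SNPs per kb:   {snp_density:.1f}")
print(f"INDELs per kb: {indel_density:.1f}")
print(f"Variant share: {variant_fraction:.1%}")
```

Under these assumptions the computed densities round to the reported 104 SNPs/kb and 6.5 INDELs/kb, and their sum corresponds to roughly 11% of positions.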