Quantifying Genetic Diversity
Our approach to quantifying the genetic diversity associated with a catalog of genes
relies on genetic "vectors" which are assembled per gene from the genotypes of 4-8
polymorphic SNPs located within the genes under investigation. Within the scope of
this study we used 100 specifically selected genes that had previously been hypothesized
to be relevant in the context of psychiatric disorders. A gene with "m" SNPs
yields a 2**m-dimensional genetic vector, so the length of the vectors
can vary from gene to gene. The maximum possible number of genotypes for a gene with
"m" SNPs is then 4**m. However, because SNPs located within a gene are in many cases
strongly correlated, the actual number of different genotypes observed in the
population of interest is much smaller. It depends on the particular gene, on the
SNPs chosen to make up the respective genetic vector, as well as on the population
studied (number of observations, biological ethnicity). When used to resolve subtle
differences in population structure, a gene with a large number of observable
genotypes is more informative than a gene with just a few genotypes. In other words,
variation means information. The genotypic diversity in our study was found to be
astronomically large (on the order of 100**100), so that it was not at all straightforward
to establish the anticipated link with psychiatric disorders.
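The gap between the theoretical maximum and the number of genotypes actually observed can be illustrated with a minimal sketch. It assumes that each individual's genotype for one gene is stored as a tuple of per-SNP genotype codes; the function name `genotype_diversity`, the string coding, and the toy data are hypothetical and not taken from the study.

```python
from collections import Counter


def genotype_diversity(samples):
    """Count the distinct multi-SNP genotypes observed for one gene.

    `samples` holds one tuple per individual; each tuple contains the
    genotype codes of the m SNPs selected for the gene (assumed coding).
    Returns (observed count, upper bound 4**m, frequency table).
    """
    if not samples:
        return 0, 0, Counter()
    m = len(samples[0])          # number of SNPs in this gene's vector
    counts = Counter(samples)    # frequency of each observed genotype
    return len(counts), 4 ** m, counts


# Hypothetical toy data: 5 individuals, one gene with m = 3 SNPs.
toy_gene = [
    ("AA", "CT", "GG"),
    ("AA", "CT", "GG"),
    ("AG", "CC", "GG"),
    ("GG", "CC", "GT"),
    ("AA", "CT", "GG"),
]

observed, upper_bound, counts = genotype_diversity(toy_gene)
print(f"observed genotypes: {observed} of at most {upper_bound}")
```

In this toy example only a small fraction of the possible genotypes occurs, which mirrors the effect of correlated SNPs described above.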
Learning to Recognize
Once a set of genetic vectors is available for sufficiently representative samples of
the populations under investigation, methods of Artificial Intelligence (AI) can be used
to detect genotype patterns that are unique to a population and contribute to
discrimination between populations ("supervised learning"). Likewise, the same
methodological framework can be used to develop a model of biological ethnicity
("unsupervised learning"). There is a critically important caveat: the genetic
vector method is very sensitive to missing data in the SNPs, because missing genotypes
cause the "noise level" to increase unacceptably beyond a certain point.
Normative Data
When comparing populations in terms of genetic diversity, it is essential that the results
are corrected for any differences in sample size. We created the prerequisite for such
corrections by analyzing our total sample (n=1,698) with respect to genetic diversity
using 32-fold repeated random sampling for subsamples of size 50 to 1,500 in steps of 50.
The following table shows the expected values of genetic diversity for 10 genes
and sample sizes ranging from 100 to 1,000. Because the underlying functions are
well behaved, extrapolation is possible for sample sizes beyond n=1,698.
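The resampling scheme described above can be sketched as follows. This is a minimal illustration, not the procedure actually used: it assumes that diversity is measured as the number of distinct genotypes in a subsample and that subsamples are drawn without replacement. The function name `expected_diversity`, the seeds, and the synthetic population are hypothetical.

```python
import random
from statistics import mean


def expected_diversity(genotypes, sizes, repeats=32, seed=0):
    """Estimate the expected number of distinct genotypes per sample size.

    For every size n, `repeats` random subsamples are drawn without
    replacement (an assumption) and the distinct-genotype counts are averaged.
    """
    rng = random.Random(seed)
    curve = {}
    for n in sizes:
        if n > len(genotypes):
            continue  # cannot draw more individuals than are available
        draws = [len(set(rng.sample(genotypes, n))) for _ in range(repeats)]
        curve[n] = mean(draws)
    return curve


# Synthetic stand-in for one gene in a total sample of n = 1,698:
# 40 hypothetical genotypes with strongly unequal frequencies.
data_rng = random.Random(1)
pool = [f"g{i}" for i in range(40)]
weights = [1 / (i + 1) for i in range(40)]
population = data_rng.choices(pool, weights=weights, k=1698)

curve = expected_diversity(population, range(50, 1501, 50))
for n in (100, 500, 1000):
    print(n, round(curve[n], 1))
```

The resulting curve maps each subsample size to an expected diversity value, which is the kind of reference needed to compare populations of unequal size.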