In human genomics, a Copy Number Variant (CNV) is a segment of the genome in which more or less DNA is present than in a reference genome that is considered healthy. If there is more DNA than in the reference, the CNV is called a duplication; if less, a deletion.
It has been shown that many genetic disorders, especially those related to intellectual disability, can be explained by deletions and duplications of relatively large segments of the human genome.
Detecting CNVs has always been an important task in research and clinical genetics, given the role duplications and deletions can play in specific conditions.
With the advent of Next Generation Sequencing (NGS), CNVs can be detected by counting the number of reads in a specific area of the genome and comparing that count to the healthy reference, similarly to what has been done with array technology so far.
The NGS approach adds new issues to the task of detecting CNVs: biological noise (usually attributed to variable GC content) and the depth of sequencing, that is, the number of reads used to sequence the genome of one individual.
A higher number of reads (deep sequencing) gives more reliable results, while shallow sequencing is less reliable but more affordable for the final user.
To make NGS widely adoptable in the clinic, the depth of coverage has been kept to a minimum. Many tests, such as Non-invasive Prenatal Testing (NIPT), in which NGS is used to detect chromosomal aberrations in large chunks of the genome (e.g. trisomy 13, 18, 21), limit the number of reads to 8-10 million in order to keep the test affordable for the majority of the population.
It goes without saying that fewer reads can lead to gaps in the sequencing that are extremely hard to distinguish from deletions. As a matter of fact, an apparent deletion could be real (lack of DNA with respect to the reference) or just the result of too few reads in that specific region.
Bayesian statistics might help in such conditions.
Here is a method I started working on a while ago.
If we assume that reads are generated (accumulate at a specific genomic location) by a Poisson process, then the rate of a Poisson distribution can be estimated from the number of reads in a window.
Hence a fixed-size window slides across the genomic region to analyse, and the rate of the Poisson distribution is estimated at each step. In fact, two rates are estimated: one for the case and one for the control (the reference against which we are comparing).
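This generative assumption is easy to simulate. The sketch below (my own illustration, not the code from the repository; the rates and coordinates are made up) draws per-position read counts from a Poisson distribution, with one region of the case genome given half the baseline rate to mimic a deletion:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: 10,000 genomic positions, a baseline rate of
# 5 reads per position, and a deletion (halved rate) in one region.
genome_length = 10_000
base_rate = 5.0
rates = np.full(genome_length, base_rate)
rates[4_000:5_000] = base_rate / 2  # simulated deletion

# Under the Poisson assumption, the read count at each position is an
# independent Poisson draw with that position's rate.
reference = rng.poisson(base_rate, size=genome_length)
case = rng.poisson(rates)
```

The counts inside the deleted region then hover around half the baseline, which is exactly the signal the sliding window tries to pick up.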
If there is a consistent difference between the two rates, a structural variant is called (a duplication or a deletion according to the sign of the difference); otherwise both samples are considered normal.
Determining whether the difference between the two rates is large enough to consider the samples different requires another statistical test, or the threshold can be empirically estimated from known samples.
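A minimal sketch of this windowed comparison might look as follows. The function names and the threshold value are my own placeholders, not the repository's API; in practice the threshold would be tuned on known samples, as noted above:

```python
import numpy as np

def window_rates(counts, window):
    # The maximum-likelihood Poisson rate in each window is the mean count.
    n = len(counts) // window
    return counts[: n * window].reshape(n, window).mean(axis=1)

def call_cnvs(case, reference, window=100, threshold=1.0):
    """Flag windows whose rate difference exceeds an empirical threshold.

    Returns +1 (duplication), -1 (deletion) or 0 (normal) per window.
    """
    diff = window_rates(case, window) - window_rates(reference, window)
    calls = np.zeros(len(diff), dtype=int)
    calls[diff > threshold] = 1    # more reads in the case: duplication
    calls[diff < -threshold] = -1  # fewer reads in the case: deletion
    return calls

# Demo on synthetic data with a deletion (rate 5 halved to 2.5)
rng = np.random.default_rng(0)
reference = rng.poisson(5.0, 10_000)
case = rng.poisson(5.0, 10_000)
case[4_000:5_000] = rng.poisson(2.5, 1_000)
calls = call_cnvs(case, reference)
```

With a window of 100 positions, the windows covering the simulated deletion come out as -1 while the rest stay at 0 almost everywhere.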
The Python code that generates synthetic CNV data and detects duplications and deletions is provided on GitHub. A screenshot is provided below.
In the first two rows, reads are generated across the genomes of the reference and the case. The third row shows the ground truth. The two estimated rates are shown in the fourth row, and the difference of the rates in the last row.
The difference of the rates can also be estimated as a posterior distribution directly from the data. For reasons that are still unclear, this does not seem to perform well.
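One standard way to get that posterior, sketched below under my own assumptions (a Gamma prior with made-up hyperparameters, not necessarily the setup that underperformed), exploits Gamma-Poisson conjugacy: with a Gamma(alpha, beta) prior, the posterior of a Poisson rate given n counts summing to s is Gamma(alpha + s, beta + n), and the difference of two such rates can be approximated by Monte Carlo:

```python
import numpy as np

def rate_difference_posterior(case_counts, ref_counts,
                              alpha=1.0, beta=1.0,
                              draws=10_000, rng=None):
    """Sample the posterior of (case rate - reference rate).

    With a Gamma(alpha, beta) prior, the posterior of a Poisson rate
    given n observations summing to s is Gamma(alpha + s, beta + n),
    in the shape/rate parameterisation.
    """
    rng = rng or np.random.default_rng()
    # numpy's gamma sampler takes scale = 1 / rate
    case_rate = rng.gamma(alpha + np.sum(case_counts),
                          1.0 / (beta + len(case_counts)), size=draws)
    ref_rate = rng.gamma(alpha + np.sum(ref_counts),
                         1.0 / (beta + len(ref_counts)), size=draws)
    return case_rate - ref_rate

# Demo: a window where the case has roughly half the coverage
rng = np.random.default_rng(1)
diff = rate_difference_posterior(rng.poisson(2.5, 200),
                                 rng.poisson(5.0, 200), rng=rng)
p_deletion = (diff < 0).mean()  # posterior probability of a deletion
```

The fraction of posterior samples below zero gives a direct probability statement about a deletion, which is the appeal of the Bayesian route even if, as noted, it did not perform well here.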