Detecting Copy Number Variants with Bayesian statistics

In the field of human genomics a Copy Number Variant (CNV) is a segment of the genome in which more or less DNA is present with respect to a reference, considered healthy. If there is more DNA with respect to the reference, the CNV is referred to as duplication, otherwise as deletion.
It has been shown that many genetic disorders, especially those related to intellectual disability can be explained with deletions and duplications in relatively large segments of the human genome.

Detecting CNVs has always been an important task in research and clinical genetics due to the importance that duplications and deletions might have in specific cases.
With the advent of Next Generation Sequencing, NGS, it is possible to detect CNVs by counting the number of reads in a specific area of the genome and compare those to the healthy reference, similarly to what has been performed with array technology so far.

Many issues have been added to the task of detecting CNVs. Namely, biological noise (usually referred to as variable GC content), and depth of sequencing, that is the minimal number of reads used to sequence the genome of one individual.
While a higher number of reads leads to deep sequencing and more reliable results, shallow sequencing is less reliable but more affordable in terms of costs to the final user.
In order to widely adopt NGS for the clinics, the depth of coverage has been kept to a minimum. Many tests, such as Non-invasive Prenatal Testing, NIPT, in which NGS technology is used to detect chromosomal aberration of large chunks of the genome (eg. trisomy 13, 18, 21), limit the number of reads to 8-10 millions, in order to make a test affordable to the majority of the population.
It comes without saying that less reads can lead to the formation of gaps in the sequencing, or deletions which might be extremely hard to detect as such. As a matter of fact, a deletion could be real (lack of DNA with respect to reference) or just the result of not enough reads in that specific region.

Bayesian statistics might help in such conditions.
Here is a method I started working on a while ago.
If we assume that reads are generated (accumulated at a specific genomic location) by a Poisson process, it should be possible to estimate the rate of a Poisson distribution by the number of reads in a window.
Hence a fixed-size window slides across the genomic region to analyse, and the rate of the Poisson distribution is estimated each time. In fact, two rates are estimated, one for the case and one for the control (the reference against which we are comparing).
If there is consistent difference between the two rates, a structural variant is detected (duplication or deletion according to the sign of the difference), otherwise the two samples are both considered normal.

Determining if the difference of two rates is high enough to consider the samples as different, is another statistical test, or can be empirically estimated from known samples.

The Python code that generates synthetic CNV data and detects duplications and deletions is provided on github. A screenshot is provided below.




Screenshot of BayesCBS, a Bayesian CNV detection software.


In the first two rows, reads are generated across the genomes of the reference and the case. In the third row the truth is provided. The two estimated rates are provided in the forth row. The difference of the rates is provided in the last row.

The difference of the rates can be also estimated as posterior distribution directly from the data. For reasons that are still unclear, that does not seem to perform well.

More on this later.


Leaving Academia, joining Industry. Here is my story


I’ve been into academia for a long time now. Maybe too long.
I studied Computer Engineering at Politecnico di Milano, in Italy. After the master’s degree I landed in the world of academic research and started my PhD at KULeuven, Belgium. It took me a bit less than five years to complete the full program with the intentions to move to industry afterwards. At the time I was working on virtualization technology, the basic block of cloud computing. But an offer “I could not reject” made me accept a position as Postdoctoral Researcher (which I never understood if it should be capitalized or not) that was meant to last two years plus (the usual) renewal. Continue reading

Do not call it agile programming

Agile programming

I have been involved in multidisciplinary projects in human genetics for a number of years, since the end of my PhD. As a data scientist and scientific computing expert, I have been working with people with my own expertise and, even more often with those who have almost no computing skills. For the record, people who can happily nest a number of for loops to crunch datasets of several gigabytes and who can keep demanding for new hardware whenever they realize their programs are taking ages to complete.

In almost all the scenarios I have been involved, there is The Guy who has a hypothesis, and a bunch of other guys arguing about it, demolishing it if needed, and writing the code that eventually places that hypothesis within real world or dreamland.

A number of steps are usually required during this process:
1) a brief description of the problem – that, if done correctly, solves half of it – followed by 2) an explanation of the data at hand. This usually suggests the initial steps of the analytic pipeline, namely data cleansing, wrangling and normalization.

The real analytic phase follows the initial one with a discussion of the strategy to represent the data and the specific algorithm to solve the problem. This can be very specific to the research questions one tries to answer, and allows to choose between clustering or decision trees or neural networks or SVMs or any other algorithm from the machine learning shelf.

One common part to any group I have been working with, regardless of the problem to solve and the algorithm to use, is The Guy’s feedback.
Ideally, data scientists, data engineers and developers work in a highly connected environment in which they give feedback to each other and take into consideration any correction, very promptly.
However, in many realistic cases, The Guy is the only actor in charge of determining the pace of the analysis, giving feedback to the rest of the team, sometimes with a delay.
This hypothetical (un)structured team is doomed to fail, giving rise to code that must be reviewed very frequently, that is not engineered, quite unreadable, stuffed with bugs and definitely not well structured.

Some people refer to all this as agile programming, and apparently they look very happy about it.
I have experienced something more interesting, that I call acrobatic programming. And frankly, I am not happy at all.


My Python code repository on Github