Aim
Detection of somatic variants in next generation sequencing (NGS) data is important for cancer research. While many dedicated somatic variant calling algorithms are available, comparisons between these callers have shown significant discrepancies in the variants detected, so extensive and expensive validation may be required to exclude false positives. Confidence in variant detection can be improved by using multiple callers, but this requires significantly longer processing time. We aimed to improve the accuracy of somatic variant calling, and to limit the need for prolonged processing, by using optimised filters for variant calling algorithms.
Method
Whole exome sequencing data from 10 matched tumour/normal samples from chronic myeloid leukaemia patients were analysed using 7 published somatic variant callers. The individual components of each caller (pre-processing read filters, statistical model, and post-processing site filters) were assessed for their effectiveness. Optimised filter sets were then applied to single-caller results to improve the confidence of variant calling.
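As an illustration of how agreement between callers can be measured, the following minimal sketch counts how many callers report each variant. It is not the pipeline used in this study: the function name `caller_support`, the caller labels, and the representation of a variant as a `(chromosome, position, ref, alt)` tuple are all assumptions made for the example.

```python
from collections import Counter

def caller_support(calls_by_caller):
    """Count, for each variant, how many callers reported it.

    calls_by_caller maps a caller name to an iterable of variant keys.
    The (chromosome, position, ref, alt) key format is an assumption.
    """
    support = Counter()
    for variants in calls_by_caller.values():
        # set() ensures a caller is counted at most once per variant
        support.update(set(variants))
    return support

# Hypothetical calls from three callers for one tumour/normal sample
calls = {
    "caller_A": [("chr9", 1000, "A", "T"), ("chr22", 2000, "G", "C")],
    "caller_B": [("chr9", 1000, "A", "T")],
    "caller_C": [("chr9", 1000, "A", "T"), ("chr9", 1000, "A", "T")],
}
support = caller_support(calls)
for variant, n_callers in sorted(support.items()):
    print(variant, n_callers)
```

The resulting per-variant counts are the quantity on which a consensus between callers can be assessed.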
Results
A total of 39936 variants were detected in the 10 samples, but only 443 variants were called by 6 or 7 callers (>95% validation rate; High Confidence, HC, variants), and the vast majority (39069, 98%) were called by 3 or fewer callers (<1% validation rate; Low Confidence, LC, variants). Applying our filtering method at the low stringency setting, we were able to remove most of the LC variants (down to 2378) while retaining most of the HC variants (431). With filtering at the high stringency setting, only 81 (0.2%) LC variants remained, while 409 (92%) of the HC variants were retained.
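The confidence tiers and the retention figures above can be reproduced with a short sketch. The thresholds and counts are taken from the results reported here; the function names `confidence_tier` and `percent`, and the label "intermediate" for variants called by 4 or 5 callers, are assumptions introduced for illustration.

```python
def confidence_tier(n_callers):
    """Assign a tier from the number of callers (out of 7) reporting a variant."""
    if n_callers >= 6:
        return "HC"            # High Confidence: called by 6 or 7 callers
    if n_callers <= 3:
        return "LC"            # Low Confidence: called by 3 or fewer callers
    return "intermediate"      # 4 or 5 callers; label is an assumption

def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole

# Counts as reported in the Results
total_variants = 39936
hc_total, lc_total = 443, 39069

lc_share = percent(lc_total, total_variants)   # share of all variants that are LC
hc_kept_low = percent(431, hc_total)           # HC retained, low stringency
lc_kept_low = percent(2378, lc_total)          # LC remaining, low stringency
hc_kept_high = percent(409, hc_total)          # HC retained, high stringency
lc_kept_high = percent(81, lc_total)           # LC remaining, high stringency

print(f"LC share of all variants: {lc_share:.0f}%")
print(f"Low stringency:  HC kept {hc_kept_low:.0f}%, LC left {lc_kept_low:.1f}%")
print(f"High stringency: HC kept {hc_kept_high:.0f}%, LC left {lc_kept_high:.1f}%")
```

Reporting both tiers side by side makes the trade-off between the two stringency settings explicit: the high stringency setting removes nearly all LC variants at the cost of a modest loss of HC variants.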
Conclusion
Through systematic analysis and optimisation of filters, we have demonstrated significantly improved accuracy of single-caller somatic variant detection as well as overall consensus between callers. Application of appropriate filters to a limited number of callers will reduce the requirement for extensive validation and long data processing time in cancer research projects involving NGS data.