Differences Between the JCVI Aug 2011 and sift-dna.org Nov 2011 Databases
Pauline Ng, January 2012
My group released previous versions of the SIFT genome databases at JCVI (November 2009 and earlier). I moved from JCVI to GIS in 2010.
In Aug 2011, JCVI independently released a SIFT database (SIFT 4.0.3b).
My assessment of the JCVI Aug 2011 (4.0.3b) database:
- Prediction accuracies are similar between the JCVI and sift-dna.org databases.
- JCVI has lower coverage due to Ensembl-only gene annotation.
- Some bugs remain unfixed in the JCVI database. These bugs were fixed in the sift-dna.org database:
  - dbSNP annotations are incorrect, which can lead to missing disease mutations. (This is important if you are looking at rare diseases or rare variants.)
  - chrY predictions are missing in the JCVI database.
  - Predictions on stops are incorrect.
If you would still prefer to use JCVI's databases rather than our more recent versions of SIFT, this is what I'd recommend:
- Do not pull SIFT predictions directly from the database; use the SIFT_exome_nssnvs.pl script in the SIFT package instead. (This masks the stop bug.)
- Do NOT use the dbSNP annotation from the SIFT databases (you may miss your disease mutation). If you do, write additional scripts to check the alleles.
- Also, please note the lack of chrY predictions.
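An allele check of the kind recommended above might look like the following. This is only a minimal sketch and is not part of the SIFT package: the function name `rsid_matches_variant` and its inputs (the variant's reference and alternate bases, plus the alleles reported for the rsID, obtained from dbSNP itself) are hypothetical, and strand handling is reduced to a simple complement check.

```python
# Hypothetical helper: decide whether a dbSNP rsID attached to a variant
# really describes that variant, by comparing alleles on both strands.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def rsid_matches_variant(ref, alt, dbsnp_alleles):
    """Return True if both ref and alt appear among the dbSNP alleles,
    either as given or after complementing every allele."""
    observed = {ref.upper(), alt.upper()}
    forward = {a.upper() for a in dbsnp_alleles}
    # Only single-base alleles can be complemented here; others are skipped.
    reverse = {COMPLEMENT[a] for a in forward if a in COMPLEMENT}
    return observed <= forward or observed <= reverse


# Example: an A>C variant annotated with an rsID whose alleles are A/G
# should not inherit that rsID.
print(rsid_matches_variant("A", "C", ["A", "G"]))
```

Note that A/T and C/G variants are strand-ambiguous, so a check like this cannot tell the two strands apart for them and such cases may need manual review.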
SIFT Predictions on Amino Acid Substitutions
Predictions from the JCVI Aug 2011 database had the expected accuracy (based on the HumDiv and HumVar datasets). Performance of the JCVI database on HumDiv and HumVar was similar to that of the sift-dna Nov 2011 database.
Scores will differ between the JCVI and sift-dna databases because 1) scores are highly dependent on the protein sequence database used to retrieve homologous sequences, and different protein databases were used, and 2) different SIFT versions were used. 1) is the major factor; 2) is a minor factor, since SIFT 4.0.4 used a newer version of BLAST but the core algorithm did not change.
Scores are not expected to correlate between the two databases because scores are scaled probabilities, so a score of 0.25 has the same interpretation as a score of 0.75 (both are above the 0.05 cutoff and are predicted tolerated). Therefore, correlation is not a relevant metric.
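One way to compare the two databases without relying on correlation is to compare the calls rather than the raw scores. The sketch below is illustrative only: the function names `call` and `concordance` are hypothetical, the paired scores are made-up inputs, and the 0.05 cutoff is the conventional SIFT threshold passed in as a parameter.

```python
# Hypothetical sketch: measure agreement between two databases on the
# prediction (deleterious vs. tolerated) instead of on the raw score.

def call(score, cutoff=0.05):
    """Turn a SIFT score into a categorical prediction."""
    return "deleterious" if score <= cutoff else "tolerated"


def concordance(pairs, cutoff=0.05):
    """Fraction of (score_a, score_b) pairs that receive the same call."""
    same = sum(call(a, cutoff) == call(b, cutoff) for a, b in pairs)
    return same / len(pairs)


# Made-up example scores for the same three substitutions in two databases.
pairs = [(0.25, 0.75), (0.01, 0.03), (0.02, 0.40)]
print(concordance(pairs))
```

In this toy input the first pair differs widely in score yet yields the same call, which is the situation described above.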
sift-dna.org has higher coverage because it includes RefSeq, Ensembl, and CCDS predictions. The figure below shows that sift-dna.org has annotation for an additional 1.95 million missense positions and that chrY predictions are missing from JCVI's database.
*The JCVI Aug 2011 database contains predictions based on <= 3 sequences, which are low-confidence. These were not included in the sift-dna Nov 2011 database, which is why chr2 and chr19 appear higher in JCVI Aug 2011.
Bug for Stop Variants
There was a bug in previous versions of the SIFT databases where SIFT predictions for stop codons were included. These predictions are incorrect and are artifacts of one of my programs. The bug was masked if you used the standalone script that pulls out SIFT predictions (SIFT_exome_nssnvs.pl). However, if you pull the predictions directly from the database, do NOT use predictions on stops, as they are meaningless. In our most recent database, we have removed the incorrect SIFT predictions for stop codons so there will be no confusion. This was not corrected in JCVI's August 2011 database.
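If you do read predictions directly from an affected database, a filter along these lines would discard the stop-codon entries. This is a sketch under stated assumptions, not the logic of SIFT_exome_nssnvs.pl: the row layout (dictionaries with `ref_aa`, `alt_aa`, and `score` keys) and the use of `*` to denote a stop are hypothetical and must be adapted to the actual database schema.

```python
# Hypothetical row layout: one dict per prediction with the reference and
# substituted amino acids and the SIFT score. "*" is assumed to mark a stop.
STOP = "*"


def drop_stop_predictions(rows):
    """Keep only rows where neither the reference nor the new residue is a stop."""
    return [r for r in rows if STOP not in (r["ref_aa"], r["alt_aa"])]


rows = [
    {"ref_aa": "R", "alt_aa": "Q", "score": 0.02},
    {"ref_aa": "R", "alt_aa": "*", "score": 0.00},  # stop gained: meaningless score
    {"ref_aa": "*", "alt_aa": "W", "score": 0.31},  # stop lost: meaningless score
]
kept = drop_stop_predictions(rows)
print(len(kept))
```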