Tuesday, August 21, 2007 - 01:25
If you've read the previous posts on this topic, here and here, you're probably aware by now that I have this weird (okay, maybe fanatical) obsession with data. Or at least, with knowing if my data are right so I can get on with life, do the analysis and figure out the results. My results from last week suggested that re-processing chromatogram data (from the ABI 3730) with phred was probably a bad idea, but still, I only had one data point and I really wanted to know if anyone had done a more thorough study and compared larger numbers of chromatograms. Naturally, someone had.
And of course it was ABI. The results aren't even new (except to me, I guess). ABI and their collaborators at the Washington University and Baylor College of Medicine genome centers presented this work in a poster at the Advances in Genome Biology and Technology (AGBT) meeting at Marco Island in 2004 (1). They looked at basecalling performance with data from 20,000 chromatograms and concluded that:
1. KB produced fewer errors.
2. KB was able to call more bases, which resulted in longer reads.
These box and whisker plots show the results for three sets of chromatograms: those basecalled with the KB basecaller (on top, in blue), those from ABI instruments (without KB) that were re-processed by phred (in the middle, in red), and those that were first processed with KB and then with phred (on the bottom, in green). That last one is the method I used the other day with my one chromatogram. In each case, they compared the read sequences with a reference sequence in order to determine the error rate.

(What is a read? A read is a DNA sequence that's been obtained from a chromatogram file. The chromatogram file has lots of extra information, like the kind of matrix, the run time, the name of the base calling program, the peak heights, etc. A read only contains the sequence of bases: ATAGAGCTCATCGATCATCTACGTA... etc.)

We can evaluate reads in a few ways.
- We can look at the number of high quality bases (Q20, Q30, Q40).
- We can look at the length of the read after trimming off the bad stuff.
- And, we can compare the read to a known sequence and count the number of differences.
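For anyone who wants to try these three checks on their own reads, here's a rough sketch in Python. It assumes you already have the base calls as a string and the quality values as a list of integers on the usual phred scale (Q = -10 log10 of the error probability, so Q20 means a 1 in 100 chance that the base is wrong). The function names and the toy data are mine, not from the poster. The trimming rule (keep the longest unbroken run of bases at or above a cutoff) is a deliberately simple stand-in for what real trimming programs do, and the comparison assumes the read has already been aligned to the reference with no gaps.

```python
from typing import Dict, List, Sequence


def count_high_quality(quals: Sequence[int],
                       thresholds: Sequence[int] = (20, 30, 40)) -> Dict[int, int]:
    """Count the bases whose quality value is at or above each threshold."""
    return {t: sum(1 for q in quals if q >= t) for t in thresholds}


def trimmed_length(quals: Sequence[int], min_q: int = 20) -> int:
    """Length of the longest contiguous run of bases with quality >= min_q.

    This is a simplified trimming rule for illustration only.
    """
    best = 0
    run = 0
    for q in quals:
        run = run + 1 if q >= min_q else 0
        best = max(best, run)
    return best


def count_mismatches(read: str, reference: str) -> int:
    """Count position-by-position differences between a read and a reference.

    Assumes both sequences are already aligned and the same length
    (no insertions or deletions).
    """
    if len(read) != len(reference):
        raise ValueError("read and reference must be the same length")
    return sum(1 for a, b in zip(read, reference) if a != b)


if __name__ == "__main__":
    # Toy data, invented for this example.
    read = "ATAGAGCTCA"
    reference = "ATAGAGCTGA"
    quals: List[int] = [8, 12, 25, 32, 41, 45, 38, 30, 22, 9]

    print(count_high_quality(quals))
    print(trimmed_length(quals))
    print(count_mismatches(read, reference))
```

With real data, you'd want a proper alignment step before counting differences, since a single insertion or deletion would throw off every position after it.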
"...since phred replaces (and ignores) the initial called sequence, re-processing KB-analyzed samples with phred will, on average, degrade the accuracy of the analysis in terms of actual sequence error. Analysis improvements provided by KB algorithm outlined above will be essentially lost."

There you have it, the end of this read and this sequence of posts at the same time. Time to move on to the next generation.

Reference:
1. Gehman, C. et al. 2004. "Longer Reads with the KB Basecaller." AGBT 2004.
2. Applied Biosystems User Bulletin, FAQ KB Basecaller v1.2.