Supplemental information to provide detailed information on the data sets and experiments used to compare different strategies of electronic marker addition, using the search tools MegaBLAST (Zhang et. al., 2002), BLAST (Altschul et. al., 1997) and BLAT (Kent, 2001).
To add markers to a map, one uses an alignment program to return hits of the marker sequences against sequence associated to the clones in the map (this can be draft, finished, or bac-end sequence (BES)). Two primary decisions must be made: which program to use, and how to rank the resulting hits. These are discussed in sections I and II below, for a target of draft or finished sequence. BES targets are covered in section III.
In order to compare different programs and hit-ranking approaches, we have used rice markers provided by the Japanese Rice Genome Research Project(JGRP). The set consisted of 3444 ESTs from 2191 cDNAs, and 338 STSs with 263 unique ids. The average EST sequence length was 421, while the STS average was 325.
Our target database of clone sequences was the full set of draft sequence for rice, as of Feb. 2003, consisting of 11126 sequenced contigs from 3497 clones. These were downloaded from Genbank with the query
Oryza [ORGN] AND (30000[SLEN]:250000 [SLEN]) AND ((htg [KYWD] OR BAC [ALL] OR chromosome [TITL] OR PAC [ALL]) NOT (marker [TITL] OR EST [TITL] OR mRNA [TITL] OR RAPD [TITL] OR GSS [KYWD] OR telomere [TITL] OR protein[TITL]))(this query was created by L. Teytelman at Gramene: http://www.gramene.org/documentation/Alignment_docs/rice_rflp.html)
Our BES database consisted of 127457 rice BESs of clones in the rice FPC map (Chen et al. 2001; Mao et al. 2000).
We scored hits as true-positive or false-positive, depending on whether the hit was to the correct chromosome, where the chromosome of the marker was known from the JGRP map, while that for the clone was known from the Genbank record (for draft sequence) or the FPC map (for BES). This is an admittedly simplistic measure, but if bad hits are randomly distributed, it should only mislabel a bad hit as good 1/12 of the time (since there are 12 chromosomes). To check whether bad hits might be concentrated on the "correct" chromosome, 10 hits labeled as true-positive were examined manually, using the detailed map location of the given marker and neighboring markers, and in all cases the hit was confirmed.
Another consideration is that the markers are not entirely specific; i.e., they may hybridize in the lab to chromosomes other than the one mapped; if these extra hits are also found electronically, they will be scored as false-positives according to our "wrong chromosome" test. Because of this, the error rate does not go to zero at high stringencies, but goes to a "base rate" which is determined by the specificity of the markers.Only error in excess of this base rate can be attributed to the search strategy. For the EST markers in our test set this rate is about 9%, while for the STS markers it is much lower.
In a traditional BLAST search, hits are ranked according to the BLAST E-value, which measures the probability of that hit occurring by chance. This is a sensible measure to use for similarity searches, where one wants to find any target sequences that probably have the same function or evolutionary origin as the target.
When adding electronic markers, however, one isn't looking for similarity, but for identity. The marker sequence is supposed to be exactly present in at least one location in the genome. A perfect match of, say, half of the marker sequence, is not the kind of hit one is looking for, even though it could have a very low E-value.
For this reason, it seems sensible to score electronic marker hits according to the percentage of the total marker length which is matched; this score will be denoted "%Match". (Note that this is not the same as the "percent identity" from BLAST reports, because that score only gives the fraction of bases that match within the alignment region, which may not cover the whole marker.)
The %Match strategy is clearly seen to be superior to E-value in tests on the STS subset of JRGP markers (table I below). On this dataset, using MegaBLAST, a screen on %Match ≥ 95 finds 462 hits, with a false-positive rate of 2%;by contrast, taking E-value ≥ 170 gives only 179 hits, with 3% false-positives.[ %Match was implemented by parsing the output file and taking %Match = (match length/query length) *(percent identity) .]
It is understandable that E-value should work poorly in this case, because the sequences have a variety of lengths, and a given E-value is only well-suited for a small range of lengths. E-170 corresponds on this dataset to a match of about 300bp, and since the average STS marker has length only 325, there are many which are shorter than 300. Hits of these markers will never be found at E-170; however, if the E-value cutoff is lowered to accept the shorter markers, it becomes too low for the longer ones, and false hits are introduced. For example, marker L704 has length 268, and scores a 100% hit on clone AC018929, but this hit has E-value only E-149. To accept this hit, the E-value cutoff would have to be lowered at least to E-149, at which point the cumulative false-positive rate is 6%, twice as high as at E-170.
Another problem with E-value screens, which may not be sufficiently appreciated, is that the value returned for a given alignment varies with the size of the databases being compared. The larger the databases from which the two sequences are drawn, the larger will be the returned E-value. This means that hits which are accepted at a given E-value cutoff, could be rejected if the database sizes are increased, even if the two sequences are unchanged. So for example if one runs an E-value screen of some markers against batches of sequence that come in periodically over time, and then runs the same screen later against the full set of sequences, the set of hits will be different. By contrast, the %Match strategy is independent of database size, so this potential confusion is absent.
From the data in table I, these STS marker sequences are seen to be very accurate, so that one can expect to find typically at least 95% of the marker sequence in the target genome; this is the ideal situation for the %Match strategy. If the marker sequences are less accurate, or contain bad reads at either end, then the advantage of %Match over E-value will be reduced or eliminated. This effect can be seen within the EST marker set, which also has the additional complication of spliced introns, as discussed in the next section.
It is worth mentioning that not all of the markers get a hit at %Match ≥ 95, or indeed at any cutoff. At 95%, hits are found for 220 unique markers, or 83% of the total 263 markers in the set. Lowering the cutoff to 90% brings the marker total to 235, while going all the way down to 70% brings in only 5 more for a total of 240. So it seems that about 10% of the markers simply do not hit, which is not unreasonable given the gaps in clone coverage and in draft sequence, and possible errors in marker sequencing.
Finally we note that although MegaBLAST data was used in this comparison, BLAT gives the same result; furthermore, in the context of BSS, the %Match scoring is supported only through BLAT (the reason for this, as discussed below, is that %Match requires BLAT in order to work well on EST data). MegaBLAST data was used here merely to emphasize that it is the scoring strategy, not the program, which is being tested.
Supplemental Table I. %Match Strategy vs. E-value strategy for STS-to-sequence data.
|
%Match Strategy |
|
|
|
|
E-value Strategy |
|
%Match |
F.P.(Cumulative) |
Hits |
|
|
E-value |
F.P.(Cumulative) |
Hits |
100% |
1% |
131 |
|
|
200 |
3% |
134 |
99 |
2 |
288 |
|
|
190 |
3 |
134 |
98 |
2 |
366 |
|
|
180 |
3 |
140 |
97 |
2 |
420 |
|
|
170 |
3 |
179 |
96 |
1 |
453 |
|
|
160 |
6 |
234 |
95 |
2 |
462 |
|
|
150 |
6 |
258 |
94 |
2 |
473 |
|
|
140 |
8 |
325 |
93 |
2 |
479 |
|
|
130 |
10 |
387 |
92 |
3 |
495 |
|
|
120 |
20 |
482 |
91 |
5 |
511 |
|
|
110 |
28 |
592 |
90 |
5 |
512 |
|
|
100 |
31 |
651 |
Marker sequences which have undergone splicing present a problem for the %Match strategy, because although the full marker sequence may be present in the target, it is in disconnected exons. BLAST (and MegaBLAST) returns each exon hit as a separate high-scoring pair, but from the report it is not straightforward to put those pieces together to produce the total %Match.
However, there is a search tool which does this automatically - BLAT. Given this capability, and the superiority of the %Match strategy seen above on STS data, it is reasonable to believe that BLAT + %Match will be superior on the EST data. In fact we will see that it is, but the margin of difference is much smaller than on the STS data.
Tables II and III display the EST-to-sequence data for BLAT and MegaBLAST, respectively. (In table II, the maximum allowed intron was taken to be 1000, i.e., all hits with any intron larger than 1000bp were excluded; see below. Note that, before these tests were run, all Poly(A) tails, including any embedded n's, were stripped from the markers. Also, MegaBLAST was run without any filtering, i.e., using command-line parameter -F F.) Supplemental Table II. BLAT data, EST-to-sequence, maximum intron = 1000
F.P.(Cumulative) |
F.P.(Running) |
Hits |
Intron Hits |
F.P.(Introns) |
|
100% |
9% |
9% |
402 |
192 |
7% |
99 |
9 |
9 |
1561 |
765 |
9 |
98 |
8 |
5 |
2473 |
1173 |
5 |
97 |
8 |
10 |
3029 |
1436 |
11 |
96 |
8 |
12 |
3452 |
1639 |
12 |
95 |
8 |
7 |
3734 |
1766 |
7 |
94 |
8 |
5 |
3943 |
1866 |
3 |
93 |
8 |
10 |
4065 |
1917 |
3 |
92 |
8 |
11 |
4183 |
1985 |
7 |
91 |
9 |
17 |
4307 |
2048 |
19 |
90 |
9 |
8 |
4397 |
2087 |
5 |
89 |
9 |
10 |
4463 |
2123 |
8 |
88 |
9 |
13 |
4523 |
2149 |
15 |
87 |
9 |
16 |
4577 |
2179 |
26 |
86 |
9 |
19 |
4628 |
2192 |
30 |
85 |
9 |
16 |
4664 |
2213 |
23 |
84 |
9 |
24 |
4705 |
2230 |
41 |
83 |
9 |
12 |
4755 |
2253 |
8 |
82 |
9 |
20 |
4803 |
2274 |
47 |
81 |
9 |
20 |
4833 |
2290 |
25 |
80 |
9 |
31 |
4878 |
2317 |
33 |
79 |
9 |
13 |
4923 |
2339 |
13 |
78 |
10 |
32 |
4948 |
2349 |
30 |
77 |
10 |
10 |
4988 |
2378 |
13 |
76 |
10 |
33 |
5027 |
2392 |
28 |
75 |
10 |
32 |
5058 |
2410 |
38 |
74 |
10 |
41 |
5087 |
2427 |
23 |
73 |
10 |
57 |
5125 |
2445 |
33 |
72 |
11 |
25 |
5156 |
2458 |
23 |
71 |
11 |
50 |
5182 |
2473 |
53 |
70 |
11 |
52 |
5201 |
2483 |
60 |
E-value |
F.P.(Cumulative) |
F.P.(Running) |
Hits |
200 |
7 |
7 |
2116 |
195 |
7 |
0 |
2116 |
190 |
7 |
0 |
2116 |
185 |
7 |
0 |
2116 |
180 |
7 |
9 |
2157 |
175 |
8 |
12 |
2319 |
170 |
8 |
14 |
2481 |
165 |
8 |
11 |
2624 |
160 |
8 |
6 |
2771 |
155 |
8 |
7 |
2910 |
150 |
8 |
11 |
3046 |
145 |
8 |
5 |
3181 |
140 |
8 |
15 |
3296 |
135 |
8 |
12 |
3428 |
130 |
9 |
13 |
3552 |
125 |
9 |
11 |
3723 |
120 |
9 |
17 |
3897 |
115 |
10 |
21 |
4069 |
110 |
10 |
18 |
4249 |
105 |
10 |
21 |
4401 |
100 |
11 |
27 |
4561 |
95 |
11 |
22 |
4704 |
90 |
13 |
53 |
4934 |
85 |
14 |
34 |
5069 |
80 |
14 |
28 |
5221 |
For purposes of comparison, we can take the lowest cutoffs in each table which still give some fixed (cumulative) false-positive rate. Using 8% false-positives, we see that BLAT with %Match ≥ 92 finds 4183 hits, while MegaBLAST with E ≥ 135 finds 3428. Using 9%, BLAT with %Match ≥ 79 finds 4923, while MegaBLAST with E ≥ 120 finds 3897.
Hence, on this data, BLAT + %Match finds 20% more hits than MegaBLAST, with equal error rates. From table II it appears also close to 50% of the hits found by BLAT contain (putative) introns, here defined as hits containing at least one target gap larger than 20bp; also, the running false-positive rate for these intron hits separately is not higher than the total rate, if %Match ≥ 90. At %Match < 90, the running false-positive rates for intron hits do rise, suggesting that 90 is a good cutoff for %Match on this data.
Looking in more detail at the results for 8% cumulative false-positives, it turns out that BLAT finds 1145 true hits that MegaBLAST misses, of which 894 contain introns. However, MegaBLAST still finds 492 true hits that BLAT misses. The average length of these excess MegaBLAST sequences is 497 bases, larger than the 421 bp average length of all the EST sequences; this suggests that possibly the problem is low-quality sequence remaining at one end. Since the 3' ends were trimmed in most cases to a Poly(A) tail, the 5' ends are suspect.
In order to test this hypothesis, 5% of the marker sequence was removed from the 5' end of each marker, and BLAT was run again. On this data, total error 8% required only %Match ≥ 90, and the hits found by BLAT but not MegaBLAST increased to 1276, while those found by MegaBLAST and not BLAT decreased to 353. This bears out the hypothesis, but blindly cutting off larger pieces of the sequence doesn't work; the trimming must be done as accurately as possible at the time of sequencing, if the markers are to realize their full value.
It was mentioned above that the BLAT hits were screened for a maximum intron size of 1000. This number obviously must be chosen to suit the particular genome in question; for rice, with typical intron size 400-600bp (Sasaki et. al. 2002; Y. Yu, pers. comm.), a limit of 1000 seems reasonable. Our EST data supports this, because a higher intron size limit brings in few new hits, while quality declines considerably. For example, taking %Match ≥ 92, but considering only hits with maximum intron size 1000 or more, one finds only 465 additional hits, and the cumulative false-positive rate is 14%, as compared to 8% for the original set with maximum intron size 1000.
Before moving on we again discuss the false-negatives, i.e., the markers which do not hit. Using BLAT, with only the Poly(A) tails trimmed, and %Match ≥ 92, one finds hits to 1767 unique markers, or 81% of the total set of 2161. Carrying the search to %Match ≥ 70 brings the total to 86%, and %Match ≥ 50 gives 88%. So again roughly 10% of the markers are missing, a figure not out of line with the known gaps in coverage.
Aligning marker sequences to short target sequences such as BESs presents a problem for the %Match strategy, since usually only a piece of the marker sequence can be expected to hit, and this is especially true for EST markers containing introns. The usefulness of BLAT's intron-unsplicing capability is likewise also reduced. Our dataset did not produce enough hits for conclusive results, but it appeared that both of the two strategies considered above perform the same.
ReferencesAltschul, S. , Madden, T., Schaffer, A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. 1997. Gapped BLAST and PSI-BLAST: a New Generation of Protein Database Search Programs.Nucleic Acids Res. 25:3389-3402.
Chen, M., Presting, G., Barbazuk, W., Goicoechea, J., Blackmon, B., Fang, G., Kim, H., Frisch, D., Yu, Y., Higingbottom, S., Phimphilai, J., Phimphilai, D., Thurmond, S., Gaudette, B., Li, P., Liu, J., Hatfield, J., Sun, S., Farrar, K., Henderson, C., Barnett, L., Costa, R., Williams, B.,Walser, S., Atkins, M., Hall, C., Bancroft, I., Salse, J., Regad, F., Mohapatra, T., Singh, N., Tyagi, A., Soderlund, C., Dean, R., and Wing, R.2001.An integrated physical and genetic map of the rice genome.Plant Cell 14:537-545.
Kent, W. James. 2001.BLAT - The BLAST-Like Alignment Tool.Genome Research 12:656-664.
Mao, L., Wood, T.,Yu, Y., Budiman, M., Woo, S., Sasinowski, M., Goff, S., Dean, R. and Wing, R. 2000. Rice Transposable Elements: A Survey of 73,000 Sequence-Tagged-Connectors (BESs). Genome Research 10:982-990.
Sasaki, T., Matsumoto, T., Yamamoto, K., Sakata, K., Baba, T., Katayose, Y., Wu, J., Niimura, Y., Cheng, Z., Nagamura, Y., et al.2002. The genome sequence and structure of rice chromosome 1.Nature 420:312-316.
Zhang, Z., Schwartz, S., Wagner, L., and Miller, W.2002. A Greedy Algorithm for Aligning DNA Sequences. Journal of Computational Biology 7:203-214.