The University of Arizona

Simulation of FP methods  


This study was funded in part by NSF grant #0211851 and by the Initiative for Future Agriculture and Food Systems Grant no. 2001-52100-11292 from the USDA Cooperative State Research, Education, and Extension Service.

Contents:
      Methods
      Simulation Results
      Discussion

Methods

Xu et al. (2004), Genomics 84: 941-951, compare the following fingerprinting methods using simulations based on human chr22:
Method  | Enzymes                                    | Band size range (bp) | Gel length [1]             | Tolerance (Xu, Ours) [2]
1e      | HindIII                                    | 600 to 16k           | 3300                       | Variable, 7
2e      | HindIII, HaeIII                            | 58 to 773            | (773-58) * 1 = 715         | 2, 2
3e [3]  | HindIII, BamHI, HaeIII                     | 35 to 500            | (500-35) * 10 * 1 = 4650   | 2, 4
4e      | HindIII/HaeIII; HindIII/RsaI; HindIII/DpnI | 75 to 500            | (500-75) * 10 * 3 = 12750  | 5, 4
5e      | HindIII, BamHI, XbaI, XhoI, HaeIII         | 35 to 500            | (500-35) * 10 * 4 = 18600  | 2, 4

[1] The paper did not specify the gel lengths used in the FPC configurations, so we computed them from the band values (gel length = total number of possible band values).
[2] We used bands instead of sizes for 1e; hence the tolerance differs. For the three methods that use sequencing machines (3e, 4e, 5e), we used 0.4 bp, as estimated by ourselves and by Luo et al. for the ABI 3700/3730.
[3] In the 3e method, as specified in the paper, only HindIII receives a label.
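
The gel-length arithmetic in the table can be reproduced with a few lines of Python. This is only a sketch: the function name and the argument names `scale` and `channels` are hypothetical, and reading the last two factors as "values per bp" and "number of labeled enzymes or digests" is our interpretation of the formulas shown, not something stated in the paper. The 1e value (3300) is not derived from a formula here and is left out.

```python
def gel_length(min_bp, max_bp, scale=1, channels=1):
    """Total number of possible band values for one method.

    scale    -- assumed number of band values per bp (10 for the sequencer methods)
    channels -- assumed number of labeled enzymes or digests
    """
    return (max_bp - min_bp) * scale * channels


# (min_bp, max_bp, scale, channels), copied from the table above
METHODS = {
    "2e": (58, 773, 1, 1),
    "3e": (35, 500, 10, 1),
    "4e": (75, 500, 10, 3),
    "5e": (35, 500, 10, 4),
}

GEL_LENGTHS = {name: gel_length(*args) for name, args in METHODS.items()}

for name, length in GEL_LENGTHS.items():
    print(name, length)
```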

The paper concludes, based on the human chr22 data alone, that the 2e and 3e methods are to be preferred. This is counterintuitive, because the 4e and 5e fingerprints contain more information. We extended the simulations to a large number of other sequenced chromosomes and find that, as expected, the 5e method is superior, with 4e a fairly close second.

Our simulations followed the methods enumerated above exactly, with two exceptions. First, we simulated agarose (1e) using band data rather than size data, with a fixed tolerance of 7, as this is more customary and matches the error obtained in realistic scoring. Second, we used a tolerance of 4 for all of the methods that specify detection by automated sequencer (3e, 4e, 5e), corresponding to our practice with data from 3730xl machines.

We were not able to fully reproduce the results of the paper using the information provided. The gel lengths were not stated, and these have a significant impact on cutoff values and other simulation outcomes; see Discussion for additional discrepancies.

Simulation Results

  • Human Chr22 comparison: We compare the paper's human22 simulations with our own, using both the parameters stated in the paper and the parameters that we derived using the paper's methodology.

  • Full Simulations, tabulated by coverage or species: fully automated simulations on 12 different pseudomolecules, using 4 different coverages, 3 different build criteria, and 2 different random libraries for each case.

  • The simulation software used may be downloaded here. FPC v8.2 or later is required.
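
The size of the full simulation grid follows directly from the counts given above. The sketch below enumerates it with placeholder labels; the actual pseudomolecule names, coverage values, and build criteria are not listed on this page, so every label here is hypothetical, and whether each case is run once per fingerprinting method is likewise an assumption.

```python
from itertools import product

# Placeholder labels only: the real names and values are not given in the text.
pseudomolecules = [f"pseudomolecule_{i:02d}" for i in range(1, 13)]  # 12
coverages = [f"coverage_{i}" for i in range(1, 5)]                   # 4
build_criteria = [f"criterion_{i}" for i in range(1, 4)]             # 3
libraries = [1, 2]                                                   # 2 random libraries

# One tuple per simulation case: (pseudomolecule, coverage, criterion, library)
cases = list(product(pseudomolecules, coverages, build_criteria, libraries))

print(len(cases))  # 12 * 4 * 3 * 2 cases
```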

Discussion

    The "by coverage" tables show that 5e is clearly superior when measured by the number of contigs formed (or by F-, which is virtually the same measure). The other measures, e.g. F+ or Q, arise in unpredictable quantities as soon as there are badly formed contigs: a given false contig may contain many Q or none at all, and although every false contig contains at least one F+ clone overlap, it sometimes contains many more. These measures fluctuate between the methods and show no clear distinction between them. The combined "Map Score" defined in the paper (and displayed in the "by species" tables) therefore combines one informative score (contig number) with several essentially random scores that obscure the signal.

    Our simulated digestions differed significantly from those reported in the paper in two cases. In the 3e case, we obtained 47.5 bands/clone, compared to the 71.7 cited in the paper; our figure seems more reasonable to us, since it is approximately the same as for the 2e method. For the 4e method, we obtained 110 bands/clone, compared to 73.8 in the paper; here too our value seems reasonable to us, since it is approximately proportional to the number of labeled 6-cutters.
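
To illustrate how bands/clone figures of this kind arise, here is a minimal in silico digest sketch. It is not the simulation software used for this study. The function names are hypothetical, and the model is deliberately simplified: each enzyme is taken to cut at the start of its recognition sequence, clone ends are ignored, and a fragment yields one band if at least one of its ends was cut by a labeled enzyme and its size falls inside the method's size range.

```python
import re

# Standard recognition sequences; exact cut offsets are ignored in this sketch.
SITES = {"HindIII": "AAGCTT", "BamHI": "GGATCC", "HaeIII": "GGCC"}


def cut_positions(seq, enzyme):
    """Start positions of every occurrence of the enzyme's recognition site."""
    site = SITES[enzyme]
    # The lookahead also finds overlapping occurrences.
    return [m.start() for m in re.finditer("(?=%s)" % site, seq)]


def labeled_bands(seq, enzymes, labeled, min_bp, max_bp):
    """Sizes of fragments with a labeled end that fall in [min_bp, max_bp]."""
    cuts = sorted({(pos, enz) for enz in enzymes for pos in cut_positions(seq, enz)})
    bands = []
    for (p1, e1), (p2, e2) in zip(cuts, cuts[1:]):
        size = p2 - p1
        if (e1 in labeled or e2 in labeled) and min_bp <= size <= max_bp:
            bands.append(size)
    return bands


# Toy clone: HindIII at 0, HaeIII at 100, HindIII at 400, BamHI at 456.
toy = "AAGCTT" + "A" * 94 + "GGCC" + "A" * 296 + "AAGCTT" + "A" * 50 + "GGATCC"
bands_3e = labeled_bands(toy, ["HindIII", "BamHI", "HaeIII"], {"HindIII"}, 35, 500)
print(len(bands_3e), bands_3e)
```

Averaging `len(bands)` over all simulated clones would give a bands/clone figure under these simplified assumptions; a real digest model would also need the true cut offsets and the vector ends.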

    We were also unsure how Xu et al. defined the parameter F-, which is shown as zero for all entries of their Table 1. There must be at least one F- (i.e., missing) overlap for every contig break, and even if these are not counted in F-, there are generally additional F- because of bridging clones.
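
The argument that F- cannot be zero whenever there is a contig break can be made concrete with a small sketch. The definition used here is our assumption, not necessarily the one used by Xu et al.: an F- is counted for every pair of clones whose true (simulated) coordinates overlap but which were assigned to different contigs. All names are hypothetical.

```python
from itertools import combinations


def count_missing_overlaps(clones, contig_of, min_overlap=1):
    """Count truly overlapping clone pairs that ended up in different contigs.

    clones    -- {name: (start, end)} true coordinates on the pseudomolecule
    contig_of -- {name: contig id} assignment produced by the assembly
    """
    missing = 0
    for a, b in combinations(sorted(clones), 2):
        (s1, e1), (s2, e2) = clones[a], clones[b]
        overlap = min(e1, e2) - max(s1, s2)
        if overlap >= min_overlap and contig_of[a] != contig_of[b]:
            missing += 1
    return missing


# One contig break between B and C gives one missing overlap (B-C).
clones = {"A": (0, 100), "B": (50, 150), "C": (120, 220), "D": (200, 300)}
contigs = {"A": 1, "B": 1, "C": 2, "D": 2}
one_break = count_missing_overlaps(clones, contigs)

# A clone E that bridges the break adds a second missing overlap (B-E).
clones_bridged = dict(clones, E=(100, 210))
contigs_bridged = dict(contigs, E=2)
with_bridge = count_missing_overlaps(clones_bridged, contigs_bridged)

print(one_break, with_bridge)
```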

    Several of the test cases (human22, human19, human20, and arab1) have false overlaps (F+) even at very stringent cutoffs. In human22 these are caused by a 128 kb repeat (interrupted by 56 gaps), which generates false overlaps in all 5 methods. High-information-content fingerprinting (HICF) could be disadvantageous in such cases, as the false overlap will be detected at a lower cutoff, and furthermore the gaps in the duplicon are less likely to intersect the small HICF fragments and differentiate them. This phenomenon deserves further study, but our simulations do not indicate any disadvantage for HICF even in these cases.

    Note that none of these tests attempts to simulate error. The error rate of HICF fingerprinting in the maps we have constructed to date (whole-genome cereal assemblies) is higher than that of a well-scored agarose project, but the higher throughput and greater contig formation of HICF considerably outweigh this disadvantage.

  • Email Comments To: will@agcol.arizona.edu or cari@agcol.arizona.edu

    Last Modified Thursday February 14, 2008 10:47 AM and 21 seconds