The University of Arizona
FPC Draft Sequence Functions  

FPC Sequence Track Tutorial & Demo

W. Nelson and C. Soderlund, Feb. 2008

Table of Contents

  1. Overview
  2. Alignment using BSS
    1. Using the BSS
    2. Placing the Sequences
    3. Viewing the sequence track
    4. Detailed alignment view
    5. Detecting sequence merges
  3. Alignment using embedded BES
    1. SeqInfo and BES file formats
    2. Placing the Sequences
    3. Detecting contig merges
    4. Detecting misassemblies
  4. Additional error detection functions
  5. Placement algorithm and parameters

1. Overview

The FPC sequence track allows easy alignment and visualization of a draft sequence assembly against an FPC clone map of the same species. This alignment permits you to check both sequence and FPC assemblies and locate possible sequence or contig merges.

Two representations of draft sequence are supported:
• Flat: Single sequences (often with gaps indicated by N's.)
• Scaffolded: Sequence contigs, with scaffold information in an auxiliary file.

There are also two methods of aligning the sequence to the FPC map:
• BSS: By aligning the sequences against BES, using the BSS function of FPC
• Embedded BES: Using BAC-End sequences which were incorporated into the assembly.

This demo illustrates all four possibilities.

Note that BES must be named as clonename+r or clonename+f. In other words, if your clone name is a0001B13, then the BES would be a0001B13r and a0001B13f. (A ".r" and ".f" ending will also work.)
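
If you want to check a set of BES names against this convention before loading them, a minimal sketch such as the following may help. The function name split_bes_name is hypothetical and not part of FPC; it only assumes the rule stated above (clone name followed by "r" or "f", optionally preceded by a ".").

```python
import re

# Hypothetical helper (not part of FPC): accepts "clonename" + "r"/"f",
# with an optional "." before the end letter, as described in the text.
_BES_NAME = re.compile(r"^(?P<clone>.+?)\.?(?P<end>[rf])$")

def split_bes_name(name):
    """Return (clone_name, end) with end in {'r', 'f'}, or None if the name does not fit."""
    m = _BES_NAME.match(name)
    if m is None:
        return None
    return m.group("clone"), m.group("end")

print(split_bes_name("a0001B13r"))
print(split_bes_name("a0001B13.f"))
print(split_bes_name("a0001B13"))
```

Names for which the function returns None would need to be renamed before the alignment functions can associate them with a clone.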

Note that for the flat representation of the draft sequence, the software does not specifically recognize the gaps notated by N's. If your sequences have gaps which you wish to see visually in the display, you must use the scaffolded representation.

Once the alignments are computed, they are displayed in a new "sequence track" within the contig view (for a demo on manipulating the FPC tracks, see the Track Tutorial). A "detailed alignment" view is also available, which opens a separate window showing each individual BES hit between the sequence and the FPC contig.

Several built-in error detection functions are provided to extract information from the alignments, including potential misassemblies, sequence joins, contig joins, sequences and contigs which did not align, and BESs whose location conflicts with an alignment.

To begin the demo, download the package seqdemo.tar.gz. Save it to an empty directory and unpack it with the commands

	gunzip seqdemo.tar.gz
	tar xvf seqdemo.tar
A new directory "seqdemo" is created. Change to this directory and verify that you see the following listing of files:
BES/  BSS_results/  demo.cor  demo.fpc  info_files/  Seq/


2. Alignment using BSS

Start the demo by typing:
fpc demo.fpc
2.1 Using the BSS
Click the "BSS" button on the main menu window to open the BSS window. (If you are not familiar with BSS, you may wish to first read through the BSS tutorial.) (1) For the Query, select "Browse", select the directory Seq, then select the file demo.seq. (2) For the Database, select "Browse", select the directory BES, then select the file demo.bes. (3) Set the Search tool to "Megablast", then set "E-val" to 1e-200. Your BSS window should look as follows:

Now click "Start search" to run the search.

When the search is complete, the last line on your terminal will be:

On average, each file had 6722 hits.
Open the BSS results file demo.demo.bss by double-clicking this name under the "Output file" label. The top part of the Results window will look as follows:

Since we are aligning draft sequence and BES from the same species, we expect near-perfect alignments, so we will apply two filters in order to keep only the hits that have percent identity > 98% and percent match > 97% (i.e. more than 97% of the BES sequence must be aligned).

On the Results window, select "Analysis" at the top, then select from the menu "Filter hits". From the Filter window, set the "Numeric" option to "Identity > 98" (i.e. change "SeqCtg" to "Identity", change "<" to ">", and enter 98). The filter window should look as follows:

Now press the "Apply Filter" button; 1197 results remain.

Next set the Numeric options to "Match > 97" and apply. Now 1119 hits remain.
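
Applying the two filters one after the other keeps only the hits that satisfy both conditions. The following sketch illustrates that logic on made-up hit records; the dictionary keys "identity" and "match" and the function name filter_hits are assumptions for illustration and do not reflect the actual BSS file layout.

```python
def filter_hits(hits, min_identity=98.0, min_match=97.0):
    """Keep hits with identity > min_identity and match > min_match (both in percent)."""
    return [
        h for h in hits
        if h["identity"] > min_identity and h["match"] > min_match
    ]

# Made-up example records, not taken from the demo data.
example = [
    {"bes": "a0001A02.r", "identity": 99.0, "match": 98.0},  # passes both filters
    {"bes": "a0001A03.f", "identity": 98.0, "match": 99.0},  # fails: identity is not > 98
    {"bes": "a0001A03.r", "identity": 99.5, "match": 97.0},  # fails: match is not > 97
]
print([h["bes"] for h in filter_hits(example)])
```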

Close the Filter window. Save the results by selecting "File" at the top of the Results window, then select "Save BSS" from the File menu (note, it will prompt you to confirm, select Yes). Close all the BSS windows.

2.2 Placing the sequences
On the FPC main window, click "Draft" to open the Draft Sequence window, which looks as follows:

Press the BSS button and load the BSS file demo.demo.bss which we just created. The output to the terminal window indicates that 1119 hits were loaded.

Now that the individual BES hits are loaded, we can compute the alignments of the draft sequences to the FPC contigs. For the demo, we will raise the "Min BES hit" parameter from its default value of 5 to 10 (see Section 5 for further explanation of the parameters). After doing this, press the "Place Sequences" button to place the sequences.

When the placement is done, the Project window appears showing the results of the placement:

To match your display to this image you will need to scroll the view to the right. Also, click on the "draft_ctg_1" label to view the clones that it aligns to.

The track shows the sequence segments which align to the contig above; in this case, there are two segments from draft_ctg_1 and one from draft_ctg_2 (not shown). The start and end locations along the sequence are indicated at the ends, in kb, and the total sequence length is shown beneath the center. Reversed alignments are indicated by a "REV" notation, and for these alignments the sequences are drawn in the reverse direction, i.e. with the starting base at the right.

The individual hit regions along the sequence are drawn in red; non-aligning regions are drawn in gray. Gaps in the sequence are shown as narrower, darker lines (an example will be seen below). If the alignment does not reach to the end of the sequence, a gray arrow is drawn at the end, and the next aligning FPC contig (if any) is labeled beneath the arrow; clicking on the arrow opens this contig view (examples below).

Clicking on one of the sequence lines highlights the clones whose BES are hit by that sequence. (Only those hits which were computed to be part of this alignment region will be highlighted). Clicking again on the sequence brings up the sequence information window for that sequence.

As mentioned, the BESs must follow the naming convention for the FPC sequence alignment functions to work properly. BES must be named as "clonename.r" and "clonename.f".

2.4 Detailed alignment view
A more detailed view of the alignments, showing each individual hit, is available by right-clicking on the empty area of the sequence track. On the right-click menu, select "Show detailed alignment" and the detail view window opens:

2.5 Detecting sequence merges
The Project window shows all the potential sequence merges which were found. Each one is listed next to the FPC contig in which it was indicated. In this case there is only one potential merge, notated on contig 1 as
SEQJOIN draft_ctg_1 draft_ctg_2 246
This means that draft_ctg_1 appears to overlap draft_ctg_2, with an overlap of 246 CB units.

The output file seq_joins.txt contains a table with the same information, which can be imported into a spreadsheet.

3. Alignment using embedded BES

The alignments we have looked at so far were generated by importing a BSS file of blast hits between the BES and the draft sequence. The other way to generate alignments is through BES which have been directly incorporated into the draft sequence. In this case, information files must be prepared describing the scaffolds and the locations of the BES within the scaffolds. For this demo, the information files are in the info_files directory.
3.1 SeqInfo and BES file formats
We will first look at the scaffold information file demoSEQ.info, which specifies the sequences we will be using and their scaffolding arrangement. It reads:
>draft_ctg_3  19444618
ctg3_1 0 13000000
ctg3_2 13100000 6344618

>draft_ctg_4  1732668
This specifies "draft_ctg_3" as a scaffold of length 19.4 Mb, having two sequence contigs "ctg3_1", "ctg3_2", starting at 0 and 13.1 Mb, respectively. Note that there is a 100 kb gap between the two.

The sequence "draft_ctg_4" has no sequence contigs specified, meaning that FPC will treat it as a single sequence contig rather than a scaffold.
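
To make the format concrete, here is a minimal parsing sketch. It assumes only what the example shows: a ">name length" header per scaffold, followed by optional "contig start length" lines. The function names parse_seq_info and scaffold_gaps are hypothetical and not part of FPC.

```python
def parse_seq_info(text):
    """Parse scaffold info text into {scaffold: {"length": int, "contigs": [(name, start, length)]}}."""
    scaffolds = {}
    current = None
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue  # blank lines separate scaffold sections
        if fields[0].startswith(">"):
            current = fields[0][1:]
            scaffolds[current] = {"length": int(fields[1]), "contigs": []}
        elif current is not None:
            scaffolds[current]["contigs"].append((fields[0], int(fields[1]), int(fields[2])))
    return scaffolds

def scaffold_gaps(scaffold):
    """Return (gap_start, gap_length) for each gap between consecutive sequence contigs."""
    contigs = sorted(scaffold["contigs"], key=lambda c: c[1])
    gaps = []
    for (_, start_a, len_a), (_, start_b, _) in zip(contigs, contigs[1:]):
        end_a = start_a + len_a
        if start_b > end_a:
            gaps.append((end_a, start_b - end_a))
    return gaps

demo = """>draft_ctg_3  19444618
ctg3_1 0 13000000
ctg3_2 13100000 6344618

>draft_ctg_4  1732668
"""
info = parse_seq_info(demo)
print(scaffold_gaps(info["draft_ctg_3"]))  # the 100 kb gap starting at 13.0 Mb
print(info["draft_ctg_4"]["contigs"])      # empty: treated as a single sequence
```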

The BES location information is contained in the file demoBES.info. It is too long to display in full but we can understand its format by looking at the beginning of one of its sequence sections:

>ctg3_1  13000000
a0023K14.r      12102062        955
a0097D17.r      12102072        684
b0013E04.f      12104259        915
a0090M03.f      12104688        693
This section lists the BES located in the sequence contig "ctg3_1", which was previously defined in the scaffold information file. For each BES, the list shows its clone name, with .r/.f appended; its starting position in the sequence contig; and its length.

Note that BES must follow the naming convention shown here for the FPC sequence alignment functions to work properly, i.e. BES must be named as "clonename.r" and "clonename.f".
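
A minimal sketch of reading this format is shown here. It assumes the layout of the excerpt above (a ">seqctg length" header followed by "name start length" lines) and the ".r"/".f" naming; the function name parse_bes_info is hypothetical and not part of FPC.

```python
def parse_bes_info(text):
    """Parse BES location text into {seqctg: [(clone, end, start, length), ...]}."""
    placements = {}
    current = None
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue
        if fields[0].startswith(">"):
            current = fields[0][1:]
            placements[current] = []
        elif current is not None:
            # Assumes the "clonename.r" / "clonename.f" form; other names raise ValueError.
            clone, end = fields[0].rsplit(".", 1)
            placements[current].append((clone, end, int(fields[1]), int(fields[2])))
    return placements

sample = """>ctg3_1  13000000
a0023K14.r      12102062        955
a0097D17.r      12102072        684
b0013E04.f      12104259        915
a0090M03.f      12104688        693
"""
bes = parse_bes_info(sample)
print(len(bes["ctg3_1"]))
print(bes["ctg3_1"][0])
```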

3.2 Placing the sequences

Now let us load these alignments. It is important to load the scaffold file before loading the BES location file; otherwise, the sequences referred to in the BES file will not be recognized.

Press the "Seq Info" button, select the info_files directory, and then the demoSEQ.info file. The output to the terminal window reports

Loaded 2 supercontigs, 2 seqctgs
Now press the "BES" button, select the info_files directory, and then the demoBES.info. The output to the terminal window reports
Loaded 2908 BES map entries, 1090 found in FPC
Note that the BES file contains quite a few BES for clones which are not in this FPC map. This is often the case since the set of clones with successful fingerprints typically differs from the set having successful BES.

Press "Place Sequences". The Project window appears once again, showing that sequences were placed to contigs 2, 3 and 4, as shown below. Note that the sequences placed earlier still exist, i.e. in contig 1; if you want to remove them first, you can do so with the "Remove Sequences" button.

Look at contigs 3 and 4 and note draft_ctg_3 spans them both. On contig 3, the draft_ctg_3 line has an arrow to contig 4 on the left end; clicking this arrow opens the contig 4 window. The draft_ctg_3 alignment to contig 3 also shows the 100 kb gap, which is drawn as a thin, darker segment of the sequence line. The following shows the bottom part of the contig window.



3.3 Detecting contig merges
As just mentioned, draft_ctg_3 spans FPC contigs 3 and 4. The contig 3 alignment ends at 12.57 Mb on the sequence, while the contig 4 alignment begins at 12.44 Mb, suggesting that these FPC contigs are separated by 130 kb. It is possible that they should be merged within FPC.

The Ctg Joins function on the Draft Sequence window identifies such potential contig merges and tabulates them to an output file and to the Project display. Since 130 kb is a fairly large gap, this merge would not be suggested with the default maximum gap setting. The maximum gap is given by the Seq_FromEnd parameter, which is set to 100 kb by default. For purposes of demonstration, we will raise Seq_FromEnd to 150 kb. After doing this, press the "Ctg Joins" button and accept the default output file name ctg_joins.txt. When the function is done, the Project window appears again, showing the indicated contig merges and the sequences which provide the evidence:

Of course one should only perform these merges if the fingerprints at the ends of the contigs in question actually overlap. Merges using both sequence and fingerprint evidence can be performed automatically using the Ends->Ends function on the Main Analysis page, with the "Seq Confirm" option checked. This option restricts the usual Ends->Ends functionality to consider only merge candidates with sequence confirmation.
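
The gap arithmetic used in this section can be checked with a few lines of code. This is only a sketch of the comparison described above, not FPC's implementation: the function names are hypothetical, positions are given in kb, and treating a gap exactly equal to Seq_FromEnd as acceptable is an assumption.

```python
def gap_kb(end_on_seq_kb, start_on_seq_kb):
    """Distance along the sequence between the end of one alignment and the start of the next."""
    return abs(end_on_seq_kb - start_on_seq_kb)

def join_suggested(end_on_seq_kb, start_on_seq_kb, seq_from_end_kb=100):
    """True if the gap does not exceed the maximum gap (Seq_FromEnd, 100 kb by default)."""
    return gap_kb(end_on_seq_kb, start_on_seq_kb) <= seq_from_end_kb

# Contig 3 alignment ends at 12.57 Mb, contig 4 alignment begins at 12.44 Mb.
print(gap_kb(12570, 12440))
print(join_suggested(12570, 12440))                       # default 100 kb
print(join_suggested(12570, 12440, seq_from_end_kb=150))  # raised to 150 kb
```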

3.4 Detecting misassemblies
One of the primary reasons to align draft and FPC assemblies is to identify possible assembly errors that can be reflected in the alignments in one of two ways:

• Multiple alignments of one sequence to a contig. For correct assemblies, with reasonable coverage of BES, alignments should be contiguous.
• Alignments which terminate in the interior of both sequence and contig. For correct assemblies (and good BES coverage) alignments should only terminate by reaching an endpoint of either contig or sequence.

The Misassemblies function on the Draft Sequence window scans for both types of error. Press the Misassemblies button and accept the default output file name misassembly.txt. The Project window appears, indicating problems on contigs 1, 2 and 3:

Contig 1 has a multiple alignment from draft_ctg_1 (labeled 'MULTI'). Contigs 2 and 3 show alignments with inconsistent terminations ('BADTERM').
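
The two checks described in this section can be sketched as follows. This is an illustration under stated assumptions, not the code FPC runs: the record layout (a "termini" list of (sequence position, contig position) pairs per alignment), the tolerances seq_tol and ctg_tol, and the function names are all hypothetical.

```python
from collections import Counter

def near_end(pos, length, tol):
    """True if pos lies within tol of either end of an interval [0, length]."""
    return pos <= tol or pos >= length - tol

def find_problems(alignments, seq_len, ctg_len, seq_tol=100, ctg_tol=20):
    """Return a list of ("MULTI" or "BADTERM", sequence, contig) tuples."""
    problems = []
    # MULTI: the same sequence aligns to the same contig more than once.
    counts = Counter((a["seq"], a["ctg"]) for a in alignments)
    for (seq, ctg), n in sorted(counts.items()):
        if n > 1:
            problems.append(("MULTI", seq, ctg))
    # BADTERM: an alignment terminus lies in the interior of both sequence and contig.
    for a in alignments:
        for seq_pos, ctg_pos in a["termini"]:
            seq_ok = near_end(seq_pos, seq_len[a["seq"]], seq_tol)
            ctg_ok = near_end(ctg_pos, ctg_len[a["ctg"]], ctg_tol)
            if not seq_ok and not ctg_ok:
                problems.append(("BADTERM", a["seq"], a["ctg"]))
                break
    return problems

# Made-up data: sequence lengths in kb, contig lengths in CB units.
seq_len = {"s1": 1000, "s2": 2000}
ctg_len = {1: 500, 2: 800}
alignments = [
    {"seq": "s1", "ctg": 1, "termini": [(0, 100), (400, 300)]},
    {"seq": "s1", "ctg": 1, "termini": [(600, 350), (1000, 480)]},
    {"seq": "s2", "ctg": 2, "termini": [(0, 300), (900, 800)]},
]
for problem in find_problems(alignments, seq_len, ctg_len):
    print(problem)
```

In this made-up data the third alignment is clean, since each of its ends reaches an endpoint of either the sequence or the contig.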

Unfortunately, it is more difficult to determine whether the error lies in the contig or the draft sequence. Detailed investigation is needed, for example trying to break the FPC contigs apart using more stringent cutoffs.

4. Additional error detection functions

Three large buttons near the bottom of the Sequence window provide additional data about the alignments:
  • Unaligned FPC Contigs: Outputs a table (and keyset) of FPC contigs having no sequence alignments. The "Min Clones" parameter restricts the list to contigs larger than this size. For each contig, the output table also lists the BES in that contig which are incorporated into draft contigs.
    NOTE: for this extra information to be included, a list of all available BES must be loaded through the "Seq Info" function (top of the Sequence window). In other words, make a file with format
    >a0001A02.r     770
    >a0001A03.f     770
    >a0001A03.r     759
    
    where the list includes all BES which exist, and the numbers are their lengths (they don't have to be right for this purpose). When you load them through "Seq Info" they will be added to the set of draft sequences, allowing the analysis functions to be aware of their existence.

    Press the "Seq Info" button and load the allBES.info file from the info_files directory. (This made-up file lists both .r and .f BES for every clone in the project, all with the same length.) Now press the "Unaligned FPC Contigs" button, accept the default filename, and note that the contig display for Ctg5 appears. This is because Ctg5 is the only unaligned contig in this case. Had there been more than one, a keyset of contigs would have appeared instead.

    The output file lists each unaligned contig, along with all of its BES which are incorporated into a draft sequence.

  • Unaligned Draft Contigs: Outputs a table (and keyset) of sequence contigs which have segments that did not align to any FPC contig. The "Min Length" parameter restricts the list to segments longer than this size.
    The table output also lists the embedded BES contained in the segments (if any), along with the FPC contig they are in, and the number of bands in the clone fingerprint. This information helps determine where the segment should align or whether the segment comes from a region having unusually few or many restriction sites.
    In addition, the clones containing the BES are remarked with an FP remark of the form "unaligned:<sequence name>". By using the clone search functions on these remarks, you can get a keyset of the unaligned BES in any of the sequences.

    Press this button and accept the default output filename. A keyset appears showing that draft contigs 2,3,4 have unaligned segments. The exact segments are identified in the output file.

  • Misplaced BES: Outputs a table of BES which are incorporated into draft sequence and which, based on the alignment, should be in a particular FPC contig, but in fact are located elsewhere on the FPC map. The clones with misplaced BES are given an Fp_remark starting with "MISPLACED BES"; this allows you to get a keyset of these clones, highlight them, etc.

    Running this function on the demo project shows that the BES for clone a0102D05 are misplaced. They are incorporated into sequence draft_ctg_4, which aligns to FPC Ctg2, but in fact the clone is located in FPC Ctg5. Hence, the BES are placed either in the wrong sequence or the wrong FPC contig.
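
For the "Unaligned FPC Contigs" function, the list of all available BES described in the note can be produced with a short script. The sketch below writes one ">name length" line per BES, as in the example format; the function name all_bes_info and the placeholder length of 770 are assumptions, and the clone list would come from your own project.

```python
def all_bes_info(clone_names, length=770):
    """Return file contents listing a .f and .r BES for every clone, all with the same length."""
    lines = []
    for clone in clone_names:
        for end in ("f", "r"):
            # The length is a placeholder; it does not have to be right for this purpose.
            lines.append(f">{clone}.{end}\t{length}")
    return "\n".join(lines) + "\n"

# Example clone names for illustration only.
print(all_bes_info(["a0001A02", "a0001A03"]), end="")
```

The returned text can be saved to a file and loaded through "Seq Info" in the same way as allBES.info.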



5. Placement algorithm and parameters

The sequence-to-contig alignments are computed using a sliding-window algorithm that identifies significant clusters of BES hits. A window of size "window_size" (default 250 kb) is moved along the length of the FPC contig; its size is converted into contig CB units using the Band size parameter on the FPC Configure window. Windows which contain at least "min_bes" (default 5) distinct BES hits are saved as candidates (hits to 5' and 3' BES are considered distinct for this count).

For each contig-side window, the hits in the window are collected, and the same sliding-window analysis is applied on the sequence side, using this subset of hits. This produces a collection of window pairs, with each pair representing at least "min_bes" hits between regions of maximum size "window_size" on the sequence and the FPC contig.

The window pairs are then merged to construct the final alignment regions. Specifically, an adjacency graph is constructed identifying the overlapping window pairs, i.e. the pairs for which both the FPC-side and sequence-side windows overlap, and the window pairs in each connected subgraph are merged. The "Top N" parameter restricts the number of alignments returned for a given sequence to N (or leaves it unlimited if the parameter is zero).
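
The following is a rough sketch of the algorithm as described in this section, not FPC's actual code. Several details are assumptions: each hit is a tuple (bes_name, contig_pos, seq_pos), windows are anchored at each hit position, a window's extent is taken as the span of the hits it contains, the contig-side window size is passed in already converted to CB units, and the "Top N" restriction is omitted. The function names are hypothetical.

```python
def windows(hits, key, size, min_bes):
    """Return groups of hits lying within `size` of an anchor hit and holding >= min_bes distinct BES."""
    ordered = sorted(hits, key=key)
    groups = []
    for i, anchor in enumerate(ordered):
        start = key(anchor)
        group = [h for h in ordered[i:] if key(h) <= start + size]
        if len({h[0] for h in group}) >= min_bes:  # .f and .r names count separately
            groups.append(group)
    return groups

def overlap(a_lo, a_hi, b_lo, b_hi):
    return a_lo <= b_hi and b_lo <= a_hi

def place(hits, ctg_window, seq_window, min_bes=5):
    """Return merged alignment regions as (ctg_lo, ctg_hi, seq_lo, seq_hi) tuples."""
    ctg_pos = lambda h: h[1]
    seq_pos = lambda h: h[2]
    pairs = []
    for ctg_group in windows(hits, ctg_pos, ctg_window, min_bes):
        # Repeat the sliding-window analysis on the sequence side, on this subset of hits.
        for both in windows(ctg_group, seq_pos, seq_window, min_bes):
            pairs.append((min(map(ctg_pos, both)), max(map(ctg_pos, both)),
                          min(map(seq_pos, both)), max(map(seq_pos, both))))
    # Merge window pairs that overlap on both sides (connected components via union-find).
    parent = list(range(len(pairs)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(len(pairs)):
        for j in range(i + 1, len(pairs)):
            p, q = pairs[i], pairs[j]
            if overlap(p[0], p[1], q[0], q[1]) and overlap(p[2], p[3], q[2], q[3]):
                parent[find(i)] = find(j)
    merged = {}
    for i, p in enumerate(pairs):
        root = find(i)
        m = merged.get(root)
        merged[root] = p if m is None else (min(m[0], p[0]), max(m[1], p[1]),
                                            min(m[2], p[2]), max(m[3], p[3]))
    return sorted(merged.values())

# Made-up hits: two clusters and one stray hit that does not form a window.
hits = [("a1.f", 0, 0), ("a2.f", 10, 10), ("a3.r", 20, 20), ("a4.f", 30, 30),
        ("b1.f", 1000, 5000), ("b2.r", 1010, 5010), ("b3.f", 1020, 5020),
        ("c1.f", 500, 9000)]
print(place(hits, ctg_window=50, seq_window=50, min_bes=3))
```

The pairwise merge step is quadratic in the number of window pairs, which keeps the sketch short; a production implementation would likely sort the windows first.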

 

Email Comments To: fpc@agcol.arizona.edu

 

 

 

Last Modified Monday April 12, 2010 16:24 PM and 47 seconds