SyMAP System Guide

Find synteny between two sequenced genomes with optional annotation.

Draft sequence ordering by synteny, i.e. align a draft genome to a fully sequenced genome (not draft-to-draft).

Find synteny between an FPC map and sequenced genome.

For multiple selected synteny pairs, display the results in dot plot, circular, and side-by-side views.

Complete annotation-based queries with construction of cross-species gene families.

Publications

Steps for finding synteny

1.	Use a Linux or Mac machine.	It needs to have Java v1.8 or later, and sufficient processing power. See system requirements.
2.	Set up MySQL.	See MySQL.
3.	Download SyMAP.	Installation is a simple unzip. See installation.
4.	Run the demo.	Highly recommended. See running the demo.
5.	Prepare sequences and annotation.	Sequences can be in one or many files and can be masked or unmasked. See preparing the sequences. Annotation format is gff3; see annotation files.
6.	Load the files into SyMAP.	The SyMAP interface makes this easy; see creating a new project.
7.	Compute alignments and synteny.	This is also easy through the SyMAP interface. See runtime and memory.
8.	View results.	Detailed description of the user interface is in the User Guide.

For viewing alignments, CPU and memory needs are typically negligible, unless you are performing queries on more than 4-5 genomes at once.

Installation

The first time you run SyMAP, it will create the database with information written to the terminal, e.g.

The Project Manager window opens showing the four demo projects provided with the SyMAP packages. Check "Demo-Seq" and "Demo-Seq2".

A link "Load All Projects" will be displayed at the top of the right panel; select it to load the projects, which will take several minutes. If loading "Demo-Seq" takes more than a few minutes, you may need to adjust the MySQL parameters; see TroubleShoot MySQL. When done, the Manager will look as shown in the image on the right. In the "Available Syntenies" table, click the cell for the "Demo_Seq2" row and the "Demo_Seq" column. Then click the "Selected Pair" button to start the alignment.

The alignment and synteny computation takes less than 5 minutes on a MacOS 10.5 machine but could take up to 30 minutes on a slow machine. When done, the table cell will have a checkbox, signifying that the synteny is available for viewing. Click the checked cell, which will enable the viewing buttons.

Click "Summary" to view the v5 summary shown on the right; there may be a slight difference in the number of anchors because the results vary slightly when run with different numbers of CPUs. To view the other interfaces, see Demo Query.

Ordering the demo draft

MySQL and parameters

The MySQL installation does not need to be on the machine where you will do the computations or view the results, as long as it is on an accessible network. Once the server is ready, fill out the database parameters in the symap.config file in the main SyMAP directory, as described below.

Important Note: The default settings of MySQL are poorly suited for large-scale data storage. You will want to adjust the parameters innodb_buffer_pool_size and innodb_flush_log_at_trx_commit as described in Trouble Shoot MySQL.

Database Parameters

Runtime and Memory

The largest component of SyMAP execution time is running MUMmer^1,2. The time and memory that MUMmer needs depend on the size of the genomes. For example, aligning rice (12 chromosomes, 370 Mb) to maize (10 chromosomes, 2 Gb) required 1 hour and 3 minutes using 8 CPUs at 2.3 GHz. SyMAP used one CPU per maize chromosome to align the entire rice genome against each of the 10 maize chromosomes (i.e. used 10 CPUs).

The memory usage of MUMmer is typically 5 GB per CPU; however, it can be as high as 10 GB for very long or repetitive chromosomes. If MUMmer fails, it is often due to insufficient memory. See the MUMmer document, which explains how to determine the problem and ways around it.
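The memory figures above suggest a quick sanity check before starting a large alignment. The following is a minimal sketch, not part of SyMAP; the function name and the example machine (32 GB, 8 CPUs) are hypothetical, and the per-CPU figures are simply the typical and worst-case values quoted above.

```python
def max_parallel_mummer_jobs(available_gb, per_cpu_gb, cpu_count):
    """Estimate how many MUMmer processes fit in memory, capped by CPU count.

    Hypothetical helper for planning only; SyMAP does not expose this function.
    """
    if per_cpu_gb <= 0:
        raise ValueError("per_cpu_gb must be positive")
    by_memory = int(available_gb // per_cpu_gb)
    return max(0, min(cpu_count, by_memory))


# Example machine (assumed): 32 GB of free memory and 8 CPUs.
typical = max_parallel_mummer_jobs(available_gb=32, per_cpu_gb=5, cpu_count=8)
worst = max_parallel_mummer_jobs(available_gb=32, per_cpu_gb=10, cpu_count=8)
print("typical case:", typical, "jobs; repetitive genomes:", worst, "jobs")
```

If the worst-case count is much lower than the number of CPUs you planned to use, reducing the CPU count is one way to lower the risk of a memory failure.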

Creating a New Project

To create a new project via the SyMAP interface, press the "Add Project" button at the lower left. Enter the project name beside "Name:"; if the project type is "fpc", change the dropdown beside "Type:". The Help button on the dialog provides further information. The following discusses a project of type "sequence"; an "fpc" project is similar.
After saving the new project, it appears in the Projects list on the left, but it is still an empty shell. A directory will be made under `/data/seq`, e.g. for the project added on the right, a directory called `/data/seq/foobar` will be created. Check the project's box and it will appear in the Summary section (right-hand side).

Parameters

Preparing the Sequences

Another masking option which is available if you have gene annotation is to mask out everything but the annotated genes. You can enable the "mask_all_but_genes" option on the Project's Parameters window (shown above); turn it on before doing the alignments.

Note that sequence files should be in FASTA format and the name of a sequence is the string immediately following the ">", e.g.

Three things are important in naming sequences for SyMAP:

A.	Sequence names can contain only letters, numbers, and underscores.
B.	The sequence names must exactly match those used in the annotation files (first column), or the annotations will not be loaded.
C.	Use a consistent prefix such as "Chr" for all sequences, followed by a short number; set 'grp_prefix' to the prefix in the Project's Parameters window (shown above). If there is no consistent prefix, you may leave 'grp_prefix' blank, but this can have unintended results and should be avoided if possible.
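The three naming rules above can be checked before loading. This is a minimal sketch, not SyMAP code: the function names are hypothetical, and it assumes the sequence name ends at the first whitespace after the ">" (the guide only says the name is the string immediately following it).

```python
import re

VALID_NAME = re.compile(r"^[A-Za-z0-9_]+$")  # rule A: letters, numbers, underscores


def fasta_names(lines):
    """Yield each FASTA record name (assumed to end at the first whitespace)."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            yield fields[0] if fields else ""


def check_names(names, grp_prefix="Chr"):
    """Return (name, problem) pairs for names that break rule A or rule C."""
    problems = []
    for name in names:
        if not VALID_NAME.match(name):
            problems.append((name, "invalid characters"))
        elif grp_prefix and not name.startswith(grp_prefix):
            problems.append((name, "missing prefix"))
    return problems


def annotation_names_not_in_fasta(gff_lines, names):
    """Rule B: list first-column names in the annotation that match no sequence."""
    known = set(names)
    missing = set()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        seqid = line.split("\t", 1)[0]
        if seqid not in known:
            missing.add(seqid)
    return sorted(missing)


fasta = [">Chr1 rice", "ACGT", ">Chr2", "GGCC", ">scaffold-7", "TTAA"]
gff = ["##gff-version 3", "Chr1\tsrc\tgene\t1\t4\t.\t+\t.\tID=g1",
       "chr3\tsrc\tgene\t1\t4\t.\t+\t.\tID=g2"]
names = list(fasta_names(fasta))
print(check_names(names))
print(annotation_names_not_in_fasta(gff, names))
```

Here "scaffold-7" is flagged for its hyphen, and "chr3" is reported because it matches no sequence name (the comparison is case-sensitive).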

Annotation files

The last column (attributes) contains "tag=value" pairs describing the annotation. You can set which attributes to use, or use all those occurring more than a certain number of times; open the Project's Parameters window (shown above) and look for the parameter "annot_keywords".

Important: Only the annotation on "gene" entries is shown in the displays or used for searching. For entries "exon", "CDS", "gap", and "centromere", only the coordinates from the gff file are used; the annotation text is not read.
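To decide what to enter for "annot_keywords", it can help to count how often each attribute tag occurs on "gene" entries. The sketch below is an illustration only and does not reproduce SyMAP's loader; the function names and the threshold are hypothetical, and it assumes the standard gff3 layout of nine tab-separated columns with semicolon-separated attributes.

```python
from collections import Counter


def parse_attributes(column9):
    """Split a gff3 attributes column into a dict of tag -> value."""
    pairs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            tag, value = field.split("=", 1)
            pairs[tag.strip()] = value.strip()
    return pairs


def frequent_tags(gff_lines, min_count):
    """Return tags seen on more than min_count 'gene' entries (assumed cutoff rule)."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue  # only annotation on gene entries is displayed or searched
        counts.update(parse_attributes(cols[8]).keys())
    return sorted(tag for tag, n in counts.items() if n > min_count)


gff = [
    "##gff-version 3",
    "Chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=g1;Name=abc;product=kinase",
    "Chr1\tsrc\texon\t100\t300\t.\t+\t.\tID=e1;Parent=g1",
    "Chr2\tsrc\tgene\t50\t700\t.\t-\t.\tID=g2;product=transporter",
]
print(frequent_tags(gff, min_count=1))
```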

NCBI files: A Java program (scripts/ConvertNCBI.class) is provided that converts NCBI genome fasta files and gff annotation files into the format that works best with SyMAP. See the documentation for instructions.

Ensembl files: A Java program (scripts/ConvertEnsembl.class) is provided that converts Ensembl genome fasta files and gff3 annotation files into the format that works best with SyMAP. See the documentation for instructions.

Draft sequence

If you are not ordering the draft sequence, and if the draft sequence is in too many sequence pieces, then (1) it takes a long time for the MUMmer comparisons, (2) the display is very cluttered, and (3) the blocks display does not work right. Limit the number of sequence pieces by setting min_size in the parameters window to only load the largest 150 sequences; there is a script called scripts/lenFasta.pl which will print out all the lengths; set the min_size to the 150th length. However, even 150 are a lot of blocks to view so you might want to start with the largest 50, merge them, then repeat.
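The "150th length" rule above can be sketched as follows. This is an illustration in Python rather than the provided scripts/lenFasta.pl, whose output format is not reproduced here; the function names are hypothetical, and it assumes that min_size keeps sequences whose length is at least the given value.

```python
def sequence_lengths(lines):
    """Return a dict of sequence name -> length from FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def min_size_for_top_n(lengths, n=150):
    """Return the length of the n-th largest sequence (or of the smallest, if fewer)."""
    ordered = sorted(lengths.values(), reverse=True)
    if not ordered:
        return 0
    return ordered[min(n, len(ordered)) - 1]


fasta = [">s1", "ACGTACGT", ">s2", "ACGT", ">s3", "AC"]
lengths = sequence_lengths(fasta)
print(min_size_for_top_n(lengths, n=2))
```

If several sequences share the cutoff length, slightly more than n sequences may pass the threshold.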

Self Alignments

Working with FPC Files

Creating an FPC project is the same as for a sequence project except that you choose the type "fpc", and then the Project Parameters window has some different parameters. The Parameters window is where you will enter the FPC file, and your fasta files of marker and BAC-end sequences.

Note that the BAC-end sequence names must be exactly the clone names used in FPC, with extension "r" or "f" labeling the strand. In other words, if the FPC map has a clone "a0435B26" then the BES for that clone can be named "a0435B26f" or "a0435B26r".
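The BES naming rule above is easy to check before loading. A minimal sketch, assuming the set of clone names has already been read from the FPC file by some other means; the function name is hypothetical and this is not SyMAP code.

```python
def bes_clone(bes_name, clone_names):
    """Return (clone, strand) if bes_name is a known clone name plus 'r' or 'f', else None."""
    if len(bes_name) > 1 and bes_name[-1] in ("r", "f"):
        clone = bes_name[:-1]
        if clone in clone_names:
            return clone, bes_name[-1]
    return None


clones = {"a0435B26"}  # example clone name taken from the text above
for name in ["a0435B26f", "a0435B26r", "a0435B26.f", "a0435B26"]:
    print(name, "->", bes_clone(name, clones))
```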

The BES and marker alignments in an FPC project are performed using BLAT³. The running time is typically several times longer than that of MUMmer (described here), but the memory usage is much lower.

How SyMAP Works

Alignment:
The sequences are written to disk^*, with gene-masking if desired. In the alignment, one species is the "query" and the other is the "target". If one project is FPC, that is the query; if both are sequence projects, the query is the one whose name is alphabetically first. The query sequences are written into one large file, while smaller target sequences are grouped into larger fasta files of up to 60 Mb each, for more efficient processing in MUMmer. There is an option "Concat"; if it is unchecked, the query sequences are treated the same as the target. This is useful if the query and target are very large genomes.
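The query selection and target grouping described above can be sketched as follows. This is an illustration under stated assumptions, not SyMAP's implementation: the function names are hypothetical, the alphabetical comparison is assumed to ignore case, "60 Mb" is taken as 60 million bases, and the grouping is a simple greedy pass in input order.

```python
def choose_query(name_a, type_a, name_b, type_b):
    """Pick the query project: the FPC one, else the alphabetically first name."""
    if type_a == "fpc" and type_b != "fpc":
        return name_a
    if type_b == "fpc" and type_a != "fpc":
        return name_b
    return min(name_a, name_b, key=str.lower)  # assumption: case-insensitive order


def group_targets(lengths, max_bases=60_000_000):
    """Greedily pack (name, length) pairs into groups of at most max_bases.

    A sequence longer than max_bases ends up in a group of its own.
    """
    groups, current, total = [], [], 0
    for name, length in lengths:
        if current and total + length > max_bases:
            groups.append(current)
            current, total = [], 0
        current.append(name)
        total += length
    if current:
        groups.append(current)
    return groups


targets = [("Chr1", 45_000_000), ("Chr2", 40_000_000), ("s1", 10_000_000),
           ("s2", 5_000_000), ("s3", 70_000_000)]
print(choose_query("rice", "sequence", "maize", "sequence"))
print(group_targets(targets))
```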

Anchor Clustering:
The raw anchor set consists of the hits found by MUMmer or BLAT. These are first clustered into gene or putative-gene hits. This is done by clustering the hit regions on each sequence, and then defining new "gene" hits which connect these regions. For example, if three separate exons hit between two genes, they will be clustered into one "gene" hit with a combined score equal to the sum of the raw hit scores. Clustering is by gene if the hits overlap annotation; otherwise, it uses a maximum separation of 1 kb, creating "putative gene" regions.
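The putative-gene case described above (no annotation, 1 kb maximum separation) can be sketched as follows. This is a simplified illustration, not SyMAP's code: the hit tuple layout and function names are hypothetical, hits are assumed to lie on a single query/target sequence pair, and clustering by annotated gene is not covered.

```python
def cluster_regions(intervals, max_gap=1000):
    """Assign each (start, end) interval a region index; gaps <= max_gap are merged."""
    order = sorted(range(len(intervals)), key=lambda i: intervals[i][0])
    region_of = [0] * len(intervals)
    region, reach = -1, None
    for i in order:
        start, end = intervals[i]
        if reach is None or start - reach > max_gap:
            region += 1      # too far from the previous region: start a new one
            reach = end
        else:
            reach = max(reach, end)
        region_of[i] = region
    return region_of


def cluster_hits(hits, max_gap=1000):
    """Combine raw hits (q_start, q_end, t_start, t_end, score) into region-pair hits.

    Returns a dict mapping (query_region, target_region) to the summed score.
    """
    q_regions = cluster_regions([(h[0], h[1]) for h in hits], max_gap)
    t_regions = cluster_regions([(h[2], h[3]) for h in hits], max_gap)
    scores = {}
    for hit, q, t in zip(hits, q_regions, t_regions):
        scores[(q, t)] = scores.get((q, t), 0) + hit[4]
    return scores


hits = [
    (100, 200, 5000, 5100, 50),        # three nearby "exon" hits
    (400, 500, 5300, 5400, 40),
    (900, 1000, 5800, 5900, 30),
    (50000, 50100, 90000, 90100, 20),  # a distant, unrelated hit
]
print(cluster_hits(hits))
```

The three nearby hits collapse into one clustered hit whose score is the sum of the raw scores, while the distant hit stays separate.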

Anchor Filtering:
The clustered "gene anchors" are now filtered using a version of reciprocal-best filtering which is adapted for retaining duplications and gene families. For each pair of genes (or putative genes) which is connected by a clustered anchor, the retained anchors must be among the top two anchors by score on both sides (top-2 allows for one ancestral whole-genome duplication). An anchor will also be retained if its score is at least 80% of that of the 2nd-best anchor on each side (this allows for retention of gene family anchors). These filter parameters may be adjusted through the Alignment & Synteny Parameters window.
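The top-2 and 80% rules above can be sketched as follows. This is one reading of the description, not SyMAP's implementation: the function names and the anchor tuple layout are hypothetical, scores are assumed to be positive, and because any top-2 anchor already scores at least as much as the 2nd-best, both rules reduce to a single per-gene threshold of 80% of the 2nd-best score.

```python
from collections import defaultdict


def side_threshold(scores, top_n=2, fraction=0.8):
    """Minimum score kept for one gene: fraction of its top_n-th best anchor score."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < top_n:
        return 0.0  # fewer than top_n anchors: all of them are in the top_n
    return fraction * ranked[top_n - 1]


def filter_anchors(anchors, top_n=2, fraction=0.8):
    """Keep (query_gene, target_gene, score) anchors that pass on both sides."""
    by_query, by_target = defaultdict(list), defaultdict(list)
    for q, t, score in anchors:
        by_query[q].append(score)
        by_target[t].append(score)
    q_min = {g: side_threshold(s, top_n, fraction) for g, s in by_query.items()}
    t_min = {g: side_threshold(s, top_n, fraction) for g, s in by_target.items()}
    return [(q, t, s) for q, t, s in anchors if s >= q_min[q] and s >= t_min[t]]


anchors = [
    ("q1", "t1", 100),  # best for q1
    ("q1", "t2", 90),   # 2nd best for q1
    ("q1", "t3", 75),   # within 80% of the 2nd best (72), so kept
    ("q1", "t4", 40),   # below the threshold, dropped
    ("q2", "t1", 30),   # 2nd best for t1, only anchor for q2
]
for anchor in filter_anchors(anchors):
    print(anchor)
```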

Synteny Block Detection:
After the clustered anchors are loaded into the database, the synteny block algorithm runs. This algorithm looks for approximately collinear sequences of anchors, subject to several parameters including (A) Number of anchors; (B) Collinearity of the anchors; (C) Amount of "noise" in the surrounding region (to help reject false-positive chains). Criterion A can be adjusted in the Alignment & Synteny Parameters window.
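To give a feel for criteria A and B, the sketch below finds the longest strictly collinear chain of anchors and accepts it only if it has enough anchors. This is a stand-in technique (a longest-increasing-subsequence search), not SyMAP's block algorithm: it handles only same-orientation chains, tolerates no deviation from collinearity, and ignores the noise criterion C. The function names and the min_anchors value in the example are hypothetical.

```python
def longest_collinear_chain(anchors):
    """Return the longest chain of (q_pos, t_pos) anchors increasing in both coordinates."""
    pts = sorted(anchors)
    if not pts:
        return []
    best_len = [1] * len(pts)
    prev = [-1] * len(pts)
    for i in range(len(pts)):
        for j in range(i):
            extends = pts[j][0] < pts[i][0] and pts[j][1] < pts[i][1]
            if extends and best_len[j] + 1 > best_len[i]:
                best_len[i] = best_len[j] + 1
                prev[i] = j
    i = max(range(len(pts)), key=best_len.__getitem__)
    chain = []
    while i != -1:          # walk back through the predecessors
        chain.append(pts[i])
        i = prev[i]
    return chain[::-1]


def call_block(anchors, min_anchors):
    """Criterion A: accept the chain only if it has at least min_anchors anchors."""
    chain = longest_collinear_chain(anchors)
    return chain if len(chain) >= min_anchors else None


anchors = [(10, 100), (20, 210), (30, 150), (40, 320), (50, 400), (60, 50)]
print(call_block(anchors, min_anchors=4))
print(call_block(anchors, min_anchors=5))
```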

^* Note that the sequences are re-written from the database to the disk for three reasons: (A) To allow re-grouping for efficiency; (B) To ensure elimination of invalid characters; (C) To mask non-gene regions, if desired. This also ensures that sequence names will match those in the database, and prevents problems caused by moving the source sequences on disk.

Contents

Overview and Publications

Publications

Getting Started

Steps for finding synteny

System Requirements

Installation

Running the Demo

Ordering the demo draft

New Project

MySQL and parameters

Database Parameters

Runtime and Memory

Creating a New Project

Parameters

Sequence project

Preparing the Sequences

Annotation files

Draft sequence

Self Alignments

Directory structure

Alignments and external programs

Alignment executables

MUMmer with SyMAP details

FPC project

Working with FPC Files

How SyMAP Works

References