SyMAP System Parameters

The original SyMAP was written for diverse plant genomes with short introns, but has been modified to work for the long introns of mammalian genomes, and less diverse genomes.

1. Build database


1.a Start SyMAP

To start SyMAP, type at the command line: ./symap

To view the command line options: ./symap -h

First-time users of SyMAP should see:

System Guide for setup and system requirements.
Demo to run the demo (do not skip this step!).
Input for file formats.
Terminology.

Project Manager

1.b Load Project

As shown in the Project Manager above, selected projects from the left panel will be listed in the Selected section on the right. The possible functions vary with the state, as listed below:

♦ If there is any selected project not loaded to the database, you will see:
Load All Projects	Load all projects that have not been loaded yet.
♦ If a selected project is not loaded in the database, you will see:
Remove from disk	Only: remove alignment directories from disk. All: remove alignment and project directories from disk. Remove alignments removes the alignments in data/seq_results for this project; you will be prompted to confirm each one. If there are no alignments, you will only see the Remove project prompt. Remove project directory removes data/seq/<project-name> from disk; you will be prompted to confirm. Once removed, the project is no longer shown on the left.
Load project	Loads the sequence and optional annotations to the database. When loading is complete, always verify by selecting the View link, which provides a summary of what has been loaded (e.g. View Arabidopsis).
Parameters	This brings up a panel of parameters, see Project Parameters. After the project is loaded, you can still change the Display parameters.
♦ If a project is loaded into the database, you will see:
Remove from database	The project and its synteny pairs will be removed from the database, but the files stay on disk.
Reload project	Only: reload project only. All: reload project and remove alignments from disk. If All is selected, it prompts for each alignment directory for this project before removing it. It removes the alignments, but leaves the params.txt file. You will need to remove the alignment(s) if (1) there is a change in sequence, or (2) there is a change in the Minimum length parameter; see Load project parameters. For either option, it executes Remove from database followed by Load project.
Reload annotation	Removes the annotation from the database, then loads the annotation again. This does not affect the alignments, so they do not have to be redone. The Align&Synteny commands will recognize existing alignments for the pair and use them for the clustering and synteny computations.
Parameters	This brings up the Project Parameters panel.
View	This brings up a panel with a summary of the loaded results, e.g. View Arabidopsis.

For any action that will remove the project or alignments from disk, a popup will ask you to confirm. If multiple alignment directories will be removed, it prompts for each one.

1.c Available Syntenies

Sequence alignments are performed with MUMmer3, but can be changed to use MUMmer4 (see SyMAP MUMmer).

This section shows a table with the status of alignments between the selected loaded projects. Each cell in the table represents a pair of projects and the cell contains a status code showing whether or not that pair has been aligned (codes are listed below). Note that the table shows each pair cell twice, but only the lower cells are activated.

Clicking on a cell selects that pair of projects (the cell will be highlighted in green), and the buttons that can be selected are activated.

Code	Description
✔	Synteny for this project pair is ready to view.
A	The MUMmer alignment has been performed but the synteny computation has not been run. This status occurs if a pair is completed but then annotations are re-loaded for one of the projects, or if the MUMmer files have been added by the user.
?	The alignments have not been completed. In this case, select Selected Redo and the alignments will be completed, followed by the synteny computation.
(blank)	The alignment has not been started.

See Pair Parameters for additional information on the Available Syntenies and codes.

1.d Align&Synteny

Align&Synteny (A&S): Synteny usually implies Cluster&Synteny as the process has 2 distinct algorithms.
Selected Pair / Selected Redo	Run (or complete) the A&S computation for the selected project pair. If the pair is already complete, the button label changes to Redo, and only the Cluster&Synteny will be rerun (see Pair Parameters for some variations). If you wish to rerun the MUMmer alignment, first use the Clear Pair function.
Clear Pair	Only: remove synteny from database. All: remove synteny and alignments from disk for this pair. If you have changed the Alignment parameters, or loaded new sequence for one of the projects, you need to have the alignments removed and redone; otherwise, you can just remove the synteny from database.
Parameters	Set the pair parameters for the selected pair cell.

For the display buttons, see User Guide.

1.e CPU and Verbose

CPUs:

Enter the maximum number of CPUs to use for the alignments.
For example, with the limit set to 4 and 8 alignments to be done, it will perform 4 at a time.
Alternatively, the number of CPUs may be entered in the symap.config file or
using the command line argument ./symap -p.
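
As a rough illustration of the CPU limit described above, the sketch below groups pending alignments into batches. The function name and the list of labels are hypothetical, and this is only an assumption about the behavior; SyMAP may instead start a new alignment as soon as any running one finishes.

```python
def batch_alignments(alignments, max_cpus):
    """Split pending alignments into groups of at most max_cpus (illustrative only)."""
    if max_cpus < 1:
        raise ValueError("max_cpus must be at least 1")
    return [alignments[i:i + max_cpus]
            for i in range(0, len(alignments), max_cpus)]


pending = [f"align_{n}" for n in range(1, 9)]    # 8 hypothetical alignments
batches = batch_alignments(pending, max_cpus=4)
print([len(batch) for batch in batches])         # sizes of each batch
```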

Verbose checkbox:

If checked, detailed summary information is written as it processes the MUMmer files. The information is written both to the terminal and the logs/<proj-to-proj>/symap.log file.
If this is not checked, it will write status information repeatedly on the same terminal line.
See Demo examples.
This can also be turned on using a command line argument ./symap -v.

Neither of these is saved with the Pair parameters Save.

1.f Additional information

MUMmer	Resolving problems if MUMmer does not run
Draft	Ordering draft sequence
Self	Self-synteny
Cancel	Cancelling an A&S

2. Load Project Parameters


2.a Parameter panel

The Selected section of the Manager

Click the Parameters link for a project to open the parameters panel shown on the right.

Make sure these two parameters are correct before running the alignment: Minimum length, Sequence files. These are described in Load project.

Sequence and Anno (annotation) files:
If the files are put in the default locations, nothing needs to be entered for them. Default locations:

  data/seq/<project-name>/sequence
  data/seq/<project-name>/annotation

See Input for a description of the input files.

2.b Display

Most of the values in the parameter's Display section are shown in the Selected section of the Manager. New values take immediate effect on Save and are saved in the symap_5/data/seq/<project-name>/params.txt file.

Parameter	Description	Default Value	Used In
Category	Category label for the project. This is only used to group projects on the left side of the Manager panel. Category labels must be composed of only letters, numbers, dash, underscore, or period. Either select an existing label from the drop-down or enter a new one in the text box. Do NOT enter the same label with different capitalization, as this may cause problems.	Uncategorized	Selected
Display name	A user-friendly name for the project. Shorter names will work better in the displays. Names must be composed of only letters, numbers, dash, underscore, period. It must be unique over all case-insensitive Display names and project-names.	project-name	Selected and all displays
Abbreviation	The abbreviation must be 5 characters or fewer and composed of only letters, numbers, dash, underscore, period. It does not have to be unique across all Abbreviations, and it can be the same as the corresponding Display name or project-name. However, it must be different between projects that are compared in Queries.	First 5 characters of Display name	Queries column headings and Pair Parameters
Description	Description of the project. Do NOT use quotes, backslash or #.	New project	Selected and View
Group type	How to refer to the sequences.	Chromosome	Selected
Anno key count	This applies to the annotation attributes columns shown in the Queries results table. See the GFF Attributes section below.	50	--
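
The naming rules in the table above can be checked before typing values into the panel. The following is a minimal sketch of those rules only; the function names are hypothetical and are not part of SyMAP, which performs its own checks.

```python
import re

# Letters, numbers, dash, underscore, period (the character set given in the table).
NAME_RE = re.compile(r"[A-Za-z0-9_.\-]+")


def is_valid_name(name):
    """Rule for Category, Display name and Abbreviation characters."""
    return NAME_RE.fullmatch(name) is not None


def is_valid_abbreviation(abbrev):
    """Abbreviation: same character set, at most 5 characters."""
    return is_valid_name(abbrev) and len(abbrev) <= 5


def default_abbreviation(display_name):
    """Default Abbreviation: first 5 characters of the Display name."""
    return display_name[:5]


def is_valid_description(text):
    """Description: no quotes, backslash or #."""
    return not any(ch in text for ch in "\"'\\#")


print(default_abbreviation("Arabidopsis"))
```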

2.c Load project

The following parameters are under the Load project section of the parameters panel.

Group prefix
The term "Group" is used for any FASTA sequence type, e.g. chromosome, scaffold, contig. This option sounds trivial but is important for a good display, so please read the following carefully.

1.	When a Group prefix is entered, SyMAP removes the prefix from the chromosome names and uses the remaining part as a shorter name, e.g. '1' instead of 'chr1', as shown on the right; the top chromosome images do not have the 'chr' removed and the bottom ones do.
2.	If the sequence file has a mix of prefixes (e.g. Chr01, Scaf2345): If a Group prefix is entered, only sequences with that prefix will be loaded. Leaving the Group prefix blank loads all sequences; their prefixes will not be removed.
3.	Long names really clutter the display; it is best to use a consistent prefix that SyMAP can remove. Otherwise, use really short prefixes, e.g. 's' for scaffold, 'c' for chromosome.
4.	You may remove the prefix after the project is loaded. For example, if your sequences had names "Chr1", "Chr2", etc, and "Chr" was NOT entered as the Group prefix before load, you may enter it later and it will be removed from all sequence names. However, this is not reversible; a prefix cannot be added to the sequence names.
5.	This is case-insensitive.
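
The Group prefix rules above can be summarized in a short sketch. The function name is hypothetical and this is not SyMAP's code; it only mirrors the listed behavior (filter by prefix, strip it, match case-insensitively, load everything unchanged when the prefix is blank).

```python
def apply_group_prefix(names, prefix=""):
    """Return the sequence names that would be loaded, with the prefix removed."""
    if not prefix:
        return list(names)              # blank prefix: load all, names unchanged
    kept = []
    for name in names:
        if name.lower().startswith(prefix.lower()):   # case-insensitive match
            kept.append(name[len(prefix):])           # shorter display name
    return kept


fasta_names = ["Chr01", "chr02", "Scaf2345"]
print(apply_group_prefix(fasta_names, "chr"))
print(apply_group_prefix(fasta_names))
```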

This parameter is finicky; after loading a project, select View for a popup of the input, and check the output to the terminal to make sure the annotation was loaded correctly. Also, see xToSymap as it may help you create the files with good prefixes.

Minimum length
This must be an integer, commas are allowed (e.g. 1,000,000).
This is the minimum length of the FASTA sequence that will be loaded; smaller sequences will be ignored. Note that annotations for ignored sequences will also be ignored, but some warning messages will be printed to the terminal. See xToSymap Length for help with setting this parameter.
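
To preview which sequences a given Minimum length would drop, something like the following can be used. It is a sketch under assumptions: the function names and the example lengths are hypothetical, and it assumes a sequence exactly at the minimum is loaded.

```python
def parse_min_length(text):
    """Parse an integer that may contain commas, e.g. '1,000,000'."""
    value = text.replace(",", "").strip()
    if not (value.isascii() and value.isdigit()):
        raise ValueError(f"Minimum length must be an integer: {text!r}")
    return int(value)


def filter_by_length(seq_lengths, min_length):
    """Split sequences into those loaded and those ignored (shorter than min_length)."""
    loaded = {name: n for name, n in seq_lengths.items() if n >= min_length}
    ignored = sorted(set(seq_lengths) - set(loaded))
    return loaded, ignored


lengths = {"Chr1": 30_400_000, "Chr2": 19_700_000, "Scaf77": 12_500}  # made-up values
loaded, ignored = filter_by_length(lengths, parse_min_length("1,000,000"))
print(sorted(loaded), ignored)
```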

Sequence files
Select the input FASTA sequence file(s) or directories of sequence files. For formatting, see Sequence files. Default location: data/seq/<project-name>/sequence

If either the Sequence files or Minimum length parameter is changed:

If the project has already been loaded, Reload Project.
If A&S has previously been run, select Clear Pair and remove the alignment files, then run A&S.

2.d Load annotation

The following parameters are under the Load annotation section of the parameters panel.

Anno keywords
A comma separated list of keywords. This can be used to reduce the annotation attribute keywords shown in the 2D display and Queries table, as described in the GFF Attributes section below.

Anno files
Select the input GFF3-formatted annotation files corresponding to your sequences. Note, using a GFF3 file directly can cause problems if it does not conform to what SyMAP expects; see Annotation files. Annotation is optional but highly recommended. Default location: data/seq/<project-name>/annotation

If either Anno keywords or Anno files is changed:

If the annotation has already been loaded, Reload Annotation.
If A&S has been previously run, re-run A&S (the existing alignment files will be reused).

2.e GFF Attributes

This section gives details on what GFF attributes are displayed in SyMAP, which refers to them as annotations.

The gene annotation is shown on the 2D display and as columns in the Queries results table. The attributes (annotations) come from the last column of the GFF file. The attributes are a keyword=value list, e.g.

   ID=gene-AT1G01010;Name=NAC001;ID=rna-NM_099983.2;product=NAC domain containing protein 1

Defaults: Generally, all genes in a file have the same keywords, in which case, use the defaults. This will cause the entire attribute to be shown for the gene in the 2D display, and the Queries table will have columns for each keyword that has over Anno key count (default 50) occurrences. In the example above, the columns will be ID, Name and product (the second ID will be ignored).

If there are many different keywords in the attribute list, this causes too many columns in the Queries table. This can be reduced by one (or both) of the following:

Anno key count: If there are many different keywords in the attribute list, set this count N to filter out all keywords with <N occurrences. The Anno key count can be modified at any time using symap (not viewSymap).

Anno keywords: The keyword=value pairs to be saved for each gene can be limited by listing the desired keywords separated by commas. Using this approach, it will also reduce annotation description per gene in the 2D display. Referring to the example above, if the string "ID, product" was entered for Anno keywords, the Name=value would not be part of any gene annotation. This must be set before Load Annotations is executed.
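
The effect of Anno keywords and Anno key count can be illustrated with a small sketch. The function names are hypothetical and this is not SyMAP's parser. It assumes that a repeated keyword keeps its first value and that a keyword needs at least Anno key count occurrences to get a column; the text above says both "over" and "<N", so the exact boundary is an assumption.

```python
from collections import Counter


def parse_attributes(attr_column, keep_keywords=None):
    """Parse a GFF attribute column; optionally keep only the listed keywords."""
    pairs = {}
    for field in attr_column.strip().split(";"):
        if "=" not in field:
            continue
        key, value = field.split("=", 1)
        key = key.strip()
        if key in pairs:
            continue                      # e.g. the second ID is ignored
        if keep_keywords is not None and key not in keep_keywords:
            continue                      # not listed in Anno keywords
        pairs[key] = value.strip()
    return pairs


def query_columns(genes, anno_key_count=50):
    """Keywords that occur often enough to become Queries table columns."""
    counts = Counter(key for gene in genes for key in gene)
    return [key for key, n in counts.items() if n >= anno_key_count]


line = "ID=gene-AT1G01010;Name=NAC001;ID=rna-NM_099983.2;product=NAC domain containing protein 1"
print(parse_attributes(line))
print(parse_attributes(line, keep_keywords={"ID", "product"}))
```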

2.f Saving project parameters

On Save and before loading the project, the parameters are saved to:

  data/seq/<project-name>/params.txt

Save also saves the parameters in the database.

The params.txt file parameters are shown on the Project Parameters panel. These can only be viewed and changed using symap (not viewSymap). Do not edit params.txt with a standard editor.

3. A&S Pair Parameters


3.a Parameter panel

The Available Syntenies section explains the table in the lower right. The following provides more information in the context of the pair parameters.

The table on the right has cells that have the following completed:

Value	Alignment	Synteny
blank	No	No
A	Yes	No
✔	Yes	Yes

Alignment will not be redone if the cell contains an A. This is important because MUMmer is very time-consuming, but the synteny computation is not (see timing results); hence, one can make changes to the cluster or synteny parameters and re-run without redoing the alignments.

Select a pair cell in the Available Syntenies table followed by the Parameters button, which will pop up the panel shown on the right.

If the pair cell contains a ✔ or A and the parameters for a section are changed, do the following:

Changed Section	Action
Alignment	Select Clear Pair to remove the existing alignments, then Selected Pair.
Cluster Hits	Use Selected Redo directly; the existing alignments for the pair will be used.
Synteny	If only synteny parameters are changed, click the lower-left drop-down and select Synteny Only (see drop-down).

If the Align&Synteny has already been run, the last row will have an extra drop-down, as shown below (see drop-down):

The parameters are described in the following 3 sections: Alignment, Cluster hits and Synteny.

3.b Alignment

Preparing the sequences

Parameter	Description	Default

Concat

Concat checked:

• For the 1st genome, sequences are concatenated into a file as long as the file length is <1G. Multiple files of maximum file length 1G may be created.

• For the 2nd genome, the same as above except the maximum file length is 60M.

• All files from the 1st genome are searched against all files from the 2nd genome. This results in fewer MUMmer alignments, which can be faster.

Concat unchecked: To reduce memory usage, you can uncheck Concat so that multiple files <60M are created for each genome. This results in more MUMmer alignments, which can be slower.

The exception is self-synteny, where all chromosomes are written to their own file, so Concat is not relevant.

See below for timing differences.
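
To see why Concat changes the number of MUMmer alignments, the sketch below packs sequences into files under a size limit and counts the file pairs. Everything here is illustrative: the function name and sequence lengths are made up, the limits are assumed to be counted in bases, and greedy packing in input order is an assumption about how files are filled.

```python
def pack_sequences(seqs, max_file_len):
    """Greedily group (name, length) pairs so each file's total stays below max_file_len.

    A single sequence at or above the limit still gets a file of its own.
    """
    files, current, current_len = [], [], 0
    for name, length in seqs:
        if current and current_len + length >= max_file_len:
            files.append(current)
            current, current_len = [], 0
        current.append(name)
        current_len += length
    if current:
        files.append(current)
    return files


ONE_G, SIXTY_M = 1_000_000_000, 60_000_000      # assumed to be counted in bases

genome1 = [(f"A{i}", 30_000_000) for i in range(1, 6)]                  # made-up lengths
genome2 = [("B1", 40_000_000), ("B2", 30_000_000), ("B3", 25_000_000)]  # made-up lengths

for label, limit1 in (("Concat checked", ONE_G), ("Concat unchecked", SIXTY_M)):
    files1 = pack_sequences(genome1, limit1)
    files2 = pack_sequences(genome2, SIXTY_M)
    # Every file of genome 1 is aligned against every file of genome 2.
    print(label, len(files1) * len(files2), "MUMmer alignments")
```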

Mask <abbrev>

Mask out all non-genic parts of the sequences before running MUMmer (gene annotation must be provided).

The <abbrev>, which is set in the Project parameters popup Abbreviation parameter, is used to determine which sequence will be masked.

Both sequences may be masked, which results in very fast execution and gene-based synteny.

If Mask is changed after A&S, the alignment files need to be removed with Clear Pair and A&S run again.

Default: Off

Concat: The following statistics are from comparing Arabidopsis thaliana (119M) against Brassica rapa (297M) on macOS using 1 CPU.

Statistic	Concatenated	Not concatenated
Hits	48819	48846
Synteny blocks	334	334
Gene hits	46319	46348
Synteny hits	38334	38345
Run time	1 hour 8 minutes	1 hour 35 minutes

MUMmer parameters

The default MUMmer parameters seem to work fine with SyMAP, so they probably do not need changing.

Parameter	Description	Default
PROmer Args¹	Arguments for PROmer	-
NUCmer Args¹	Arguments for NUCmer	-
Self Args²	Arguments to use when aligning a chromosome to itself	-
PROmer Only³	Use PROmer for all alignments	Off
NUCmer Only³	Use NUCmer for all alignments	Off

¹ BEWARE: Entered PROmer and NUCmer arguments are NOT checked for correctness. See MUMmer parameters.
² When self-alignment is performed, standard arguments are used when comparing different chromosomes. However, additional arguments may be desired when a chromosome sequence is run against itself, e.g. --nosimplify.
³ By default, PROmer is used for alignments between different projects, while NUCmer is used for self alignments.

All MUMmer files except those with the .mum suffix are removed by symap. To keep them, use the "-mum" command line parameter, i.e.

 ./symap -mum

3.c Cluster Hits


3.c.I Algo1 vs Algo2

Algorithm 1 (modified original, abbreviated Algo1):
Pros	This is a generic algorithm that distinguishes genic from intergenic hits. It is recommended for ordering sequence contigs and when there is little or no gene annotation. It must be used for self-synteny. It has been used on hundreds of genome comparisons.
Cons	It does not distinguish between exon and intron hits. It is more likely to miss good homologous gene pairs.
Parameters	It only has one parameter, which makes it easier to run, but there is no control over what hits are filtered.
Algorithm 2 (exon-intron, abbreviated Algo2):
Pros	This is a new algorithm with explicit knowledge of gene pairs and their exon-intron structure. When there are good gene annotations for both genomes, this is definitely the superior algorithm. It takes less memory.
Cons	It does not perform self-synteny. It does not work when a given chromosome is split over multiple MUMmer files; this will NOT happen when SyMAP generates the MUMmer files.
Parameters	It has two sets of parameters, hence, more control over results than Algo1. See Hints below the parameter explanation; the parameters generally do not need adjusting.

Algo1 is the default for self-synteny and if there is no annotation; else Algo2 is the default.

Wrong strand: A cluster is on the wrong strand when all of its hits are to the same strand (++/--) yet it aligns to two genes on different strands (+-/-+), or vice versa.

Algo1 includes these hits. You can view them in the Queries where the Hit St column will be different than the two gene Gst columns.

Algo2 does NOT include these hits. You can request to view the potential hits during the A&S by running it with the "-wsp" flag, i.e. ./symap -wsp. This will only show gene pairs with (1) multiple hits to exons (in one or multiple gene pairs), (2) at least one is not an overlapping gene. It is up to the user to determine what is real.
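
The wrong-strand definition above reduces to a comparison of two same/different flags. The following sketch uses a hypothetical function and a two-character strand string per pair; it illustrates the definition only and is not how SyMAP stores strands.

```python
def is_wrong_strand(hit_strands, gene_strands):
    """True when the hit strands and the gene strands disagree.

    Each argument is a two-character string such as '++' or '+-':
    first character for genome 1, second for genome 2.
    """
    hits_same = hit_strands[0] == hit_strands[1]
    genes_same = gene_strands[0] == gene_strands[1]
    return hits_same != genes_same


print(is_wrong_strand("++", "+-"))   # hits on same strand, genes on different strands
print(is_wrong_strand("+-", "-+"))   # both different: consistent
```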

Hints about parameter settings

Hint for Algo1: Increasing the Top N parameter can cause too many hits and reduce synteny. Decreasing it can remove more gene-pair hits. Hence, try Algo2 if you want more gene pairs.

Hint for Algo2: On the output to the terminal (in Verbose mode), if any chromosome pair shows over 10,000 hits, the parameters probably need to be made more stringent. Too many hits confuse the synteny algorithm, which results in synteny blocks not being found; it also results in very long execution times.

Suggestion: For large genomes, experiment with the parameters on just one pair of the chromosomes. (You can use xToSymap for the split.)

I have experimented with the datasets: (1) human, chimpanzee, mouse (2) Arabidopsis, Brassica rapa, Brassica oleracea. Only B. rapa to B. oleracea needed parameter adjustment: the number of G1 hits was over 200k, which is way more than typical; by increasing all parameters a small amount, this reduced to just over 100k.

3.c.II Parameter description

Defaults:

Algorithm 1 (Algo1): If at least one project does not have annotation or it is self-synteny.
Algorithm 2 (Algo2): If both projects have annotation and it is not self-synteny.

The image on the left shows the defaults for Algorithm 2.

Parameter	Description

Number Pseudo

If selected, the un-annotated ends of hits will be assigned a pseudo number. This is explained below in Pseudo genes.

Algo1 (original)

Top N piles

It will retain the top N hits of a pile of overlapping hits (Pile of Hits), as well as all hits with score at least 80% of the Nth hit.
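
The Top N rule for a pile can be sketched as follows. The function name and the scores are hypothetical, and the sketch takes the description literally: keep the N best-scoring hits, plus any further hit whose score is at least 80% of the Nth hit's score.

```python
def top_n_pile(scores, n, fraction=0.8):
    """Return the scores retained from one pile of overlapping hits."""
    if n < 1:
        raise ValueError("n must be at least 1")
    ranked = sorted(scores, reverse=True)
    if len(ranked) <= n:
        return ranked
    cutoff = fraction * ranked[n - 1]          # 80% of the Nth hit's score
    return ranked[:n] + [s for s in ranked[n:] if s >= cutoff]


pile = [50, 100, 70, 90, 80]                   # made-up hit scores
print(top_n_pile(pile, n=2))
```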

Algo2 (gene-centric)

Scale

Scale	Description
Gene	Determines the percentage of required gene coverage. The percentage is based on N*internal-parameters.
Exon	Same as above except for exon.
Len	For G2 and G1^, minimum size hit unless it almost completely covers a gene. The minimum is N*300bp. With default N=1.0, the minimum is 300bp.
G0_Len	For G0^, minimum size hit. The minimum is N*1000bp. With default N=1.0, the minimum is 1000bp. Suggestion: increase for closely related species.
^ G2=gene to gene, G1=gene to non-gene, G0=non-gene to non-gene

Increase a scale to filter out more clustered hits, decrease to filter out less.
Only one end of the hit may pass the rules.
The internal rules balance coverage and length for the two ends.
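
For the two length scales, the minimum hit size follows directly from the base values given in the table (300bp for G2/G1, 1000bp for G0). The sketch below only covers that arithmetic; the function name is hypothetical, and it does not model the Gene and Exon scales (their internal parameters are not given here) or the exception for hits that almost completely cover a gene.

```python
LEN_BASE_BP = 300        # base minimum for G2 and G1 hits (Len scale)
G0_LEN_BASE_BP = 1000    # base minimum for G0 hits (G0_Len scale)


def min_hit_length(scale, hit_class):
    """Minimum hit length in bp for a scale value N and a hit class."""
    if hit_class not in ("G2", "G1", "G0"):
        raise ValueError(f"unknown hit class: {hit_class}")
    base = G0_LEN_BASE_BP if hit_class == "G0" else LEN_BASE_BP
    return scale * base


for n in (1.0, 1.5):
    print(n, min_hit_length(n, "G2"), min_hit_length(n, "G0"))
```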

Keep piles

EE, EI, En, II, In (E=exon, I=intron, n=non-gene)

This ONLY applies when there is a pile of overlapping hits (Pile of Hits);
it tells the algorithm what type of cluster hits to retain if they are in a pile.
Hits are filtered before pile analysis.
Intergenic-intergenic pile hits, and any unchecked categories, are filtered as described for Top N piles.

Top N piles

Algo2 uses the Algo1 Top N parameter for any unchecked categories, but in a more conservative way. It will retain the Top N hits of a piled region that have lengths within 80% of the longest hit.

3.c.III Pseudos and Piles

Pseudo genes

The end of a hit may not overlap an annotated gene; by default, this will just show a Gene# of 'N.~' where N is the chromosome number.

If Number Pseudo is selected, a pseudo Gene# will be assigned. The counts start after the annotated gene numbers and are suffixed by "~". For example, if the last Gene# for Chr03 is 5550 (e.g. 3.5550.), the first pseudo gene number will be 6000 (e.g. 3.6000.~).

If A&S was run without numbered pseudos, go to the Pair Parameter panel, and select Number Pseudo in the lower-left drop-down; only this algorithm will be run. This cannot be undone; you would need to re-run A&S with the Number Pseudo unchecked to remove them.

Pros	Use numbered pseudos if you would like the Queries Cluster and Report to include un-annotated hits. Numbered pseudos are also easier to track if you are exploring new candidate genes or if your genome is not annotated.
Cons	The Queries results can be easier to view with the 'N.~' as it is more distinct from a real Gene#.

If comparing more than 2 species, it makes the most sense to have them all numbered or not numbered (though a mix will work).

Piles of Hits

The image below shows a pile of hits on the left (Cabb C5) that link to repetitive genes on the right (Arab Chr02). These are important to keep.

The right image shows a pile of hits in an intergenic region (Cabbage Chr03) to multiple other regions (B.rapa Chr01). There are MANY occurrences of repeats like this in the MUMmer file, which is why these piles must be filtered; if they are not, the synteny algorithm does not perform well.

3.d Synteny

The image on the left shows the defaults. The one exception is that Strict is turned off for draft sequence.

Parameter	Description
Min Hits	Minimum number of anchors required to define a synteny block.
Strict	This uses the Original algorithm with the following changes: It uses smaller gap sizes and a stricter PCC cutoff. It first computes the blocks based on orientation. If Orient is not checked, it checks blocks and collinear sets to see if they are contained in another block, and if so, merges them regardless of orientation; this merge happens regardless of the Merge setting.
Orient	All hits in a block must have the same orientation: either all same-strand ('+/+' or '-/-') or all opposite-strand ('+/-' or '-/+').
Merge	Overlap: The blocks must overlap to be merged. Close: Blocks that overlap or are close will be merged.
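
The Orient constraint can be expressed as a simple check on the hits of a candidate block. This is a hypothetical helper for illustration, not the synteny algorithm itself: it only tests whether all hits fall in the same orientation class.

```python
SAME_STRAND = {"+/+", "-/-"}


def block_is_oriented(hit_orientations):
    """True if every hit is same-strand, or every hit is opposite-strand."""
    classes = {orientation in SAME_STRAND for orientation in hit_orientations}
    return len(classes) <= 1


print(block_is_oriented(["+/+", "-/-", "+/+"]))   # all same-strand
print(block_is_oriented(["+/+", "+/-"]))          # mixed orientations
```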

If the Align&Synteny has already been executed, and you want to try different synteny parameters, you may just run the synteny algorithm as described in Save.

See Synteny results for comparison of using the different synteny parameters. The following is a brief comparison of three images of the same regions when evaluated with the following 3 parameter sets:

Default (one block) Same orient (three blocks) Same orient with Merge (two blocks).
In the last image, the reverse orientation block is embedded in another block

Order against

Draft sequences may be ordered against another project. See Ordering details.

The Draf->Seq2 and Seq2->Draf use the Abbreviation set in the Project parameters panel. The "->" indicates that the first sequence will be ordered against the second.

If the draft has been aligned to the Order against sequence, but this option was not set, it can be set and the Synteny Only setting used (described below in 3.e Save).

Hints about synteny parameter settings

It is easier to experiment with synteny parameters since Synteny Only can be used to speed it up. For two genome complete sequence synteny, it is strongly suggested you start with Strict.

→ Suggestion: For large genomes, experiment with the parameters on just one or two pairs of the chromosomes; you can use xToSymap for the split. Note: the synteny results of one or two chromosomes will be slightly different compared to whole genome synteny.

3.e Save

Lower-left drop-down

If the Align&Synteny has already been run, the lower left side of the parameters window will have a drop-down with the following options:

Clust&Synteny	If you have changed the clustering parameters, select this option.
Synteny Only	If you have only changed synteny parameters, select this option.
Number Pseudo	If you want to add pseudo numbers, select this option.

Manager: The Selected Pair button label changes to reflect the drop-down setting, as follows:

  Clust&Synteny → Selected Redo
  Synteny Only → Synteny Only
  Number Pseudo → Pseudo Only

If you have changed the alignment parameters, you must remove the existing alignments using the Clear Pair option on the Manager panel.

Saving pair parameters

Before the A&S is executed, the parameters are saved in

  data/seq_results/<proj1-to-proj2>/params.txt

Once the A&S is executed, the parameters are stored in the database.

The file parameters are shown on the pair Parameter panel. These can only be viewed and changed using symap (not viewSymap).

Any parameter not the default will be shown on the Summary page.

BEWARE: If you run A&S, then change the PROmer or NUCmer settings, but forget to Clear Pair before running A&S again, the parameters on the Summary page will be wrong (SyMAP does not check for this situation).

4. Synteny parameter comparisons

The following is from comparing Arabidopsis thaliana chromosomes 1 and 2 with Brassica rapa chromosomes 1 and 7. The Clustering hits Algo2 (exon-intron) was used in all cases.

Original vs Strict: Both allow a mix of inverted and non-inverted hits in a block, but Strict only allows small inversions within a non-inverted block and vice-versa (unless Merge is used). The Strict blocks tend to not have the tails of dots that have bigger gaps.
Original	Strict
Original chr1&chr7	Strict chr1&chr7

Orient: This requires all hits in a block to be in the same orientation. This option fits with what some other software packages consider 'synteny'.
Original Orient chr1&chr7	Strict Orient chr1&chr7

Merge Overlap: Only overlapping blocks can be merged. This will have little effect if Orient has been used since mixed blocks cannot be merged.
Original Merge Overlap chr1&chr7	Strict Merge Overlap chr1&chr7

Merge Close: Blocks that are close can be merged.
Original Merge Close chr1&chr7	Strict Merge Close chr1&chr7


Email: cas1@arizona.edu