  Input

  Input files
  Sequence
     Identifier (seqid)
     Masking
  Annotation
  Location of files
      Load Project
      xToSymap
  Interface
  Summarize
  Convert
  Lengths
  Split
      Convert
  NCBI
  Ensembl
  Other
  General
   Secondary name
   Loading the original files
          Tested Genomes
   Editing the convert scripts

Input files

The input is one or more FASTA files of sequences (genome, scaffold, contig), with optional GFF formatted annotation file(s).

Sequence files

The sequence file(s) must be FASTA format with one or more sequences. A header line, which starts with a ">", occurs before the sequence, e.g.

  >Chr01 NC_003070.9 chromosome
  ccctaaaccctaaaccctaaaccctaaacctctGAATCCTTAATCCCTAAATCCCTAAATCTTTAAATCCTACATCCATG
  AATCCCTAAATACCTAAttccctaaacccgaaaccggTTTCTCTGGTTGAAAATCATTGTGtatataatgataattttat...
The first word (Chr01) is used as the sequence identifier (seqid) in symap. If the seqid is longer than 20 characters, the sequence will be ignored. The second word (NC_003070.9) is the secondary name; see Secondary name.

Identifier (seqid)

Important points in naming sequences for SyMAP:
A. Sequence identifiers can only contain letters, numbers, underscore "_", dash "-" or period ".". This is in contrast with gff3 files, which allow the seqid to contain the additional characters [:^*$@!+?|]. The sequence identifier (seqid) must be renamed if it contains any of these characters.
B. The sequence identifiers must exactly match those used in the annotation files (first column), or the annotations will not be loaded.
C. Use a consistent prefix such as "Chr" for all sequences, then set Group prefix to that prefix in the project's Parameter panel.
D. If there is not a consistent prefix, you may leave the Group prefix blank; beware, this can have unintended results, so it should be avoided if possible. Make the names short so they will not clutter the display. If you need to rename your sequences, the renaming must be done in both the FASTA and GFF files.
E. The xToSymap NCBI and Ensembl conversion programs will create a short name for the sequences and provide the renaming in both the FASTA and GFF files. One of these programs may also work for files created elsewhere; see Other inputs.
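
The naming rules above can be checked before loading. The following is a minimal sketch, not part of SyMAP: the function name `check_seqid` and the wording of the problem messages are hypothetical, and the 20-character limit and the allowed character set are taken from the text above.

```python
import re

# Allowed characters per the naming rules above: letters, digits, "_", "-", "."
SEQID_PATTERN = re.compile(r"^[A-Za-z0-9_.\-]+$")
MAX_SEQID_LEN = 20  # seqids longer than this are ignored on load, per the text


def check_seqid(header_line):
    """Return (seqid, problems) for one FASTA header line (hypothetical helper)."""
    text = header_line[1:] if header_line.startswith(">") else header_line
    words = text.split()
    if not words:
        return "", ["header has no identifier"]
    seqid = words[0]
    problems = []
    if len(seqid) > MAX_SEQID_LEN:
        problems.append("longer than 20 characters")
    if not SEQID_PATTERN.match(seqid):
        problems.append("contains characters other than letters, digits, _ - .")
    return seqid, problems
```

Running this over every header line of a FASTA file gives a quick list of sequences that would need renaming in both the FASTA and GFF files.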

Masking

Masked sequence:

Mask genes:

Annotation files

Annotation files should be in gff3 format, which is a tab-delimited file of 9 columns.

  Seqid      The first column must exactly match the sequence identifier in the FASTA files.
  Type       The third column determines how SyMAP uses the entry.
             Only types gene, mRNA, exon, gap or centromere are recognized.
  Attribute  The last column contains "keyword=value" pairs describing the annotation.
             a. Genes and mRNAs must have an ID= keyword, and mRNAs and exons must have a Parent= keyword.
             b. For genes, by default all attributes are saved in the database for viewing. You can set which attribute keywords to save by opening the Project Parameter panel and entering them for the parameter Anno keywords.
  Order      Each mRNA and its exons must come after the parent gene and before the next gene.
             The mRNA must be before its exons, but it is okay if all the mRNAs for a gene are listed first.
             Exons are entered in the order they are found; they must be either ascending or descending.

For example (extracted from Ensembl gff3 file):

1   araport11   gene   11649   13714   .   -   .   ID=gene:AT1G01030;Name=NGA3;....
1   araport11   mRNA   11649   13714   .   -   .   ID=transcript:AT1G01030.2;Parent=gene:AT1G01030....
1   araport11   three_prime_UTR   11649   11863   .   -   .   Parent=transcript:AT...
1   araport11   exon   11649   12354   .   -   .   Parent=transcript:AT1G01030.2....
1   araport11   CDS   11864   12354    .   -   2   ID=CDS:AT1G01030.2;Parent=transc...
1   araport11   CDS   12424   12940    .   -   0   ID=CDS:AT1G01030.2;Parent=transc...
1   araport11   exon   12424   13173   .   -   .   Parent=transcript:AT1G01030.2....
1   araport11   five_prime_UTR   12941   13173   .   -   .   Parent=transcript:AT...
1   araport11   exon   13335   13714   .   -   .   Parent=transcript:AT1G01030.2....
1   araport11   five_prime_UTR   13335   13714   .   -   .   Parent=transcript:AT...
1   araport11   mRNA   11649   13714   .   -   .   ID=transcript:AT1G01030.1;Parent=gene:AT1G01030....
1   araport11   three_prime_UTR   11649   11863   .   -   .   Parent=transcript:AT...
1   araport11   exon   11649   13173   .   -   .   Parent=transcript:AT1G01030.1....
1   araport11   CDS   11864   12940    .   -   0   ID=CDS:AT1G01030.1;Parent=transc...
1   araport11   five_prime_UTR   12941   13173   .   -   .   Parent=transcript:AT...
1   araport11   exon   13335   13714   .   -   .   Parent=transcript:AT1G01030.1....
1   araport11   five_prime_UTR   13335   13714   .   -   .   Parent=transcript:AT...

For the above example, the following information will be saved in the SyMAP database:

1   gene   11649   13714   ID=AT1G01030;Name=NGA3;rnaID=AT1G01030.2;desc=....
1   exon   11649   12354
1   exon   12424   13173
1   exon   13335   13714
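
To illustrate the column rules above, here is a rough sketch of reading one GFF3 line, keeping only the recognized types, and checking the ID= and Parent= requirements. It is not the SyMAP loader; the names `parse_gff3_line` and `missing_keywords` are hypothetical, and the sketch assumes real tab characters between the 9 columns.

```python
# Types recognized by SyMAP, per the text above.
RECOGNIZED_TYPES = {"gene", "mRNA", "exon", "gap", "centromere"}


def parse_gff3_line(line):
    """Return a dict for a recognized GFF3 line, or None if it is skipped."""
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[2] not in RECOGNIZED_TYPES:
        return None
    attrs = {}
    for pair in cols[8].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return {
        "seqid": cols[0],
        "type": cols[2],
        "start": int(cols[3]),
        "end": int(cols[4]),
        "strand": cols[6],
        "attrs": attrs,
    }


def missing_keywords(record):
    """List required attribute keywords (ID, Parent) absent from a record."""
    required = []
    if record["type"] in ("gene", "mRNA"):
        required.append("ID")
    if record["type"] in ("mRNA", "exon"):
        required.append("Parent")
    return [key for key in required if key not in record["attrs"]]
```

Lines of type CDS, three_prime_UTR and five_prime_UTR return None here, which mirrors the fact that only the gene and exon rows appear in the saved example above.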


Location of files

Project name and directory:

Location:

Each project has a directory as follows:

  /data/seq/<project-name>

The default location for sequence and annotation files is:

  /data/seq/<project-name>/sequence
  /data/seq/<project-name>/annotation
Create project: Use one of the following ways to indicate to symap where the project's input files are:
  1. Default: Create these sub-directories under /data/seq and put your files there,
    e.g. using project-name=foobar:
      cd data/seq
      mkdir foobar
      cd foobar
      mkdir sequence
      mkdir annotation
    
    Move your FASTA file(s) to data/seq/<project-name>/sequence (e.g. data/seq/foobar/sequence) and your optional GFF file(s) to data/seq/<project-name>/annotation (e.g. data/seq/foobar/annotation).

  2. xToSymap: This program puts the files in the default locations; see xToSymap.

  3. Link: Create these sub-directories under /data/seq and use soft links to point to the file locations,
    e.g. using project-name=foobar:
      cd data/seq
      mkdir foobar
      cd foobar
      ln -s <location of directory of sequence files> sequence
      ln -s <location of directory of annotation files> annotation
    
  4. Add and define:
    • Use the Add project button on the symap interface (lower-left corner; see Manager). This adds the project name to the data/seq directory.
    • Use the project's parameter panel to enter the location of the sequences and optional annotation files into the Sequence files and Anno files parameters.
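
The default layout from option 1 can also be created from a script. This is a small sketch equivalent to the mkdir commands shown there; the function name `create_project_dirs` and the `symap_root` argument (the directory that contains data/seq) are hypothetical.

```python
from pathlib import Path


def create_project_dirs(symap_root, project_name):
    """Create data/seq/<project-name>/sequence and /annotation under symap_root."""
    project = Path(symap_root) / "data" / "seq" / project_name
    for sub in ("sequence", "annotation"):
        # exist_ok makes the call safe to repeat on an existing project
        (project / sub).mkdir(parents=True, exist_ok=True)
    return project
```

For example, `create_project_dirs(".", "foobar")` run from the symap directory would produce data/seq/foobar/sequence and data/seq/foobar/annotation, ready for the FASTA and GFF files.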

Load Project

For options 1-3, it is not necessary to enter the locations of the files in the project parameter panel since they use the default locations.

All sub-directories in data/seq are shown on the left-panel under Projects using its Display name, which is set in the project's parameters.

Select a Project to show it on the right-hand side; the project-name is shown beside the label Directory.

Load project accesses the files and loads them into the database. Once loaded, the View option will be present. Always check the results before continuing!!

The symap program requires the directories to be present (and will create them if necessary); viewSymap does not access the directories.


xToSymap Interface


xToSymap was added to the SyMAP tar file in release v5.5.7 and updated in v5.5.8.

Project Directory: All options and function buttons will be disabled until a project directory is selected.

The options Converted, Lengths and Split are disabled when there is no /sequence sub-directory, which is created when Convert is executed.

Convert options:

  • When NCBI is selected, all the options relevant to it will be enabled. Similarly for Ensembl.
  • NCBI has the Hard Mask option and Ensembl has the Only #,X,Y,I option.
  • The rest of the options work for both.
  • See NCBI and Ensembl for a description of the options.
Perform the following steps:
1. Summarize: This outputs basic statistics. Run this first to make sure your input is valid.
2. Convert: The NCBI and Ensembl options create the /sequence and /annotation sub-directories with files ready for input to symap. If these sub-directories exist, any existing FASTA and GFF files will first be removed.
3. Summarize with Converted: Run this to make sure that the conversion worked as you expected. This shows the sequence prefixes, and can be used to set the Group Prefix in the symap Project Parameters.

Optional: the following run on the files in the /sequence and /annotation sub-directories.
4. Lengths: Run it to see if you need to set a Minimal lengths value in symap Project Parameters.
5. Split: Splits the converted files genomic.fna and anno.gff into a file per chromosome.

NOTE: The NCBI and Ensembl files have variations and I may not have accounted for them all. If Convert does not work as you wish, the files ConvertNCBI.java and ConvertEnsembl.java are available for Edit.

Summarize

Summarize will look in the following directories for the FASTA and optional GFF files to process, in the following order (this assumes the project name is cabb):
  1. Directly under the project's directory, e.g.
     symap_5/data/seq/cabb
        Brassica_oleracea.BOL.dna_sm.toplevel.fa.gz
        Brassica_oleracea.BOL.59.gff3.gz
    
  2. The project's ncbi_dataset/data/<sub-directory>, e.g.
     symap_5/data/seq/cabb/ncbi_dataset/data/GCF_000695525.1
        GCF_000695525.1_BOL_genomic.fna
        genomic.gff
    
  3. The project's sub-directories /sequence and /annotation, e.g.
     symap_5/data/seq/cabb
        sequence/
           genomic.fna
        annotation/
           anno.gff
    
The FASTA file must end in .fas, .fa, .fna, .fasta, .seq (with optional .gz).
The GFF file must end in .gff or .gff3 (with optional .gz).
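
The suffix rules above can be expressed as a short check. This is a sketch only; the function name `classify_input_file` is hypothetical, and it assumes the match is case-sensitive, which the text does not state.

```python
FASTA_SUFFIXES = (".fas", ".fa", ".fna", ".fasta", ".seq")
GFF_SUFFIXES = (".gff", ".gff3")


def classify_input_file(filename):
    """Return "fasta", "gff" or None based on the file suffix."""
    name = filename
    if name.endswith(".gz"):
        name = name[: -len(".gz")]  # the .gz suffix is optional for both kinds
    if name.endswith(FASTA_SUFFIXES):
        return "fasta"
    if name.endswith(GFF_SUFFIXES):
        return "gff"
    return None
```

With the example files above, Brassica_oleracea.BOL.dna_sm.toplevel.fa.gz is classified as FASTA and Brassica_oleracea.BOL.59.gff3.gz as GFF.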

Summarize with Converted assumes there are converted files.

The following is an example terminal and log file output:

------ Summary for ./data/seq/cabbN ------
Log file to  ./data/seq/cabbN/xSummary.log
Sequence directory: ./data/seq/cabbN/ncbi_dataset/data/GCF_000695525.1      29-Jul-2024 10:16
   GCF_000695525.1_BOL_genomic.fna
Annotation directory: ./data/seq/cabbN/ncbi_dataset/data/GCF_000695525.1
   genomic.gff

FASTA sequence file(s)
Example header lines:
   Chr:  >NC_027748.1 Brassica oleracea var. oleracea cultivar TO1000 chromosome C1, BOL, ...
   Scaf: >NW_013617415.1 Brassica oleracea var. oleracea cultivar TO1000 unplaced genomic ...
   Mt:   >NC_016118.1 Brassica oleracea mitochondrion, complete genome

Count Totals:
      9 Chromosomes
 32,876 Scaffolds
      1 Mt/Pt
 32,886 Total sequences

Count Prefixes:
 32,876 NW_
     10 NC_

Counts of Length Ranges:
   32,284<10k   552<100k   41<1M   <10M   5<50M   4<100M
   488,954,160 Total length

GFF Annotation file(s)
Summary:
  44,386 Genes from 49,563    (use protein_coding only)
  44,382 mRNAs from 56,687    (44,382 has protein_id)
 227,191 Exons from 398,922

Gene Attribute:
  44,386 Dbxref                     44,386 ID
  44,386 Name                          569 end_range
       7 exception                  44,386 gbkey
  44,386 gene                       44,386 gene_biotype
      81 locus_tag                       7 part
     985 partial                       499 start_range
mRNA Attribute:
  44,305 product

Input type: NCBI chr NC_ prefix; NCBI GFF header
            NCBI 'gene_biotype' keyword; NCBI mRNA 'product' keyword
------ Finish summary for ./data/seq/cabbN/ ------
Observations:
1. The last line confirms that the NCBI radio button should be selected for Convert.
2. It shows that there are 9 chromosomes to be converted to symap files.
3. It confirms that an mRNA and exons are found for each protein-coding gene.
4. On the first example line, it has 'chromosome C1'. Since the 'chromosome' is followed by 'C1' instead of a digit (i.e. 1), the chromosome identifier will be 'C1' in the conversion.
5. After conversion, the summary will be like this log file. It includes a remark that the file may be hard-masked, which was not done by ConvertNCBI; strangely, the chromosomes in the NCBI file were hard-masked but the scaffolds were soft-masked. If the summary had been computed with Verbose, it would show the counts of all uppercase and lowercase nucleotides, which would show zero lowercase and many N's.

Verbose

The Verbose summary log for the Ensembl cabbage files looks like this Ensembl summary log. Verbose mode takes longer to execute, but provides the following extra information:
As shown in Tested genomes, it appears to be more common not to have a letter before the chromosome number (e.g. chromosome 1), in which case the chromosome identifier will be Chr01. For example, for Homo sapiens, see the NCBI summary log compared to the Ensembl summary log (both are Verbose output).
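
The two naming cases described above ('chromosome C1' becomes C1, 'chromosome 1' becomes Chr01) can be sketched as follows. This is only an illustration of the stated behavior, not the code in ConvertNCBI.java; the function name `chromosome_id` is hypothetical, the two-digit zero padding is inferred from the Chr01 example, and other tokens such as X are returned unchanged here, which may differ from what Convert does.

```python
import re


def chromosome_id(header_line):
    """Derive an identifier from the word that follows 'chromosome' (sketch)."""
    match = re.search(r"\bchromosome\s+(\w+)", header_line)
    if match is None:
        return None  # e.g. scaffold or mitochondrion headers
    token = match.group(1)
    if token.isdigit():
        return "Chr%02d" % int(token)  # 'chromosome 1' -> 'Chr01'
    return token  # 'chromosome C1' -> 'C1'
```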

Summarize shows a few of the header lines, and with Verbose checked, more are shown. This should aid in determining how to set the options. If this is not enough, execute the following on your FASTA file:

   zgrep ">" [use your FASTA file name]
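
Where zgrep is not available, the same header listing can be produced with a few lines of Python. This is a sketch using the standard gzip module; the function name `fasta_headers` and its `limit` argument are hypothetical. Unlike zgrep ">", it only reports lines that start with ">".

```python
import gzip


def fasta_headers(path, limit=None):
    """Return the header lines of a plain or gzipped FASTA file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    headers = []
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                headers.append(line.rstrip("\n"))
                if limit is not None and len(headers) >= limit:
                    break  # stop early on large genomes
    return headers
```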

Convert

1. Select NCBI or Ensembl.
2. Select the options you want; see NCBI or Ensembl options.
3. Select Convert.

Lengths

The FASTA files must be in the project's /sequence sub-directory (e.g. symap_5/data/seq/cabbN/sequence) for this to work. It reads the FASTA file(s), prints out the names and lengths of all sequences, then writes a summary.

Summary: If your file has many small scaffolds and you just want to process the large ones, the summary will help you decide what value to use for the Minimum length parameter. For example, the following is output from Lengths:

Values for parameter 'Minimal length' (assuming no duplicate lengths):

#Seqs Minimum length
   10  550,871
   20  193,719
   30  152,041
   40  122,531
   50   98,180
   60   85,914
   70   70,524
   80   65,697
   90   61,049
  100   58,280
If you set the Minimal length at 550,871, SyMAP will process the 10 largest sequences.
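
The relationship between the number of sequences and the length threshold can be reproduced with a short script. This is a sketch of the idea, not the xToSymap Lengths code; the function names `sequence_lengths` and `minimum_length_table` are hypothetical, and the input is assumed to be an iterable of FASTA lines.

```python
def sequence_lengths(fasta_lines):
    """Map each seqid (first word of the header) to its sequence length."""
    lengths = {}
    current = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            words = line[1:].split()
            current = words[0] if words else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def minimum_length_table(lengths, step=10):
    """Return (number of sequences, length of the smallest one kept) rows."""
    ordered = sorted(lengths.values(), reverse=True)
    # Row n gives the length of the n-th longest sequence: using it as the
    # threshold keeps the n longest sequences, assuming no duplicate lengths.
    return [(n, ordered[n - 1]) for n in range(step, len(ordered) + 1, step)]
```

Each row has the same meaning as a row in the output above: the first value is how many sequences remain, the second is the length to enter as the threshold.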

Split

The input files are:
	sequence/genomic.fna
	annotation/anno.gff
The function splits each input file into one file per chromosome. The resulting files work as input to symap. This is useful when only one or a few chromosomes are to be processed in symap, i.e. remove the ones you do not want processed.
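
As an illustration of splitting a FASTA file by sequence, here is a minimal sketch. It is not the xToSymap Split code; the function name `split_fasta` and the output naming (<seqid>.fna) are hypothetical, and a repeated seqid would overwrite the earlier file.

```python
from pathlib import Path


def split_fasta(fasta_path, out_dir):
    """Write each sequence of fasta_path to its own file in out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    out = None
    try:
        with open(fasta_path) as handle:
            for line in handle:
                if line.startswith(">"):
                    if out is not None:
                        out.close()
                    words = line[1:].split()
                    seqid = words[0] if words else "unnamed"
                    target = out_dir / (seqid + ".fna")  # hypothetical naming
                    out = open(target, "w")
                    written.append(target)
                if out is not None:
                    out.write(line)  # header and sequence lines are copied as-is
    finally:
        if out is not None:
            out.close()
    return written
```

The matching annotation split would select the GFF lines whose first column equals the same seqid.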

General

Secondary name

A secondary name can be assigned during the symap Load Project action. If a sequence header has two or three words, the second word will be used as the secondary name as long as it is <=20 characters. The following example shows the header line of an NCBI xToSymap converted file.

  >Chr01 NC_003070.9 chromosome
The symap Load Project will make Chr01 the unique identifier and NC_003070.9 will be its secondary name. This secondary name will be shown on the View popup.
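
The rule for headers with two or three words can be sketched as follows. This only illustrates the rule as described here, not the Load Project code, and it does not cover the separate handling of original NCBI headers; the function name `parse_header_names` is hypothetical.

```python
def parse_header_names(header_line, max_len=20):
    """Return (seqid, secondary_name or None) for one FASTA header line."""
    text = header_line[1:] if header_line.startswith(">") else header_line
    words = text.split()
    if not words:
        return None, None
    seqid = words[0]
    secondary = None
    # The second word is used only for headers of two or three words,
    # and only if it is at most max_len characters long.
    if 2 <= len(words) <= 3 and len(words[1]) <= max_len:
        secondary = words[1]
    return seqid, secondary
```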

Ensembl xToSymap converted files do not have a secondary name.

Note: The secondary name was introduced in v5.8.2.

Original files

Load Original NCBI File: Though it is not recommended, the original NCBI files will probably load. An example header line will be:

  >NC_003070.9 Arabidopsis thaliana chromosome 1 sequence
In this case, NC_003070.9 will be its identifier and "Chr01" will be its secondary name. The secondary name is only assigned if the value "chromosome " or "scaffold " exists on the header line.

Load Original Ensembl File: This will usually work, but generally no secondary name is assigned. However, in some cases it does assign one incorrectly, which will look weird in View.

Loading the original files can lead to small problems. For example,

Tested genomes

In Aug 2024, the scripts were tested on the following:

Genome                              NCBI               Ensembl            Sequence bp in file  Note
Arabidopsis thaliana (thale cress)  Chr 1-5            Chr 1-5            119M
Brassica oleracea (wild cabbage)    C1-C9              C1-C9              489M                 C prefixes chr#
Oryza sativa (rice)                 Chr 1-12           Chr 1-12           386M
Prunus persica (peach)              G1-G8              G1-G8              228M                 G prefixes chr#
Prunus yedoensis                    Scaffolds 4015     N/A                319M                 Draft
Caenorhabditis elegans (worm)       Chr I,II,III,IV,X  Chr I,II,III,IV,X  100M
Bemisia tabaci (whitefly)           Scafs 19,750       Contigs 227        615M
Danio rerio (zebrafish)             Chr 1-25           Chr 1-25           52,633M              Contains 930 ALT CHR
Oryctolagus cuniculus (rabbit)      Chr 1-21, X        Chr 1-21, X        2,841M
Homo sapiens (human)                Chr 1-22, X, Y     Chr 1-22, X, Y     3,298M

Other species have been downloaded and tested since 2024.

See Tested datasets and timings for the Alignment&Synteny compute times for a range of species pairs.

xToSymap - exceptions to the defaults

The above downloaded genomes were all converted with the xToSymap defaults, with the following exceptions:

Input variations: There are probably more variations than shown here, which may possibly be handled with the Ensembl Only #,X,Y,I option or the NCBI/Ensembl Only prefix option. If not, you may want to edit the script to tailor it (see Edit). Alternatively, you can contact me (cas1@arizona.edu) and I will make the edit for you (it allows me to add your variation to xToSymap).

Load original files

To load the original files into SyMAP, the following was done:

Problem during Load of originals:

A&S with the originals:

Editing the convert scripts

These scripts are simply written, using only standard operations found in all major languages.

Email Comments To: cas1@arizona.edu