NCBI Convert to SyMAP


Overview

This document corresponds to the ConvertNCBI release with SyMAP v5.0.2 and was updated for SyMAP v5.0.7 (dated 4-Feb-2022).

NCBI supplies the genome sequence as FASTA files and the annotation as GFF3 files, which are the input formats for SyMAP. However, using the NCBI files directly can cause problems. The following provides a simple scheme to produce only the files SyMAP needs.

Contents

1. Download

2. Convert files

3. Load files into SyMAP

4. What the ConvertNCBI script does

5. Scaffolds

6. Editing the script

Download

Go to NCBI.
As shown in Fig 1, select "Genome" from the pull-down at the top.
Enter your genome name followed by "Search". You should see a page similar to Fig 2.
Download the FASTA and GFF files. Three approaches:
1. As shown in Fig 2:
  Use the genome link beside the "Download sequences in FASTA format for genome,..."
  Use the GFF link beside the "Download genome annotation in GFF...".
  Process as described in Convert A.
2. As shown in Fig 3:
  Go to the NCBI Datasets page (link at bottom of Fig 2), select the specific genome you want, then "Download". That brings up a window as shown in Fig 3. Select genome and GFF3 for download.
  Process as described in Convert B.
3. Use the RefSeq link in Fig 2 and download the files with the "fna.gz" and "gff.gz" suffixes.
  Process as described in Convert A.
If NCBI only provides the GenBank format, try using bioconvert to convert it to GFF3, then run the SyMAP converter. Alternatively, try Galaxy.

Fig 1. Search the NCBI site.

Fig 2. Download A: download the genome and GFF files.

Fig 3. Download B: from NCBI Datasets, select the genome and GFF3 files.

Convert files

The following conversions were tested on the Oryza sativa NCBI files on 31-Jan-22. The download A genome sequence was soft-masked, but the download B sequence was hard-masked.

Convert files from FASTA and GFF (download A)

Go to the symap_5/data/seq directory.
Make a subdirectory for your species and move the FASTA and GFF files into the directory. Leave the fna.gz and gff.gz suffixes on the files.
From the seq directory, type the following at the command line to copy the ConvertNCBI script:
```
cp ../../scripts/ConvertNCBI*.class .
chmod 755 *.class
```
Execute
```
java ConvertNCBI <species>
```

Example

From the symap_5 directory:

> cd data/seq
> mkdir rice
> cd rice
> mv ~/Download/GCF_001433935.1_IRGSP-1.0_genomic.fna.gz .
> mv ~/Download/GCF_001433935.1_IRGSP-1.0_genomic.gff.gz .
> cd ..
> cp ../../scripts/ConvertNCBI*.class .
> chmod 755 *.class
> java ConvertNCBI rice

This results in the following contents:

data/seq
	ConvertNCBI.class
	ConvertNCBI$Gene.class
data/seq/rice/
      GCF_001433935.1_IRGSP-1.0_genomic.fna.gz
      GCF_001433935.1_IRGSP-1.0_genomic.gff.gz
      annotation/
         gene.gff
         gap.gff
         exon.gff
      sequence/
         genomic.fna

The output gives useful details of the annotation (e.g. see rice details); if the details do not appear right, you may need to edit the script for your genomes.
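A quick way to confirm that the conversion produced the layout listed above is to check for the four output files. The following Python sketch is not part of SyMAP. The function name `missing_outputs` is hypothetical, and the sketch assumes that all four files are always written, which may not hold for every genome.

```python
from pathlib import Path

# Output files listed in this section, relative to data/seq/<species>.
EXPECTED = [
    "annotation/gene.gff",
    "annotation/gap.gff",
    "annotation/exon.gff",
    "sequence/genomic.fna",
]

def missing_outputs(species_dir):
    """Return the expected output files that are absent from species_dir."""
    root = Path(species_dir)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]

# Example call from the symap_5 directory (prints an empty list on success):
#   print(missing_outputs("data/seq/rice"))
```

An empty list means every expected file is present; any names returned are the files to look for in the ConvertNCBI output messages.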

Convert files from ncbi_dataset.zip (download B)

Follow all the steps for download A above, but replace step 2 with the following:

Make a subdirectory for your species and move the ncbi_dataset.zip file to the species directory and unzip it.

Example

From the symap_5 directory:

> cd data/seq
> mkdir rice
> cd rice
> mv ~/Download/ncbi_dataset.zip .
> unzip ncbi_dataset.zip
Archive:  ncbi_dataset.zip
  inflating: README.md
  inflating: ncbi_dataset/data/data_summary.tsv
  inflating: ncbi_dataset/data/assembly_data_report.jsonl
  inflating: ncbi_dataset/data/GCF_001433935.1/chr1.fna
  inflating: ncbi_dataset/data/GCF_001433935.1/chr2.fna
  ...
  inflating: ncbi_dataset/data/GCF_001433935.1/genomic.gff
> cd ..
> cp ../../scripts/ConvertNCBI*.class .
> chmod 755 *.class
> java ConvertNCBI rice

This results in the following contents (some NCBI files not listed):

data/seq
	ConvertNCBI.class
	ConvertNCBI$Gene.class
data/seq/rice/
	ncbi_dataset.zip
	ncbi_dataset/data/GCF_001433935.1/
		chr1.fna
		...
	annotation/
		gene.gff
		gap.gff
		exon.gff
	sequence/
		genomic.fna

The output gives useful details of the annotation (e.g. see rice details); if the details do not appear right, you may need to edit the script for your genomes.

ConvertNCBI optional flags

Flag	Description	Details	Default
-m	Hard-mask	NCBI genome sequences are typically soft-masked; this flag changes them to hard-masked	Leave as soft-mask
-v	Verbose	Print out header lines of skipped sequences	No print
-s	Include Scaffolds in output	See section Scaffolds	No scaffolds
-l	Use linkage groups	Search 'linkage' instead of 'chromosome'	Use chromosomes
-r	Use only RefSeq records	-r and -g can be used together	Use all sources*
-g	Use only Gnomon records	-r and -g can be used together	Use all sources*

*If neither -r nor -g is used, then all sources are used.
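
For readers unfamiliar with the masking terms used for the -m flag: by common convention, a soft-masked sequence marks repeats in lowercase, while a hard-masked sequence replaces them with N. The Python sketch below illustrates that conversion only. It is not taken from ConvertNCBI.java, and the function names are hypothetical.

```python
def hard_mask(sequence_line):
    """Replace soft-masked (lowercase) bases with 'N'; keep everything else."""
    return "".join("N" if base.islower() else base for base in sequence_line)

def hard_mask_fasta(lines):
    """Hard-mask the sequence lines of a FASTA file; pass '>' headers through."""
    for line in lines:
        yield line if line.startswith(">") else hard_mask(line)

example = [">Chr1  NC_029256.1", "ACGTacgtNNacGT"]
print(list(hard_mask_fasta(example)))
# prints ['>Chr1  NC_029256.1', 'ACGTNNNNNNNNGT']
```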

Load files into SyMAP

The above scenario puts the files in the default SyMAP directories. When you start SyMAP, your projects are listed on the left of the panel. Check the projects you want to load, which displays them on the right of the SyMAP window, then continue as described in the System Guide.

What the ConvertNCBI script does

The following occurs in the data/seq/<project directory name> directory, where "project directory name" is the argument supplied to ConvertNCBI. The description is for Download A; Download B works similarly. It assumes no flags are set (e.g. -s for scaffolds).

Reads the file ending in '.fna.gz' (or '.fna') and writes a new file called sequence/genomic.fna with the following changes:
1. Sequences must have the word "chromosome" in their ">" header line in order to be copied (unless -l or -s flags).
2. The header line is replaced with ">ChrN" where N is 1,2... (Note, this assumes that the chromosomes are in order in the file as it does not read the chromosome number from the header line). For example,
```
>NC_029256.1 Oryza sativa Japonica Group cultivar Nipponbare chromosome 1, IRGSP-1.0
```
  is replaced with:
```
>Chr1  NC_029256.1
```
3. Gaps of >30,000 are written to the annotation/gap.gff file (30,000 is hard-coded in ConvertNCBI script).
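
The three steps above can be sketched in Python as follows. This is an illustrative approximation, not the code in ConvertNCBI.java. It assumes that gaps are runs of 'N' or 'n', that the threshold is strict (longer than 30,000), and that gap coordinates are 1-based and inclusive. The function names are hypothetical, and each sequence is held in memory for simplicity.

```python
import re

MIN_GAP = 30000  # threshold stated above (hard-coded in ConvertNCBI)

def rename_header(header, chr_number):
    """Build '>ChrN  <accession>' from an NCBI header, numbering by file order."""
    accession = header[1:].split()[0]
    return ">Chr{}  {}".format(chr_number, accession)

def find_gaps(sequence, min_gap=MIN_GAP):
    """Return (start, end) of N runs longer than min_gap (1-based, inclusive)."""
    return [(m.start() + 1, m.end())
            for m in re.finditer(r"[Nn]+", sequence)
            if m.end() - m.start() > min_gap]

def convert_fasta(records):
    """records: iterable of (header, sequence) pairs in file order.

    Returns (renamed_records, gaps), where gaps maps 'ChrN' to its gap list.
    """
    renamed, gaps, count = [], {}, 0
    for header, sequence in records:
        if "chromosome" not in header:
            continue  # default behaviour: skip (no -l or -s flag)
        count += 1
        renamed.append((rename_header(header, count), sequence))
        gaps["Chr{}".format(count)] = find_gaps(sequence)
    return renamed, gaps
```

Because the numbering comes from a counter rather than from the header text, a file whose chromosomes are out of order would be mislabeled, which is the caveat noted in step 2.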
Reads the file ending in 'gff.gz' (or .gff) and writes two new files called annotation/gene.gff and annotation/exon.gff, as follows:
- gene.gff:
  1. Only lines with type 'gene' and attribute 'gene_biotype=protein_coding' are processed.
  2. Only lines with type 'mRNA' that have an accepted gene parent line are processed.
  3. Only lines with type 'exon' that have an accepted mRNA parent line, where the mRNA is the first for the gene parent, are processed.
  4. The gene line is written to the gene.gff file with the following changes:
    1. The first column 'seqid' is replaced with the 'ChrN' value assigned when reading the '.fna' file.
    2. The last column 'attributes' contains "ID=gene-<id>", "ID=rna-<id>" and the "product=" values from its mRNA lines. The gene attribute "Name=" is included if it is not a substring of the gene-ID.
    3. The product value is created as follows:
      1. If there are multiple mRNA lines for a gene where the values are different, they are concatenated together.
      2. If there are multiple mRNA lines for a gene where the only difference is the variant, then only the variant difference is shown, e.g.
        product=monocopper oxidase-like protein SKU5%2C transcript variant X2, X1, X3
- exon.gff:
  1. The exon line is written to the exon.gff file with the first column 'seqid' replaced by the 'ChrN' value assigned when reading the '.fna' file.
  2. The ID and gene attributes are the only two keywords used in the attributes column. These are not used by SyMAP but are useful for verification.
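
The selection rules and the product merging described above can be sketched in Python as follows. This is an approximation for illustration, not the logic of ConvertNCBI.java. It assumes that each parent line precedes its children in the file, and that NCBI GFF3 files spell the biotype attribute as gene_biotype=protein_coding. The "; " separator used when products differ is also an assumption, since the text does not state what the script uses. All function names are hypothetical.

```python
VARIANT_TAG = "%2C transcript variant "

def parse_attributes(column9):
    """Turn a GFF3 attributes column such as 'ID=a;Parent=b' into a dict."""
    pairs = (item.split("=", 1) for item in column9.split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}

def select_features(gff_lines):
    """Apply rules 1-3: protein-coding genes, their mRNAs, first-mRNA exons."""
    genes, mrna_gene, first_mrna, exons = {}, {}, {}, []
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue
        ftype, attrs = cols[2], parse_attributes(cols[8])
        parent = attrs.get("Parent")
        if ftype == "gene" and attrs.get("gene_biotype") == "protein_coding":
            genes[attrs["ID"]] = []  # product values collected from mRNA lines
        elif ftype == "mRNA" and parent in genes:
            mrna_gene[attrs["ID"]] = parent
            first_mrna.setdefault(parent, attrs["ID"])
            genes[parent].append(attrs.get("product", ""))
        elif ftype == "exon" and parent in mrna_gene:
            if first_mrna[mrna_gene[parent]] == parent:
                exons.append(cols)
    return genes, exons

def merge_products(products, separator="; "):
    """Combine the product= values of one gene's mRNA lines."""
    unique = list(dict.fromkeys(products))  # drop repeats, keep first-seen order
    if len(unique) <= 1:
        return unique[0] if unique else ""
    parts = [p.split(VARIANT_TAG) for p in unique]
    if all(len(p) == 2 and p[0] == parts[0][0] for p in parts):
        # Only the variant differs: write the shared text once, then each variant.
        return parts[0][0] + VARIANT_TAG + ", ".join(p[1] for p in parts)
    return separator.join(unique)  # separator is an assumption
```

With the three SKU5 variants from the example above (X2, X1, X3), `merge_products` returns the same `product=` value shown there.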

Scaffolds

By default, the ConvertNCBI script creates the genomic.fna file with only the chromosomes. However, you can have it also include the scaffolds by using the "-s" flag, e.g.

java ConvertNCBI rice -s

This will include all chromosomes (prefix 'c') and scaffolds (prefix 's') in the genomic.fna file. Beware, there can be many tiny scaffolds. If they are all aligned in SyMAP, the display becomes very cluttered. Hence, it is best to align just the largest ones first (e.g. the longest 50), merge them if possible, then try the smaller ones. You should set the following SyMAP project parameters:

grp_prefix needs to be blank as there is no common prefix now.
min_size should be set to only load the largest scaffolds. To determine the value to use, run the lenFasta.pl script, e.g. from the seq directory and using rice as an example:
```
cp ../../scripts/lenFasta.pl .
perl lenFasta.pl rice/sequence/genomic.fna
rm lenFasta.pl		# do NOT leave this script in the sequence directory
```

As of 30-Jan-22 (GCF_001433935.1_IRGSP-1.0_genomic.fna), rice has 58 sequences, of which 12 are chromosomes, 43 are scaffolds and 3 are other. The script prints all the sorted lengths, followed by this table:

Read genomic.fna and print sorted lengths
Read 55 sequences

Lengths:
   1 43270923
   2 36413819
...
Values for min_len (assuming no duplicate lengths):
#Seqs  min_len
   10 27531856
   20    19457
   30    11447
   40    10311
   50     7140

To align the top 30 sequences (12 chromosomes, 18 of the largest scaffolds), this says to set min_size to 11447.
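
The same calculation can be sketched in Python for readers who want to see how the table is derived. This is not a translation of lenFasta.pl. The function names are hypothetical, and it assumes that SyMAP loads sequences whose length is at least min_size and that no two sequences share a length.

```python
def fasta_lengths(lines):
    """Return the length of every sequence in FASTA-formatted lines."""
    lengths = []
    for line in lines:
        if line.startswith(">"):
            lengths.append(0)       # start a new sequence
        elif lengths:
            lengths[-1] += len(line.strip())
    return lengths

def min_size_for_top(lengths, n):
    """Length of the n-th longest sequence, usable as min_size to load the top n."""
    ranked = sorted(lengths, reverse=True)
    if not 1 <= n <= len(ranked):
        raise ValueError("n must be between 1 and the number of sequences")
    return ranked[n - 1]

# Example call from the seq directory:
#   with open("rice/sequence/genomic.fna") as handle:
#       print(min_size_for_top(fasta_lengths(handle), 30))
```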

Editing the script

This script was used to build the 2020 syntenies from the NCBI genome and annotation files, which can be viewed at symapdb3 (the applets are obsolete).

However, you may want to make changes, such as which attributes are included. Therefore, the ConvertNCBI.java code is supplied in the scripts directory. It is simply written: it does not use external libraries and uses only common programming techniques.

Once you make your changes, execute:

javac ConvertNCBI.java

You will need to have JDK installed to use the 'javac' command.

Email Comments To: symap@agcol.arizona.edu