This document describes the TCW assembly process,
which is run via the runSingleTCW interface or the script execAssm using the parameter file
projects/<project-name>/sTCW.cfg.
Note that TCW cannot assemble raw RNA-seq reads. Rather, assembly in TCW serves primarily the following purposes:
- Assemble ESTs or 454 reads.
- Assemble multiple transcripts libraries with optional count data.
- Assemble transcript library(s) with optional count data with ESTs.
Contents
Assembly of the Demo Project
This demo uses the project demoAsm. The assembly process uses blast and cap3 (see
Installation).
IMPORTANT NOTE: The Blast path MUST be defined in the HOSTS.cfg using the blast_path parameter;
see Install.
Start by running runSingleTCW and selecting project demoAsm, giving the
interface shown on the right.
The project has two libraries, Illumina (an Illumina RNA-seq library) and Sanger
(an EST Sanger library). The Illumina transcripts have read counts from two
count libraries, tip and zone, reflecting two different tissues of the rhizome.
Press the Step 1. Build Database button to create the database and load the data.
Then set #CPUs to the number of processors your machine can spare for the assembly
(one or two are sufficient for this project).
Press Step 2. Instantiate to start the assembly. Make sure the Skip Assembly is NOT checked.
When done, it prints a summary (see below).
|
|
Final assembly summary
Below is the summary (the numbers can be slightly different due to a different machine, #CPUs, and blast version).
>>>Assembly Statistics 27-Mar-16 19:31:29
DATASET #SEQS #SINGLETONS #BURIED
Illumina 112 78 (69%) 13 (11%)
Sanger 98 7 (7%) 1 (1%)
Total reads: 210
Total buried: 14 Initial buries: 14 Buried during assembly: 0
Contig sizes (#reads)
Counts =2 3-5 6-10 11=20 21-50 51-100 101-1000 >1000
#Contigs 6 12 4 1 1 0 0 0
Contig lengths (bp)
Length 1-100 101-500 501-1000 1001-2000 2001-3000 3001-4000 4001-5000 >5000
#Contigs 0 60 19 21 2 1 0 1
Total contigs: 104
Contigs(>1 seq): 24
Single mate-pair: 5
Singletons: 80
Finished in 1m:40s
Transcript with counts and EST libraries
When assembling transcripts with counts and EST libraries, a resulting contig with one or more transcript sequences
and one or more ESTs will add the transcript counts and add the aligned EST.
Choosing Read Names
Using consistent and well-chosen read names makes data analysis much easier, and
is essential for some aspects of TCW.
The name of the read is the string immediately following the ">" in the fasta file.
For example, if your fasta file contains the lines
>ZM_BFa0001A01.f
AAGATCCGCCTCATTCACACCCCCATCTACCTAGCTAGCTAGTTTACCAAAAAAAAATCTGGCCACA
GGGATGCGGTGGCGGCTGCAGCCGGCGCCGGCGCCGACGCTGCTCCTCGTCCTGCTGGTG
>ZM_BFa0001A01.r
AAAAAGCAAAATACAAACCAAGCTCCAGTTCCAATACATTACTCTAGCACAAGCTTTCAG
CACATTACAAAGTAGGAACCAAGACCACCCAAGCTCCAATCACACTACAATTCATCACCA
then the two read names are ZM_BFa0001A01.f and ZM_BFa0001A01.r.
Naming guidelines:
- length/characters: Keep read names under 25 characters, using only letters, numbers, and underscores.
- uniqueness: No two reads in a TCW database may have the same name.
- prefixes: Use the library name as the read prefix (e.g. ZM_BFa in this example).
This makes it much easier to study the assembled contigs where different
libraries are mixed up.
- For 454 data, the names are meaningless to the typical user, hence, the reads should be
renamed with the library name followed by consecutive numbers.
- mate-pair suffixes: If your read contain 5'/3' mate pairs, indicate
this with suffixes (e.g. ".r", ".f" in this example). The suffixes must be absolutely
consistent within a library. If some read have ".r", while others have just "r", or
if some have ".r" meaning 3', while for others it means 5', then TCW cannot use the mate pair
information to improve the assembly.
The Assembly Process
Following are the main stages of TCW assembly, organized by their headings
which print to the screen. The sample durations
are for a 700k read assembly using 6, 2.4 Ghz CPUs and the default settings.
Section Heading | What TCW is doing | Sample duration |
>>>Delete previous assembly |
There was a previous assembly of the same name, which you have selected to delete and
start over. This can take a while for big projects!! | 2h |
>>>Initial bury alignment |
TCW sets aside ("buries") reads which are nearly identical to another read, in order to reduce
redundant assembly effort. It runs a "self-blast" of all the reads against
each other, using Megablast. It then parses the output and saves the buries to the database. | 1h 10m |
>>>Compute cliques |
A clique is a special type of cluster1. TCW groups the reads into cliques and builds the
initial contigs from them. To do this it again runs Megablast of the reads against
each other, this time using only the non-buried reads. | 1h 10m |
>>>Clique assembly |
Each clique is given to cap3 to assemble. Any leftover read are made into singleton "contigs". | 2h |
>>>Clique cap buries |
The initial contigs from clique assembly are now analyzed for additional reads to bury. Any
read lying in a region of 5x or greater coverage may be "cap-buried" in another read whose
span in the contig is close enough. This is controlled by sTCW.cfg parameters CAP_BURY_MIN_DEPTH
and CAP_BURY_MAX_HANG. For a large project, often over 50% of the clones will be buried
by the end of this stage. | 10m |
>>>Contig merge rounds |
Now TCW goes through the contig merge rounds specified by the "TC"2 parameters in sTCW.cfg. For
each round it writes out the current contig consensus sequences, blasts them against each
other, and attempts to merge each overlapping pair. | 9h |
>>>Finalize contigs |
Mate-pair contigs are joined together by N's. All buried reads are collected and assigned to
their correct final contig. Each read is re-aligned to the consensus sequence of its contig,
and the SNPs and extras are identified. Suspect contigs are flagged. | 1h 10m |
1 In a clique, each read must have an overlap with all reads in the clique.
2 In a TC (transitive closure), each contig must have an overlap with at least one contig in the TC.
Assembly Parameters
Note, the default parameters have been extensively tested and you will probably not want to change them.
Most of the parameters for assembly are available for change through the runSingleTCW interface (press
the Options button in the Assembly section).
Calculation of SNPs and extras
The parameters listed in the following can be set in the LIB.cfg file with an editor.
A SNP is possible when one or more read have a different base at some location than is found in the consensus.
However, base-calling error can lead to many false positives, so TCW applies two screens to the possible
SNPs. First, at least two reads must contain the SNP (you can change this with the SNP_CONFIRM parameter).
Also, a probability score is applied. The probability ('p-value') is computed using a binomial score based on the number of
confirming reads, the depth of the contig at that base, and the estimated basecall error rate.
The error rate is estimated from mismatches seen in the clique assembly, or it can be
set using BASECALL_ERROR_RATE. The p-value threshold can also be set using SNP_SCORE.
When there are extra bases in some reads which are not in the consensus sequence
generated by cap3, TCW uses another probability score to determine whether to regard the extras as "real"
and add a pad character (*) to the consensus. The score is computed in the same way as for SNPs, and uses
the config parameters EXTRA_CONFIRM, EXTRA_RATE, EXTRA_SCORE. Extras not determined to be real are
stored in the database and shown in the UI.
Trouble Shooting
- Blast fails. Try running the blast executable from the command line, which will usually elucidate the problem.
IMPORTANT NOTE:
The full Blast path MUST be defined in the HOSTS.cfg using the blast_path parameter;
see Install.
- CAP3 fails.
- If you get an "java.io.IOException: Cannot run program", then the supplied CAP3 is
not compatible with your OS. Go to seq.cs.iastate.edu/cap3.html,
and download a CAP3 compatible with your systems and put it in /Ext/linux/CAP3 or /Ext/mac/CAP3 for Linux or Mac, respectively.
- On recent Mac OS (e.g. Catalina), external programs that are not registered with Apple will not automatically run.
To fix this, see External program.
- Interruptions.
An assembly lasting multiple days can be interrupted for numerous reasons, e.g. running
out of memory, losing connection to the database, or having the system reboot. In most
cases this is not a problem and the assembly can be restarted to resume where it left
off. It will check its database for consistency, and if the assembly continues successfully,
then it should be fine, while if there are
errors then it should be restarted from the beginning.
- Clone assigned to multiple contigs. This happens sometime when a assembly has been restarted. Unfortunately,
you need to delete the database and start again.
- Crashes. If the assembly crashes, it will usually write the Java exception error into a
file stcw.error1.log. This is information that we can use to debug and fix the problem.
- For other problems, see Trouble Shooting.
For assembly, the database must support Innodb tables
TCW checks this using the "show engines" command in MySQL. If the Innodb
engine is not listed as supported, this error is shown; however, you can
still perform all TCW functions except for assembly.
The most common cause of this problem is a mismatch in the innodb log
file size. The MySQL error log will contain messages like
InnoDB: Error: log file ./ib_logfile0 is of different size 0 5242880 bytes
InnoDB: than specified in the .cnf file 0 104857600 bytes!
Solution is to delete this log file and restart MySQL.
Doing an assembly, the database is very slow
The default parameters of MySQL are not suitable for large high-performance
databases. Especially, the innodb_buffer_pool_size must be increased.
100M is sufficient for one large project, but for many large projects it should
be 1G at minimum. For more see
http://dev.mysql.com/doc/refman/5.0/en/innodb-buffer-pool.html.
Note, this only affects usage during an assembly, when InnoDB tables are used.
|