Practical section: Genome annotation¶

Genome assembly and annotation workshop¶

Kasetsart University, Kamphaeng Saen Campus, August 6-9, 2018¶

Authors: Uthaipaisanwong P.

Last update: August 8, 2018

Content¶

1 Prerequisties

1.1 Upload the assembled sequence to your home directory

2 Annotation of coding regions

2.1 ab initio gene prediction
2.2 Homology-based functional annotation

3 Annotation of non-coding regions

3.1 tRNA
3.2 rRNA
3.3 microRNA

4 Genome component

4.1 misa
4.2 RepeatMasker

5 Convert result to GFF3 file format

6 CAB-Inhouse annotation pipeline (CABAnnot)

7 Genome visualization by JBrowse

1. Prerequisites¶

Augustus software must be installed on the system. (Currently used version is 3.3.1)
BLAST software must be installed on the system. (Currently used version is 2.7.0+)
RepeatMasker software must be installed on the system. (Currently used version is 4.0.7)
tRNAScan-SE software must be installed on the system. (Currently used version is 1.3.1)
misa software must be installed on the system. (Currently used version is on September 27, 2010)
miRanda software must be installed on the system. (Currently used version is 3.3a)

For participants, these software are already installed on our CAB server (158.108.144.7). You can just access the program on the CAB server (AgCipher) using the command line (shell).

1.1 Upload the assembled sequence to your home directory¶

Access CAB server (AgCipher)

user: annotest password: anno@cab18

ssh annotest@158.108.144.7

password: anno@cab18

input = "/home/annotest"

cd {input}

/home/annotest

On the CAB server, please check current working directory

pwd

'/home/annotest'

Make directory test

mkdir test

cd test

/home/annotest/test

Download assembled sequences for annotation from

/home/annotest/AnnotationData

cp /home/annotest/AnnotationData/contigs_nc.fasta .

ls -al

total 404
drwxr-xr-x  2 annotest annotest     37 Aug  7 17:38 ./
drwx--x--x 12 annotest annotest   4096 Aug  7 17:32 ../
-rwxr-xr-x  1 annotest annotest 408376 Aug  7 17:38 contigs_nc.fasta*

2. Annotation of coding regions¶

2.1 Gene perdiction based on ab initio method with augustus software¶

%%bash
/share/apps/augustus-3.3.1/bin/augustus

AUGUSTUS (3.3.1) is a gene prediction tool
written by M. Stanke, O. Keller, S. König, L. Gerischer and L. Romoth.

usage:
augustus [parameters] --species=SPECIES queryfilename

'queryfilename' is the filename (including relative path) to the file containing the query sequence(s)
in fasta format.

SPECIES is an identifier for the species. Use --species=help to see a list.

parameters:
--strand=both, --strand=forward or --strand=backward
--genemodel=partial, --genemodel=intronless, --genemodel=complete, --genemodel=atleastone or --genemodel=exactlyone
  partial      : allow prediction of incomplete genes at the sequence boundaries (default)
  intronless   : only predict single-exon genes like in prokaryotes and some eukaryotes
  complete     : only predict complete genes
  atleastone   : predict at least one complete gene
  exactlyone   : predict exactly one complete gene
--singlestrand=true
  predict genes independently on each strand, allow overlapping genes on opposite strands
  This option is turned off by default.
--hintsfile=hintsfilename
  When this option is used the prediction considering hints (extrinsic information) is turned on.
  hintsfilename contains the hints in gff format.
--AUGUSTUS_CONFIG_PATH=path
  path to config directory (if not specified as environment variable)
--alternatives-from-evidence=true/false
  report alternative transcripts when they are suggested by hints
--alternatives-from-sampling=true/false
  report alternative transcripts generated through probabilistic sampling
--sample=n
--minexonintronprob=p
--minmeanexonintronprob=p
--maxtracks=n
  For a description of these parameters see section 4 of README.TXT.
--proteinprofile=filename
  When this option is used the prediction will consider the protein profile provided as parameter.
  The protein profile extension is described in section 7 of README.TXT.
--progress=true
  show a progressmeter
--gff3=on/off
  output in gff3 format
--predictionStart=A, --predictionEnd=B
  A and B define the range of the sequence for which predictions should be found.
--UTR=on/off
  predict the untranslated regions in addition to the coding sequence. This currently works only for a subset of species.
--noInFrameStop=true/false
  Do not report transcripts with in-frame stop codons. Otherwise, intron-spanning stop codons could occur. Default: false
--noprediction=true/false
  If true and input is in genbank format, no prediction is made. Useful for getting the annotated protein sequences.
--uniqueGeneId=true/false
  If true, output gene identifyers like this: seqname.gN

For a complete list of parameters, type "augustus --paramlist".
An exhaustive description can be found in the file README.TXT.

%%bash
/share/apps/augustus-3.3.1/bin/augustus --genemodel=partial --protein=on --introns=on --start=on --stop=on --codingseq=on --gff3=on --UTR=on --uniqueGeneId=true --species=tomato  /home/annotest/test/contigs_nc.fasta > /home/annotest/test/contigs_nc.gff

list files

ls -al

total 544
drwxrwxrwx  2 annotest    annotest        62 Aug  7 17:39 ./
drwx--x--x 12 annotest    annotest      4096 Aug  7 17:32 ../
-rwxr-xr-x  1 annotest    annotest    408376 Aug  7 17:38 contigs_nc.fasta*
-rw-rw-r--  1 jupyterhub1 jupyterhub1 139838 Aug  7 17:40 contigs_nc.gff

more contigs_nc.gff

%%bash
perl /share/apps/augustus-3.3.1/scripts/getAnnoFasta.pl /home/annotest/test/contigs_nc.gff --seqfile=/home/annotest/test/contigs_nc.fasta

Read in 30 sequence(s) from /home/annotest/test/contigs_nc.fasta.

ls -al

total 724
drwxrwxrwx  2 annotest    annotest      4096 Aug  7 17:41 ./
drwx--x--x 12 annotest    annotest      4096 Aug  7 17:32 ../
-rw-rw-r--  1 jupyterhub1 jupyterhub1  16160 Aug  7 17:41 contigs_nc.aa
-rw-rw-r--  1 jupyterhub1 jupyterhub1  50981 Aug  7 17:41 contigs_nc.cdsexons
-rw-rw-r--  1 jupyterhub1 jupyterhub1  47348 Aug  7 17:41 contigs_nc.codingseq
-rwxr-xr-x  1 annotest    annotest    408376 Aug  7 17:38 contigs_nc.fasta*
-rw-rw-r--  1 jupyterhub1 jupyterhub1 139838 Aug  7 17:40 contigs_nc.gff
-rw-rw-r--  1 jupyterhub1 jupyterhub1  60302 Aug  7 17:41 contigs_nc.mrna

more contigs_nc.codingseq

Annotate function¶

BLAST coding sequences against viridiplantae_prot database

program: blastx
database: viridiplantae_prot
query: contigs_nc.codingseq
output format: outfmt=6

%%bash
/share/apps/blast/bin/blastx -db /home/pichapuk/DATABASE/PLANT/viridiplantae_prot -outfmt "6 qseqid sseqid pident evalue bitscore length mismatch gapopen frames qstart qend qlen sstart send slen sstrand stitle" -evalue 1e-5 -num_threads 32 -max_target_seqs 10 -out /home/annotest/test/contigs_nc.codingseq.viridiplantae_prot.6.blast -query /home/annotest/test/contigs_nc.codingseq

ls -al

total 832
drwxrwxrwx  2 annotest    annotest      4096 Aug  7 17:45 ./
drwx--x--x 12 annotest    annotest      4096 Aug  7 17:32 ../
-rw-rw-r--  1 jupyterhub1 jupyterhub1  16160 Aug  7 17:41 contigs_nc.aa
-rw-rw-r--  1 jupyterhub1 jupyterhub1  50981 Aug  7 17:41 contigs_nc.cdsexons
-rw-rw-r--  1 jupyterhub1 jupyterhub1  47348 Aug  7 17:41 contigs_nc.codingseq
-rw-rw-r--  1 jupyterhub1 jupyterhub1 108242 Aug  7 17:46 contigs_nc.codingseq.viridiplantae_prot.6.blast
-rwxr-xr-x  1 annotest    annotest    408376 Aug  7 17:38 contigs_nc.fasta*
-rw-rw-r--  1 jupyterhub1 jupyterhub1 139838 Aug  7 17:40 contigs_nc.gff
-rw-rw-r--  1 jupyterhub1 jupyterhub1  60302 Aug  7 17:41 contigs_nc.mrna

more contigs_nc.codingseq.viridiplantae_prot.6.blast

2.2 Gene prediction based on homology based method¶

BLAST assembled sequences against viridiplantae_prot database (outfmt=0)

program: blastx
database: viridiplantae_prot
query: contigs_nc.fasta
output format: outfmt=0

%%bash
#/share/apps/blast/bin/blastx -db /home/pichapuk/DATABASE/PLANT/viridiplantae_prot -outfmt 0 -evalue 1e-60 -max_hsps 10 -num_threads 32  -out /home/annotest/test/contigs_nc.fasta.viridiplantae_prot.0.blast -query /home/annotest/test/contigs_nc.fasta

ls -lt

total 5400
-rw-rw-r-- 1 jupyterhub1 jupyterhub1 4684526 Aug  7 17:52 contigs_nc.fasta.viridiplantae_prot.0.blast
-rw-rw-r-- 1 jupyterhub1 jupyterhub1  108242 Aug  7 17:46 contigs_nc.codingseq.viridiplantae_prot.6.blast
-rw-rw-r-- 1 jupyterhub1 jupyterhub1   16160 Aug  7 17:41 contigs_nc.aa
-rw-rw-r-- 1 jupyterhub1 jupyterhub1   50981 Aug  7 17:41 contigs_nc.cdsexons
-rw-rw-r-- 1 jupyterhub1 jupyterhub1   47348 Aug  7 17:41 contigs_nc.codingseq
-rw-rw-r-- 1 jupyterhub1 jupyterhub1   60302 Aug  7 17:41 contigs_nc.mrna
-rw-rw-r-- 1 jupyterhub1 jupyterhub1  139838 Aug  7 17:40 contigs_nc.gff
-rwxr-xr-x 1 annotest    annotest     408376 Aug  7 17:38 contigs_nc.fasta*

more contigs_nc.fasta.viridiplantae_prot.0.blast

3. Annotation of non-coding regions¶

3.1 Annotate tRNA with tRNAScan-SE software¶

%%bash
tRNAscan-SE -h

Usage: tRNAscan-SE [-options] <FASTA file(s)>

  Scan a sequence file for tRNAs using tRNAscan, EufindtRNA &
   tRNA covariance models
   -- defaults to use with eukaryotic sequences 
      (use 'Search Mode Options' below to scan other types of sequences)

Search Mode Options:

  -B  --bact            : search for bacterial tRNAs (use bacterial tRNA model)
  -A  --arch            : search for archaeal tRNAs (use archaeal tRNA model)
  -O  --organ           : search for organellar (mitochondrial/chloroplast) tRNAs
  -G  --general         : use general tRNA model (cytoplasmic tRNAs from all 3 domains included)

  -C  --cove            : search using covariance model analysis only (max sensitivity, slow)
  -i  --infernal        : search using Infernal cm analysis only (max sensitivity, very slow)
      --newscan         : search using Infernal and new cm models instead of Cove
  -H  --breakdown       : show breakdown of primary and secondary structure components to
                            covariance model bit scores
  -D  --nopseudo        : disable pseudogene checking

Achaeal-specific options:

      --ncintron        : scan for noncanonical introns
      --frag <file>     : scan for putative tRNA gene fragments that may form split tRNAs
                            and save results in <file>

Output options:

  -o  --output <file>   : save final results in <file>
  -f  --struct <file>   : save tRNA secondary structures to <file>
  -a  --acedb           : output results in ACeDB output format instead of default
                            tabular format
  -m  --stats <file>    : save statistics summary for run in <file>
                            (speed, # tRNAs found in each part of search, etc)

  -d  --progress        : display program progress messages
  -l  --log <file>      : save log of program progress in <file>

  -q  --quiet           : quiet mode (credits & run option selections suppressed)
  -b  --brief           : brief output format (no column headers)

  -N  --codons          : output corresponding codons instead of tRNA anticodons

  -? #                 : '#' in place of <file> chooses default name for output files
  -p  --prefix <label>  : use <label> prefix for all default output file names

  -y  --hitsrc          : show origin of first-pass hits (Ts=tRNAscan 1.4,
                            Eu=EufindtRNA, Bo= Both)

Specify Alternate Cutoffs / Data Files:

  -X  --score <score>   : set cutoff score (in bits) for reporting tRNAs (default=20)
  -L  --len <length>    : set max length of tRNA intron+variable region (default=116bp)

  -I  --iscore <score>  : manually set "intermediate" cutoff score for EufindtRNA
  -z  --pad <number>    : use <number> nucleotides padding when passing first-pass
                            tRNA bounds predictions to CM analysis (default=8)

  -g  --gencode <file>  : use alternate genetic codes specified in <file> for
                            determining tRNA type
  -c  --cm <file>       : use an alternate covariance model in <file>

Misc Options:

  -h  --help            : print this help message
  -Q  --forceow         : do not prompt user before overwriting pre-existing
                            result files  (for batch processing)

  -n  --match <EXPR>    : search only sequences with names matching <EXPR> string
                            (<EXPR> may contain * or ? wildcard chars)
  -s  --search <EXPR>   : start search at sequence with name matching <EXPR> string
                            and continue to end of input sequence file(s)
Special Options (for testing & special purposes)

  -T  --tscan           : search using tRNAscan only (defaults to strict params)
  -t  --tmode <mode>    : explicitly set tRNAscan params, where <mode>=R or S
                            (R=relaxed, S=strict tRNAscan v1.3 params)

  -E  --eufind          : search using Eukaryotic tRNA finder (EufindtRNA) only
                            (defaults to Normal seach parameters when run alone,
                            or to Relaxed search params when run with Cove)
  -e  --emode <mode>    : explicitly set EufindtRNA params, where <mode>=R, N, or S
                            (relaxed, normal, or strict)

  -r  --fsres <file>    : save first-pass scan results from EufindtRNA and/or
                            tRNAscan in <file> in tabular results format
  -u  --filter <file>   : search with Cove only those sequences & regions delimited
                            in <file> (tabular results file format)
  -F  --falsepos <file> : save first-pass candidate tRNAs in <file> that were then
                            found to be false positives by Cove analysis
  -M  --missed <file>   : save all seqs that do NOT have at least one
                            tRNA prediction in them (aka "missed" seqs)
  -v  --verbose <file>  : save verbose tRNAscan 1.3 output to <file>
  -V  --version <vers>  : run an alternate version of tRNAscan
                            where <vers> = 1.3, 1.39, 1.4 (default), or 2.0
      --nomerge         : Keep redundant tRNAscan 1.3 hits (don't filter out multiple
                            predictions per tRNA identification)

tRNAscan-SE 1.3.1 (January 2012)

  Please cite: 
	Lowe, T.M. & Eddy, S.R. (1997) "tRNAscan-SE: A program for
	improved detection of transfer RNA genes in genomic sequence"
	Nucl. Acids Res. 25: 955-964.

  This program uses a modified, optimized version of tRNAscan v1.3
  (Fichant & Burks, J. Mol. Biol. 1991, 220: 659-671),
  a new implementation of a multistep weight matrix algorithm
  for identification of eukaryotic tRNA promoter regions
  (Pavesi et al., Nucl. Acids Res. 1994, 22: 1247-1256),
  as well as the RNA covariance analysis package Cove v.2.4.2
  (Eddy & Durbin, Nucl. Acids Res. 1994, 22: 2079-2088).

%%bash
tRNAscan-SE -o /home/annotest/test/contigs_nc.tRNA -f /home/annotest/test/contigs_nc.tRNACove -m /home/annotest/test/contigs_nc.tRNAStat /home/annotest/test/contigs_nc.fasta

tRNAscan-SE v.1.3.1 (January 2012) - scan sequences for transfer RNAs

  Please cite: 
	Lowe, T.M. & Eddy, S.R. (1997) "tRNAscan-SE: A program for
	improved detection of transfer RNA genes in genomic sequence"
	Nucl. Acids Res. 25: 955-964.

  This program uses a modified, optimized version of tRNAscan v1.3
  (Fichant & Burks, J. Mol. Biol. 1991, 220: 659-671),
  a new implementation of a multistep weight matrix algorithm
  for identification of eukaryotic tRNA promoter regions
  (Pavesi et al., Nucl. Acids Res. 1994, 22: 1247-1256),
  as well as the RNA covariance analysis package Cove v.2.4.2
  (Eddy & Durbin, Nucl. Acids Res. 1994, 22: 2079-2088).

------------------------------------------------------------
Sequence file(s) to search:  /home/annotest/test/contigs_nc.fasta
Search Mode:                 Eukaryotic
Results written to:          /home/annotest/test/contigs_nc.tRNA
Output format:               Tabular
Searching with:              tRNAscan + EufindtRNA -> Cove
Covariance model:            TRNA2-euk.cm
tRNAscan parameters:         Strict
EufindtRNA parameters:       Relaxed (Int Cutoff= -32.1)
tRNA secondary structure
    predictions saved to:    /home/annotest/test/contigs_nc.tRNACove
Search log saved in:         
Search statistics saved in:  /home/annotest/test/contigs_nc.tRNAStat
------------------------------------------------------------

ls -lt

total 5412
-rw-rw-r-- 1 jupyterhub1 jupyterhub1    3459 Aug  7 17:59 contigs_nc.tRNAStat
-rw-rw-r-- 1 jupyterhub1 jupyterhub1     277 Aug  7 17:59 contigs_nc.tRNA
-rw-rw-r-- 1 jupyterhub1 jupyterhub1     683 Aug  7 17:59 contigs_nc.tRNACove
-rw-rw-r-- 1 jupyterhub1 jupyterhub1 4684526 Aug  7 17:52 contigs_nc.fasta.viridiplantae_prot.0.blast
-rw-rw-r-- 1 jupyterhub1 jupyterhub1  108242 Aug  7 17:46 contigs_nc.codingseq.viridiplantae_prot.6.blast
-rw-rw-r-- 1 jupyterhub1 jupyterhub1   16160 Aug  7 17:41 contigs_nc.aa
-rw-rw-r-- 1 jupyterhub1 jupyterhub1   50981 Aug  7 17:41 contigs_nc.cdsexons
-rw-rw-r-- 1 jupyterhub1 jupyterhub1   47348 Aug  7 17:41 contigs_nc.codingseq
-rw-rw-r-- 1 jupyterhub1 jupyterhub1   60302 Aug  7 17:41 contigs_nc.mrna
-rw-rw-r-- 1 jupyterhub1 jupyterhub1  139838 Aug  7 17:40 contigs_nc.gff
-rwxr-xr-x 1 annotest    annotest     408376 Aug  7 17:38 contigs_nc.fasta*

more contigs_nc.tRNA

more contigs_nc.tRNAStat

3.2 Annotate rRNA with BLAST software¶

BLAST assembled sequences against plant ribosomal RNA database (NCBI; August 2018)

program: blastn
database: rRNA_palnt_ncbi_edit
query: contigs_nc.fasta
output format: outfmt=6

%%bash
/share/apps/blast/bin/blastn -help

USAGE
  blastn [-h] [-help] [-import_search_strategy filename]
    [-export_search_strategy filename] [-task task_name] [-db database_name]
    [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
    [-negative_gilist filename] [-negative_seqidlist filename]
    [-entrez_query entrez_query] [-db_soft_mask filtering_algorithm]
    [-db_hard_mask filtering_algorithm] [-subject subject_input_file]
    [-subject_loc range] [-query input_file] [-out output_file]
    [-evalue evalue] [-word_size int_value] [-gapopen open_penalty]
    [-gapextend extend_penalty] [-perc_identity float_value]
    [-qcov_hsp_perc float_value] [-max_hsps int_value]
    [-xdrop_ungap float_value] [-xdrop_gap float_value]
    [-xdrop_gap_final float_value] [-searchsp int_value]
    [-sum_stats bool_value] [-penalty penalty] [-reward reward] [-no_greedy]
    [-min_raw_gapped_score int_value] [-template_type type]
    [-template_length int_value] [-dust DUST_options]
    [-filtering_db filtering_database]
    [-window_masker_taxid window_masker_taxid]
    [-window_masker_db window_masker_db] [-soft_masking soft_masking]
    [-ungapped] [-culling_limit int_value] [-best_hit_overhang float_value]
    [-best_hit_score_edge float_value] [-window_size int_value]
    [-off_diagonal_range int_value] [-use_index boolean] [-index_name string]
    [-lcase_masking] [-query_loc range] [-strand strand] [-parse_deflines]
    [-outfmt format] [-show_gis] [-num_descriptions int_value]
    [-num_alignments int_value] [-line_length line_length] [-html]
    [-max_target_seqs num_sequences] [-num_threads int_value] [-remote]
    [-version]

DESCRIPTION
   Nucleotide-Nucleotide BLAST 2.7.1+

OPTIONAL ARGUMENTS
 -h
   Print USAGE and DESCRIPTION;  ignore all other parameters
 -help
   Print USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters
 -version
   Print version number;  ignore other arguments

 *** Input query options
 -query <File_In>
   Input file name
   Default = `-'
 -query_loc <String>
   Location on the query sequence in 1-based offsets (Format: start-stop)
 -strand <String, `both', `minus', `plus'>
   Query strand(s) to search against database/subject
   Default = `both'

 *** General search options
 -task <String, Permissible values: 'blastn' 'blastn-short' 'dc-megablast'
                'megablast' 'rmblastn' >
   Task to execute
   Default = `megablast'
 -db <String>
   BLAST database name
    * Incompatible with:  subject, subject_loc
 -out <File_Out>
   Output file name
   Default = `-'
 -evalue <Real>
   Expectation value (E) threshold for saving hits 
   Default = `10'
 -word_size <Integer, >=4>
   Word size for wordfinder algorithm (length of best perfect match)
 -gapopen <Integer>
   Cost to open a gap
 -gapextend <Integer>
   Cost to extend a gap
 -penalty <Integer, <=0>
   Penalty for a nucleotide mismatch
 -reward <Integer, >=0>
   Reward for a nucleotide match
 -use_index <Boolean>
   Use MegaBLAST database index
   Default = `false'
 -index_name <String>
   MegaBLAST database index name (deprecated; use only for old style indices)

 *** BLAST-2-Sequences options
 -subject <File_In>
   Subject sequence(s) to search
    * Incompatible with:  db, gilist, seqidlist, negative_gilist,
   negative_seqidlist, db_soft_mask, db_hard_mask
 -subject_loc <String>
   Location on the subject sequence in 1-based offsets (Format: start-stop)
    * Incompatible with:  db, gilist, seqidlist, negative_gilist,
   negative_seqidlist, db_soft_mask, db_hard_mask, remote

 *** Formatting options
 -outfmt <String>
   alignment view options:
     0 = Pairwise,
     1 = Query-anchored showing identities,
     2 = Query-anchored no identities,
     3 = Flat query-anchored showing identities,
     4 = Flat query-anchored no identities,
     5 = BLAST XML,
     6 = Tabular,
     7 = Tabular with comment lines,
     8 = Seqalign (Text ASN.1),
     9 = Seqalign (Binary ASN.1),
    10 = Comma-separated values,
    11 = BLAST archive (ASN.1),
    12 = Seqalign (JSON),
    13 = Multiple-file BLAST JSON,
    14 = Multiple-file BLAST XML2,
    15 = Single-file BLAST JSON,
    16 = Single-file BLAST XML2,
    17 = Sequence Alignment/Map (SAM),
    18 = Organism Report
   
   Options 6, 7, 10 and 17 can be additionally configured to produce
   a custom format specified by space delimited format specifiers.
   The supported format specifiers for options 6, 7 and 10 are:
   	    qseqid means Query Seq-id
   	       qgi means Query GI
   	      qacc means Query accesion
   	   qaccver means Query accesion.version
   	      qlen means Query sequence length
   	    sseqid means Subject Seq-id
   	 sallseqid means All subject Seq-id(s), separated by a ';'
   	       sgi means Subject GI
   	    sallgi means All subject GIs
   	      sacc means Subject accession
   	   saccver means Subject accession.version
   	   sallacc means All subject accessions
   	      slen means Subject sequence length
   	    qstart means Start of alignment in query
   	      qend means End of alignment in query
   	    sstart means Start of alignment in subject
   	      send means End of alignment in subject
   	      qseq means Aligned part of query sequence
   	      sseq means Aligned part of subject sequence
   	    evalue means Expect value
   	  bitscore means Bit score
   	     score means Raw score
   	    length means Alignment length
   	    pident means Percentage of identical matches
   	    nident means Number of identical matches
   	  mismatch means Number of mismatches
   	  positive means Number of positive-scoring matches
   	   gapopen means Number of gap openings
   	      gaps means Total number of gaps
   	      ppos means Percentage of positive-scoring matches
   	    frames means Query and subject frames separated by a '/'
   	    qframe means Query frame
   	    sframe means Subject frame
   	      btop means Blast traceback operations (BTOP)
   	    staxid means Subject Taxonomy ID
   	  ssciname means Subject Scientific Name
   	  scomname means Subject Common Name
   	sblastname means Subject Blast Name
   	 sskingdom means Subject Super Kingdom
   	   staxids means unique Subject Taxonomy ID(s), separated by a ';'
   			 (in numerical order)
   	 sscinames means unique Subject Scientific Name(s), separated by a ';'
   	 scomnames means unique Subject Common Name(s), separated by a ';'
   	sblastnames means unique Subject Blast Name(s), separated by a ';'
   			 (in alphabetical order)
   	sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
   			 (in alphabetical order) 
   	    stitle means Subject Title
   	salltitles means All Subject Title(s), separated by a '<>'
   	   sstrand means Subject Strand
   	     qcovs means Query Coverage Per Subject
   	   qcovhsp means Query Coverage Per HSP
   	    qcovus means Query Coverage Per Unique Subject (blastn only)
   When not provided, the default value is:
   'qaccver saccver pident length mismatch gapopen qstart qend sstart send
   evalue bitscore', which is equivalent to the keyword 'std'
   The supported format specifier for option 17 is:
   	        SQ means Include Sequence Data
   	        SR means Subject as Reference Seq
   Default = `0'
 -show_gis
   Show NCBI GIs in deflines?
 -num_descriptions <Integer, >=0>
   Number of database sequences to show one-line descriptions for
   Not applicable for outfmt > 4
   Default = `500'
    * Incompatible with:  max_target_seqs
 -num_alignments <Integer, >=0>
   Number of database sequences to show alignments for
   Default = `250'
    * Incompatible with:  max_target_seqs
 -line_length <Integer, >=1>
   Line length for formatting alignments
   Not applicable for outfmt > 4
   Default = `60'
 -html
   Produce HTML output?

 *** Query filtering options
 -dust <String>
   Filter query sequence with DUST (Format: 'yes', 'level window linker', or
   'no' to disable)
   Default = `20 64 1'
 -filtering_db <String>
   BLAST database containing filtering elements (i.e.: repeats)
 -window_masker_taxid <Integer>
   Enable WindowMasker filtering using a Taxonomic ID
 -window_masker_db <String>
   Enable WindowMasker filtering using this repeats database.
 -soft_masking <Boolean>
   Apply filtering locations as soft masks
   Default = `true'
 -lcase_masking
   Use lower case filtering in query and subject sequence(s)?

 *** Restrict search or results
 -gilist <String>
   Restrict search of database to list of GI's
    * Incompatible with:  negative_gilist, seqidlist, negative_seqidlist,
   remote, subject, subject_loc
 -seqidlist <String>
   Restrict search of database to list of SeqId's
    * Incompatible with:  gilist, negative_gilist, negative_seqidlist, remote,
   subject, subject_loc
 -negative_gilist <String>
   Restrict search of database to everything except the listed GIs
    * Incompatible with:  gilist, seqidlist, remote, subject, subject_loc
 -negative_seqidlist <String>
   Restrict search of database to everything except the listed SeqIDs
    * Incompatible with:  gilist, seqidlist, remote, subject, subject_loc
 -entrez_query <String>
   Restrict search with the given Entrez query
    * Requires:  remote
 -db_soft_mask <String>
   Filtering algorithm ID to apply to the BLAST database as soft masking
    * Incompatible with:  db_hard_mask, subject, subject_loc
 -db_hard_mask <String>
   Filtering algorithm ID to apply to the BLAST database as hard masking
    * Incompatible with:  db_soft_mask, subject, subject_loc
 -perc_identity <Real, 0..100>
   Percent identity
 -qcov_hsp_perc <Real, 0..100>
   Percent query coverage per hsp
 -max_hsps <Integer, >=1>
   Set maximum number of HSPs per subject sequence to save for each query
 -culling_limit <Integer, >=0>
   If the query range of a hit is enveloped by that of at least this many
   higher-scoring hits, delete the hit
    * Incompatible with:  best_hit_overhang, best_hit_score_edge
 -best_hit_overhang <Real, (>0 and <0.5)>
   Best Hit algorithm overhang value (recommended value: 0.1)
    * Incompatible with:  culling_limit
 -best_hit_score_edge <Real, (>0 and <0.5)>
   Best Hit algorithm score edge value (recommended value: 0.1)
    * Incompatible with:  culling_limit
 -max_target_seqs <Integer, >=1>
   Maximum number of aligned sequences to keep 
   Not applicable for outfmt <= 4
   Default = `500'
    * Incompatible with:  num_descriptions, num_alignments

 *** Discontiguous MegaBLAST options
 -template_type <String, `coding', `coding_and_optimal', `optimal'>
   Discontiguous MegaBLAST template type
    * Requires:  template_length
 -template_length <Integer, Permissible values: '16' '18' '21' >
   Discontiguous MegaBLAST template length
    * Requires:  template_type

 *** Statistical options
 -dbsize <Int8>
   Effective length of the database 
 -searchsp <Int8, >=0>
   Effective length of the search space
 -sum_stats <Boolean>
   Use sum statistics

 *** Search strategy options
 -import_search_strategy <File_In>
   Search strategy to use
    * Incompatible with:  export_search_strategy
 -export_search_strategy <File_Out>
   File name to record the search strategy used
    * Incompatible with:  import_search_strategy

 *** Extension options
 -xdrop_ungap <Real>
   X-dropoff value (in bits) for ungapped extensions
 -xdrop_gap <Real>
   X-dropoff value (in bits) for preliminary gapped extensions
 -xdrop_gap_final <Real>
   X-dropoff value (in bits) for final gapped alignment
 -no_greedy
   Use non-greedy dynamic programming extension
 -min_raw_gapped_score <Integer>
   Minimum raw gapped score to keep an alignment in the preliminary gapped and
   traceback stages
 -ungapped
   Perform ungapped alignment only?
 -window_size <Integer, >=0>
   Multiple hits window size, use 0 to specify 1-hit algorithm
 -off_diagonal_range <Integer, >=0>
   Number of off-diagonals to search for the 2nd hit, use 0 to turn off
   Default = `0'

 *** Miscellaneous options
 -parse_deflines
   Should the query and subject defline(s) be parsed?
 -num_threads <Integer, (>=1 and =<144)>
   Number of threads (CPUs) to use in the BLAST search
   Default = `1'
    * Incompatible with:  remote
 -remote
   Execute search remotely?
    * Incompatible with:  gilist, seqidlist, negative_gilist,
   negative_seqidlist, subject_loc, num_threads

%%bash
/share/apps/blast/bin/blastn -db /home/pichapuk/DATABASE/RNA/rRNA_plant_ncbi_edit -outfmt "6 qseqid sseqid pident evalue bitscore length mismatch gapopen frames qstart qend qlen sstart send slen sstrand stitle" -evalue 1e-5 -word_size 11 -show_gis  -max_target_seqs 10 -num_threads 32 -out /home/annotest/test/contigs_nc.ribosomeRNA.6.blast -query /home/annotest/test/contigs_nc.fasta

more contigs_nc.ribosomeRNA.6.blast

3.3 Annotate microRNA with miRanda software¶

%%bash
/share/apps/miRanda-3.3a/src/miranda /home/pichapuk/mirbase/21/mature.fa /home/annotest/test/contigs_nc.fasta > /home/annotest/test/contigs_nc.mirbase.out

more contigs_nc.mirbase.out

4. Genome component¶

4.1 Annotate repeats with repeatMasker software¶

%%bash
perl /share/apps/RepeatMasker/RepeatMasker -h

RepeatMasker version open-4.0.7
NAME
    RepeatMasker - Mask repetitive DNA

SYNOPSIS
      RepeatMasker [-options] <seqfiles(s) in fasta format>

DESCRIPTION
    The options are:

    -h(elp)
        Detailed help

    Default settings are for masking all type of repeats in a primate
    sequence.

    -e(ngine) [crossmatch|wublast|abblast|ncbi|hmmer|decypher]
        Use an alternate search engine to the default.

    -pa(rallel) [number]
        The number of processors to use in parallel (only works for batch
        files or sequences over 50 kb)

    -s  Slow search; 0-5% more sensitive, 2-3 times slower than default

    -q  Quick search; 5-10% less sensitive, 2-5 times faster than default

    -qq Rush job; about 10% less sensitive, 4->10 times faster than default
        (quick searches are fine under most circumstances) repeat options

    -nolow /-low
        Does not mask low_complexity DNA or simple repeats

    -noint /-int
        Only masks low complex/simple repeats (no interspersed repeats)

    -norna
        Does not mask small RNA (pseudo) genes

    -alu
        Only masks Alus (and 7SLRNA, SVA and LTR5)(only for primate DNA)

    -div [number]
        Masks only those repeats < x percent diverged from consensus seq

    -lib [filename]
        Allows use of a custom library (e.g. from another species)

    -cutoff [number]
        Sets cutoff score for masking repeats when using -lib (default 225)

    -species <query species>
        Specify the species or clade of the input sequence. The species name
        must be a valid NCBI Taxonomy Database species name and be contained
        in the RepeatMasker repeat database. Some examples are:

          -species human
          -species mouse
          -species rattus
          -species "ciona savignyi"
          -species arabidopsis

        Other commonly used species:

        mammal, carnivore, rodentia, rat, cow, pig, cat, dog, chicken, fugu,
        danio, "ciona intestinalis" drosophila, anopheles, elegans,
        diatoaea, artiodactyl, arabidopsis, rice, wheat, and maize

    Contamination options

    -is_only
        Only clips E coli insertion elements out of fasta and .qual files

    -is_clip
        Clips IS elements before analysis (default: IS only reported)

    -no_is
        Skips bacterial insertion element check

    Running options

    -gc [number]
        Use matrices calculated for 'number' percentage background GC level

    -gccalc
        RepeatMasker calculates the GC content even for batch files/small
        seqs

    -frag [number]
        Maximum sequence length masked without fragmenting (default 60000,
        300000 for DeCypher)

    -nocut
        Skips the steps in which repeats are excised

    -noisy
        Prints search engine progress report to screen (defaults to .stderr
        file)

    -nopost
        Do not postprocess the results of the run ( i.e. call ProcessRepeats
        ). NOTE: This options should only be used when ProcessRepeats will
        be run manually on the results.

    output options

    -dir [directory name]
        Writes output to this directory (default is query file directory,
        "-dir ." will write to current directory).

    -a(lignments)
        Writes alignments in .align output file

    -inv
        Alignments are presented in the orientation of the repeat (with
        option -a)

    -lcambig
        Outputs ambiguous DNA transposon fragments using a lower case name.
        All other repeats are listed in upper case. Ambiguous fragments
        match multiple repeat elements and can only be called based on
        flanking repeat information.

    -small
        Returns complete .masked sequence in lower case

    -xsmall
        Returns repetitive regions in lowercase (rest capitals) rather than
        masked

    -x  Returns repetitive regions masked with Xs rather than Ns

    -poly
        Reports simple repeats that may be polymorphic (in file.poly)

    -source
        Includes for each annotation the HSP "evidence". Currently this
        option is only available with the "-html" output format listed
        below.

    -html
        Creates an additional output file in xhtml format.

    -ace
        Creates an additional output file in ACeDB format

    -gff
        Creates an additional Gene Feature Finding format output

    -u  Creates an additional annotation file not processed by
        ProcessRepeats

    -xm Creates an additional output file in cross_match format (for
        parsing)

    -no_id
        Leaves out final column with unique ID for each element (was
        default)

    -e(xcln)
        Calculates repeat densities (in .tbl) excluding runs of >=20 N/Xs in
        the query

SEE ALSO
        Crossmatch, ProcessRepeats

COPYRIGHT
    Copyright 2007-2014 Arian Smit, Institute for Systems Biology

AUTHORS
    Arian Smit <asmit@systemsbiology.org>

    Robert Hubley <rhubley@systemsbiology.org>

Option h is ambiguous (help, html)

%%bash
perl /share/apps/RepeatMasker/RepeatMasker /home/annotest/test/contigs_nc.fasta -species arabidopsis -gff

RepeatMasker version open-4.0.7
Search Engine: NCBI/RMBLAST [ 2.2.27+ ]
Master RepeatMasker Database: /home/share/apps/RepeatMasker/Libraries/RepeatMaskerLib.embl ( Complete Database: dc20170127-rb20170127 )


Building general libraries in: /home/jupyter1/.RepeatMaskerCache/dc20170127-rb20170127/general
Building species libraries in: /home/jupyter1/.RepeatMaskerCache/dc20170127-rb20170127/arabidopsis
   - 201 ancestral and ubiquitous sequence(s) for arabidopsis
   - 955 lineage specific sequence(s) for arabidopsis

analyzing file /home/annotest/test/contigs_nc.fasta

Checking for E. coli insertion elements
identifying Simple Repeats in batch 1 of 6
identifying matches to arabidopsis sequences in batch 1 of 6
identifying Simple Repeats in batch 1 of 6

Checking for E. coli insertion elements
identifying Simple Repeats in batch 2 of 6
identifying matches to arabidopsis sequences in batch 2 of 6
identifying Simple Repeats in batch 2 of 6

Checking for E. coli insertion elements
identifying Simple Repeats in batch 3 of 6
identifying matches to arabidopsis sequences in batch 3 of 6
identifying Simple Repeats in batch 3 of 6

Checking for E. coli insertion elements
identifying Simple Repeats in batch 4 of 6
identifying matches to arabidopsis sequences in batch 4 of 6
identifying Simple Repeats in batch 4 of 6

Checking for E. coli insertion elements
identifying Simple Repeats in batch 5 of 6
identifying matches to arabidopsis sequences in batch 5 of 6
identifying Simple Repeats in batch 5 of 6

Checking for E. coli insertion elements
identifying Simple Repeats in batch 6 of 6
identifying matches to arabidopsis sequences in batch 6 of 6
identifying Simple Repeats in batch 6 of 6
processing output: 
cycle 1 
cycle 2 
cycle 3 
cycle 4 
cycle 5 
cycle 6 
cycle 7 
cycle 8 
cycle 9 
cycle 10 
Generating output... 
masking
done

ls -lt

total 8244
-rw-rw-r-- 1 jupyterhub1 jupyterhub1    2421 Aug  7 18:12 contigs_nc.fasta.tbl
-rw-rw-r-- 1 jupyterhub1 jupyterhub1  410669 Aug  7 18:12 contigs_nc.fasta.masked
-rw-rw-r-- 1 jupyterhub1 jupyterhub1   25221 Aug  7 18:12 contigs_nc.fasta.out
-rw-rw-r-- 1 jupyterhub1 jupyterhub1   18649 Aug  7 18:12 contigs_nc.fasta.out.gff
-rw-rw-r-- 1 jupyterhub1 jupyterhub1  326437 Aug  7 18:12 contigs_nc.fasta.cat
-rw-rw-r-- 1 jupyterhub1 jupyterhub1 2098183 Aug  7 18:10 contigs_nc.mirbase.out
-rw-rw-r-- 1 jupyterhub1 jupyterhub1    2978 Aug  7 18:09 contigs_nc.ribosomeRNA.6.blast
-rw-rw-r-- 1 jupyterhub1 jupyterhub1    3459 Aug  7 17:59 contigs_nc.tRNAStat
-rw-rw-r-- 1 jupyterhub1 jupyterhub1     277 Aug  7 17:59 contigs_nc.tRNA
-rw-rw-r-- 1 jupyterhub1 jupyterhub1     683 Aug  7 17:59 contigs_nc.tRNACove
-rw-rw-r-- 1 jupyterhub1 jupyterhub1 4684526 Aug  7 17:52 contigs_nc.fasta.viridiplantae_prot.0.blast
-rw-rw-r-- 1 jupyterhub1 jupyterhub1  108242 Aug  7 17:46 contigs_nc.codingseq.viridiplantae_prot.6.blast
-rw-rw-r-- 1 jupyterhub1 jupyterhub1   16160 Aug  7 17:41 contigs_nc.aa
-rw-rw-r-- 1 jupyterhub1 jupyterhub1   50981 Aug  7 17:41 contigs_nc.cdsexons
-rw-rw-r-- 1 jupyterhub1 jupyterhub1   47348 Aug  7 17:41 contigs_nc.codingseq
-rw-rw-r-- 1 jupyterhub1 jupyterhub1   60302 Aug  7 17:41 contigs_nc.mrna
-rw-rw-r-- 1 jupyterhub1 jupyterhub1  139838 Aug  7 17:40 contigs_nc.gff
-rwxr-xr-x 1 annotest    annotest     408376 Aug  7 17:38 contigs_nc.fasta*

more contigs_nc.fasta.masked

more contigs_nc.fasta.tbl

more contigs_nc.fasta.out

4.2 Annotate microsatellite with misa software¶

%%bash
perl /share/apps/CABAnnot1.0/scripts/misa.pl

_______________________________________________________________________________
DESCRIPTION: Tool for the identification and localization of
             (I)  perfect microsatellites as well as
             (II) compound microsatellites (two individual microsatellites,
                  disrupted by a certain number of bases)
SYNTAX:   misa.pl <FASTA file> misa.ini
   <FASTAfile>    Single file in FASTA format containing the sequence(s).
   In order to specify the search criteria, an additional file containing
   the microsatellite search parameters is required named "misa.ini", which
   has the following structure:
     (a) Following a text string beginning with 'def', pairs of numbers are
         expected, whereas the first number defines the unit size and the
         second number the lower threshold of repeats for that specific unit.
     (b) Following a text string beginning with 'int' a single number defines
         the maximal number of bases between two adjacent microsatellites in
         order to specify the compound microsatellite type.
   Example:
     definition(unit_size,min_repeats):          1-10 2-6 3-5 4-5 5-5 6-5
     interruptions(max_difference_for_2_SSRs):   100
EXAMPLE: misa.pl seqs.fasta misa.ini
_______________________________________________________________________________

%%bash
more /share/apps/CABAnnot1.0/scripts/misa.ini

::::::::::::::
/share/apps/CABAnnot1.0/scripts/misa.ini
::::::::::::::
definition(unit_size,min_repeats):                   1-20 2-7 3-7 4-5 5-5 6-5
interruptions(max_difference_between_2_SSRs):        100

%%bash
perl /share/apps/CABAnnot1.0/scripts/misa.pl /home/annotest/test/contigs_nc.fasta /share/apps/CABAnnot1.0/scripts/misa.ini

ls -lt

total 8252
-rw-rw-r-- 1 jupyterhub1 jupyterhub1    1018 Aug  7 18:14 contigs_nc.fasta.misa
-rw-rw-r-- 1 jupyterhub1 jupyterhub1    1688 Aug  7 18:14 contigs_nc.fasta.statistics
-rw-rw-r-- 1 jupyterhub1 jupyterhub1    2421 Aug  7 18:12 contigs_nc.fasta.tbl
-rw-rw-r-- 1 jupyterhub1 jupyterhub1  410669 Aug  7 18:12 contigs_nc.fasta.masked
-rw-rw-r-- 1 jupyterhub1 jupyterhub1   25221 Aug  7 18:12 contigs_nc.fasta.out
-rw-rw-r-- 1 jupyterhub1 jupyterhub1   18649 Aug  7 18:12 contigs_nc.fasta.out.gff
-rw-rw-r-- 1 jupyterhub1 jupyterhub1  326437 Aug  7 18:12 contigs_nc.fasta.cat
-rw-rw-r-- 1 jupyterhub1 jupyterhub1 2098183 Aug  7 18:10 contigs_nc.mirbase.out
-rw-rw-r-- 1 jupyterhub1 jupyterhub1    2978 Aug  7 18:09 contigs_nc.ribosomeRNA.6.blast
-rw-rw-r-- 1 jupyterhub1 jupyterhub1    3459 Aug  7 17:59 contigs_nc.tRNAStat
-rw-rw-r-- 1 jupyterhub1 jupyterhub1     277 Aug  7 17:59 contigs_nc.tRNA
-rw-rw-r-- 1 jupyterhub1 jupyterhub1     683 Aug  7 17:59 contigs_nc.tRNACove
-rw-rw-r-- 1 jupyterhub1 jupyterhub1 4684526 Aug  7 17:52 contigs_nc.fasta.viridiplantae_prot.0.blast
-rw-rw-r-- 1 jupyterhub1 jupyterhub1  108242 Aug  7 17:46 contigs_nc.codingseq.viridiplantae_prot.6.blast
-rw-rw-r-- 1 jupyterhub1 jupyterhub1   16160 Aug  7 17:41 contigs_nc.aa
-rw-rw-r-- 1 jupyterhub1 jupyterhub1   50981 Aug  7 17:41 contigs_nc.cdsexons
-rw-rw-r-- 1 jupyterhub1 jupyterhub1   47348 Aug  7 17:41 contigs_nc.codingseq
-rw-rw-r-- 1 jupyterhub1 jupyterhub1   60302 Aug  7 17:41 contigs_nc.mrna
-rw-rw-r-- 1 jupyterhub1 jupyterhub1  139838 Aug  7 17:40 contigs_nc.gff
-rwxr-xr-x 1 annotest    annotest     408376 Aug  7 17:38 contigs_nc.fasta*

more contigs_nc.fasta.misa

more contigs_nc.fasta.statistics

5. Convert result to GFF3 file format¶

5.1 Convert blast output (outfmt=6) table to gff3 file¶

%%bash
perl /share/apps/CABAnnot1.0/scripts/blast2gff3.pl /home/annotest/test/contigs_nc.codingseq.viridiplantae_prot.6.blast /home/annotest/test/contigs_nc.gff

38--sequence blast hit
48--multiple hit

more contigs_nc.codingseq.viridiplantae_prot.6.annotate.gff

5.2 Convert tRNA output to gff3 file¶

%%bash
perl /share/apps/CABAnnot1.0/scripts/trna2gff3.pl /home/annotest/test/contigs_nc.tRNA

more contigs_nc.tRNA.gff

5.3 Convert rRNA blast output to gff3 file¶

%%bash
perl /share/apps/CABAnnot1.0/scripts/rrna2gff3.pl /home/annotest/test/contigs_nc.ribosomeRNA.6.blast

more contigs_nc.ribosomeRNA.6.blast.gff

5.4 Convert repeatMasker output to gff3 file¶

%%bash
perl /share/apps/CABAnnot1.0/scripts/repeatMasker2gff3.pl /home/annotest/test/contigs_nc.fasta.out

more contigs_nc.fasta.out.gff

5.5 Convert misa output to gff3 file¶

%%bash
perl /share/apps/CABAnnot1.0/scripts/misa_gff3_converter.pl --input /home/annotest/test/contigs_nc.fasta.misa --output /home/annotest/test/contigs_nc.misa.gff

more contigs_nc.misa.gff

5. CAB-Inhouse annotation pipeline¶

Make directory annotation at your home directory /home/annotest

cd /home/annotest/

/home/annotest

mkdir annotation

cd /home/annotest/annotation

/home/annotest/annotation

*** Check point for participants (your home dir)¶

pwd

'/home/annotest/annotation'

cp /home/annotest/AnnotationData/contigs.fa /home/annotest/annotation

Query sequences longer than 11900 bp¶

%%bash
/share/apps/bbmap/reformat.sh in=contigs.fa out=contigs.cut.fa minlength=11900

screen -S annotation

%%bash
/share/apps/CABAnnot1.0/bin/CABAnnot.pl -h

NAME
      CABAnnot1.0: CAB Annotation pipeline

SYNOPSIS
      perl CABAnnot.pl --input sequences.fasta --sequenceType "nuclear" --organismName "Tectona grandis" --version 1

      perl CABAnnot.pl --input sequences.fasta --sequenceType "mitochondrial" --organismName "Tectona grandis" --version 1 --runSequencePreprocess 0 --runSequenceAnalysis 1  --chooseCodingProgram 3

      perl CABAnnot.pl --input TgS01V4_13Nov17.fasta --sequenceType "chloroplast" --organismName "Tectona grandis" --version 1 --runSequencePreprocess 1 --runSequenceAnalysis 2 --chooseNoncodingProgram "1,2,4"

DESCRIPTION
      Sequence analysis program includes preprocessing, coding, and noncoding sequence analysis. 

      Last update: 16 January 2016

OPTIONS
    --input, -i An input file in fasta format. [STRING]
    --sequenceType, -t "nuclear", "chloroplast", "mitochondrial" [STRING]
    Default value is "nuclear".
    --organismName, -o "Tactona gradis", "Arabidopsis thaliana" [STRING]
    Default value is "Tectona grandis".
    --version, -v Version of genomic dataset 1, 2, 3, 4, 5, 6, .. n [INTEGER]
    Default value is "1".
    --runSequencePreprocess, -rsp Choose sequence preprocess 1=turn on, 0=turn
    off [INTEGER] Default value is "1". This sequence preprocess step is
    required before running sequence analysis.
    --runSequenceAnalysis, -rsa Choose analysis 0=run all analyses, 1=run only
    coding analysis, 2=run only noncoding analysis [INTEGER] Default value is
    "0".
    --chooseCodingProgram, -ccp Choose program 0=All, 1=Augustus, 2=BLAST,
    3=BLASTNR [INTEGER] Default value is "0".
    --chooseNoncodingProgram, -cnp Choose program 0, 1, 2, 3, 4, 5, "1,2",
    "1,3,4" [STRING] 0=All programs, 1=RepeatMasker, 2=tRNAScan-SE,
    3=ribosomalRNA, 4=misa, 5=microRNA Default value is "0"
    --augustusSpecies, -as "tomato", "arabidopsis", "human" [STRING] Default
    value is tomato for Augustus program.
    --proteinBlastDB, -pbd Take the path and file of local database. [STRING]
    Default value is "/DATABASE/PLANT/viridiplantae_prot".
    --nrBlastDB, -nbd Take the path of non-redundant protein database [STRING]
    Default value is "/DATABASE/NR/nr".
    --rrnaBlastDB, -rbd Take the path of ribosomalRNA database. [STRING]
    Default value is "/DATABASE/RNA/rRNA_plant_ncbi_edit".
    --evalue, -e Evalue for Blastx program [STRING] Default value is "1e-5".
    --numThreads, -n Number of threads for BLAST search program [INTEGER]
    Default value is "32"
    --maxTagetSeqs, -m Type the maximun target sequence number 1..400
    [INTEGER] Default value is "10".
    --help, -h Prints this help message.

AUTHOR
    Uthaipaisanwong P.

    PRESS "Q" to quit help.

Run CABAnnot1.0¶

#/share/apps/CABAnnot1.0/bin/CABAnnot.pl --input contigs.cut.fa --sequenceType "nuclear" --organismName "Solanum lycopersicum" --version 1

/share/apps/CABAnnot1.0/bin/CABAnnot.pl --input contigs.cut.fa --sequenceType "nuclear" --organismName "Solanum lycopersicum" --version 1 --runSequencePreprocess 1 --runSequenceAnalysis 0 --chooseNoncodingProgram "1,2,3,4"

“Ctrl-A” and “d“

top

6. Genome visualization by JBrowse¶

Format reference sequences for JBrowse

cd /home/annotest/public_html

/home/annotest/public_html

Make directory "Tomato" at /home/user/public_html

mkdir Tomato

cp /home/annotest/annotation/* /home/annotest/public_html/Tomato/

Change directory to /home/user/public_html/Tomato

cd /home/annotest/public_html/Tomato

/home/annotest/public_html/Tomato

ls -al

Change directory to /home/user/public_html/JBrowse

cd /home/annotest/public_html/JBrowse

Configure JBrowse *** change annotest to your username

/home/annotest/public_html/JBrowse/bin/prepare-refseqs.pl --fasta /home/annotest/public_html/Tomato/contigs_nc.fasta --out /home/annotest/public_html/Tomato/TomatoV1

/home/annotest/public_html/JBrowse/bin/flatfile-to-json.pl --gff /home/annotest/public_html/Tomato/contigs_nc.codingseq.nr.6.annotate.gff --trackLabel "Gene" --trackType "JBrowse/View/Track/CanvasFeatures" --className "feature5" --key "gene" --subfeatureClasses '{"CDS": "transcript-CDS", "exon": "transcript-exon"}' --out /home/annotest/public_html/Tomato/TomatoV1

/home/annotest/public_html/JBrowse/bin/flatfile-to-json.pl --trackLabel "tRNA" --trackType "JBrowse/View/Track/CanvasFeatures" --key "tRNA" --className generic_parent --subfeatureClasses '{"match_part" : "feature"}' --gff /home/annotest/public_html/Tomato/contigs_nc.tRNA.gff --out /home/annotest/public_html/Tomato/TomatoV1

/home/annotest/public_html/JBrowse/bin/flatfile-to-json.pl --trackLabel "repeat masker" --trackType CanvasFeatures --key RepeatMasker --className generic_parent --subfeatureClasses '{"match_part" : "feature"}' -gff /home/annotest/public_html/Tomato/contigs_nc.fasta.out.gff --out /home/annotest/public_html/Tomato/TomatoV1

Display annotation result

http://www.breedserve.cab.kps.ku.ac.th/~annotest/JBrowse/?data=../Tomato/TomatoV1