Genome assembly and annotation

  • Understanding the Next Generation Sequencing (NGS) data.
  • How to download NGS data from sequencing service provider server
  • How to transfer NGS data to HPC server
  • How to use public NGS data
  • How to do basic NGS data analysis
  • Variant calling

                                                     Document by Tongyoo P.

Coverage / Read Count Calculator

Calculate how much sequencing you need to hit a target depth of coverage (or vice versa)

http://apps.bioconnector.virginia.edu/covcalc/

Get public NGS data

What is SRA? http://www.ncbi.nlm.nih.gov/sra/

Using basic unix command for downloading file from remote server

Remainder of path: /sra/sra-instant/reads/ByRun/sra/{SRR|ERR|DRR}/<first 6 characters of accession>//.sra Where {SRR|ERR|DRR} should be either ‘SRR’, ‘ERR’, or ‘DRR’ and should match the prefix of the target .sra file

Example data set: SRR5714433 https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR5714433

In [ ]:
wget ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR571/SRR5714433/SRR5714433.sra

Using sratoolkit function

In [ ]:
prefetch -v SRR5714433
In [ ]:
fastq-dump --split-3 SRR5714433.sra

QC and sequence manipulation

Fastqc: ckeck quality http://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc

In [ ]:
fastqc read.fastq

sickle - A windowed adaptive trimming tool for FASTQ files using quality https://github.com/najoshi/sickle

sickle pe/se [-t ] : solexa (CASAVA < 1.3), illumina (CASAVA 1.3 to 1.7), sanger (which is CASAVA >= 1.8) [-l ] : default 20
[-q ] : default 20

In [ ]:
sickle pe -t sanger -f read1.fastq -r read2.fastq -o qtrim1.fastq -p qtrim2.fastq - s qtrim.unpaired.fastq
In [ ]:
sickle se -t sanger  -f input_file.fastq -o trimmed_output_file.fastq   

Trimmomatic: A flexible read trimming tool for Illumina NGS data http://www.usadellab.org/cms/?page=trimmomatic

In [ ]:
java -jar /share/apps/Trimmomatic-0.38/trimmomatic-0.38.jar PE -phred33 input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

This will perform the following:

Remove adapters (ILLUMINACLIP:TruSeq3-PE.fa:2:30:10) Remove leading low quality or N bases (below quality 3) (LEADING:3) Remove trailing low quality or N bases (below quality 3) (TRAILING:3) Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15 (SLIDINGWINDOW:4:15) Drop reads below the 36 bases long (MINLEN:36)

Mapping fastq to assembled scaffold, variant calling process

BWA: Generate bam file (binary NGS mapping file) samtools pipeline: Using bam files for SNP and indel detection

In [ ]:
cd
In [ ]:
mkdir vcf
In [ ]:
cd vcf
In [ ]:
pwd

Make a link to fastq and reference fasta

In [ ]:
ln -s /home/pumie/bam/contigs_nc.fasta .
In [ ]:
ln -s /home/pumie/bam/ch11_read1.fq.gz .
In [ ]:
ln -s /home/pumie/bam/ch11_read2.fq.gz .
In [ ]:
ls -lh

Building index for reference sequence

In [ ]:
seqkit seq -w 70 contigs_nc.fasta > contigs_nc70.fasta
In [ ]:
mv contigs_nc.fasta contigs_nc.fasta_ori
In [ ]:
mv contigs_nc70.fasta contigs_nc.fasta
In [ ]:
bwa index -p idx -a bwtsw contigs_nc.fasta

align forward reads to reference sequence

In [ ]:
bwa aln -t 4 idx ch11_read1.fq.gz > ch11f.bwa

align reverse reads to reference sequence

In [ ]:
bwa aln -t 4 idx ch11_read2.fq.gz > ch11r.bwa

Combine forward and reverse read. Create SAM file

In [ ]:
bwa sampe idx ch11f.bwa ch11r.bwa ch11_read1.fq.gz ch11_read2.fq.gz > ch11.sam

Manipulate reference sequence to equally line length

Convert sam to bam and SNP calling process using samtools Building index for reference sequence

In [ ]:
samtools faidx contigs_nc.fasta
In [ ]:
samtools import contigs_nc.fasta ch11.sam ch11.bam
In [ ]:
samtools sort ch11.bam -o ch11.srt.bam -@ 4
In [ ]:
samtools index ch11.srt.bam

View alignment result using tview

In [ ]:
samtools tview ch11.srt.bam contigs_nc.fasta

Variant calling using mpileup tools

samtools mpileup -uf ref.fasta S01.bam S02.bam S03.bam | bcftools call -mv --threads 4 -> snp.vcf

In [ ]:
samtools mpileup -uf contigs_nc.fasta ch11.srt.bam | bcftools call -mv --threads 4 -> ch11snp.vcf

View result in JBrowse

  • vcf file : ch11snp.vcf
  • bam alignment : ch11.srt.bam, ch11.srt.bam.bai
In [ ]:
cp ch11snp.vcf ch11.srt.bam ch11.srt.bam.bai ../public_html/Tomato
In [ ]:
cd ../public_html/Tomato/
In [ ]:
ls -lha

Manipulate vcf for viewing in JBrowse

  • make gz file
  • indexing using tabix
In [ ]:
bgzip ch11snp.vcf
In [ ]:
tabix -p vcf ch11snp.vcf.gz
In [ ]:
cp /home/pumie/bam/tracks.conf TomatoV1/