AMPtk documentation

AMPtk Overview

[Figure: AMPtk workflow overview]

AMPtk Quick Start

AMPtk can process NGS amplicon data from the 454, Illumina, and Ion Torrent sequencing platforms. In order to get started, you will need to know how your amplicon reads were generated, i.e. which primers you used, whether the reads are PE or SE, and where the amplicons have been barcoded. For more information see the Read Layout section of the documentation.

All AMPtk commands show a help menu when they are run without any parameters.

AMPtk comes with some very small test datasets; here we will go through a standard Illumina MiSeq test run. The data are located in the test_data folder of the amptk installation directory. Move to that folder to begin.
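
If you installed AMPtk from the GitHub source, one way to get there is the following (a sketch assuming a git checkout; adjust the path if you installed AMPtk another way):

#grab the source repository, which ships with the test_data folder
git clone https://github.com/nextgenusfs/amptk.git
cd amptk/test_data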

Pre-Processing Reads

The first step is to pre-process or demultiplex your reads. In this example, the data are PE MiSeq (2 x 300 bp) that have been processed by the sequencing center with the Illumina program bcl2fastq, so each sample has a forward and a reverse read file (denoted by _R1 and _R2 in the sample name). The folder illumina_test_data contains 3 PE samples we will use for the demo. Since the data are in this format, we will first use the amptk illumina command. Note that some sequencing centers distribute their data in different formats; amptk illumina2 and amptk illumina3 are additional methods that can be used to pre-process Illumina data.

#run the pre-processing script, specifying primers used
amptk illumina -i illumina_test_data -o miseq -f fITS7 -r ITS4

-------------------------------------------------------
[03:53:16 PM]: OS: MacOSX 10.12.6, 8 cores, ~ 16 GB RAM. Python: 2.7.12
[03:53:16 PM]: AMPtk v1.0.0, USEARCH v9.2.64, VSEARCH v2.4.4
[03:53:16 PM]: Gzipped files detected, uncompressing
[03:53:16 PM]: Merging Overlaping Pairs using USEARCH
[03:53:16 PM]: working on sample 301-1 (Read Length: 300)
[03:53:17 PM]: 100 reads passed (100.0%)
[03:53:17 PM]: working on sample 301-2 (Read Length: 300)
[03:53:17 PM]: 100 reads passed (100.0%)
[03:53:17 PM]: working on sample spike (Read Length: 300)
[03:53:17 PM]: 400 reads passed (100.0%)
[03:53:17 PM]: Stripping primers and trim to 300 bp
[03:53:17 PM]: splitting the job over 8 cpus, but this may still take awhile
[03:53:17 PM]: Foward primer: GTGARTCATCGAATCTTTG,  Rev comp'd rev primer: GCATATCAATAAGCGGAGGA
-------------------------------------------------------
[03:53:18 PM]: Concatenating Demuxed Files
[03:53:18 PM]: 600 total reads
[03:53:18 PM]: 600 Fwd Primer found, 426 Rev Primer found
[03:53:18 PM]: 0 discarded too short (< 100 bp)
[03:53:18 PM]: 600 valid output reads
[03:53:18 PM]: Found 3 barcoded samples
                        Sample:  Count
                         spike:  400
                         301-1:  100
                         301-2:  100
[03:53:18 PM]: Output file:  miseq.demux.fq.gz (53.1 KB)
[03:53:18 PM]: Mapping file: miseq.mapping_file.txt
-------------------------------------------------------

You should see that the script found 600 valid reads and output them to a file called miseq.demux.fq.gz. Note that it also wrote a QIIME-like mapping file, miseq.mapping_file.txt, which can later be used to attach metadata and generate a BIOM file (during the taxonomy assignment step). You can find a detailed log file in miseq.amptk-demux.log. The script has merged the PE reads using USEARCH, removed phiX spike-in reads, removed the forward and reverse primers, relabeled the FASTQ headers, and concatenated all samples together. Note that the default settings require reads to have a valid forward primer; if you used a custom sequencing primer you will need to turn this setting off with --require_primer off.
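
For example, the same run processed with the primer requirement disabled would look like this:

#reads may not begin with the forward primer, so do not require it
amptk illumina -i illumina_test_data -o miseq -f fITS7 -r ITS4 --require_primer off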

Clustering Data

We can now take this output and run the UPARSE clustering algorithm. The steps of UPARSE are: quality filter with expected errors, dereplicate, sort by abundance, cluster at 97% identity to create OTUs, and map the original reads back to the OTUs to build the OTU table.

amptk cluster -i miseq.demux.fq.gz -o miseq

-------------------------------------------------------
[03:54:29 PM]: OS: MacOSX 10.12.6, 8 cores, ~ 16 GB RAM. Python: 2.7.12
[03:54:29 PM]: AMPtk v1.0.0, USEARCH v9.2.64, VSEARCH v2.4.4
[03:54:29 PM]: Loading FASTQ Records
[03:54:29 PM]: 600 reads (53.1 KB)
[03:54:29 PM]: Quality Filtering, expected errors < 1.0
[03:54:29 PM]: 394 reads passed
[03:54:29 PM]: De-replication (remove duplicate reads)
[03:54:29 PM]: 112 reads passed
[03:54:29 PM]: Clustering OTUs (UPARSE)
[03:54:29 PM]: 32 OTUs
[03:54:29 PM]: Cleaning up padding from OTUs
[03:54:29 PM]: Mapping Reads to OTUs and Building OTU table
[03:54:30 PM]: 429 reads mapped to OTUs (72%)
-------------------------------------------------------
OTU Clustering Script has Finished Successfully
-------------------------------------------------------
Clustered OTUs: /Users/jon/amptk/test_data/miseq.cluster.otus.fa
OTU Table: /Users/jon/amptk/test_data/miseq.otu_table.txt
-------------------------------------------------------

Filtering Data

Since we included a mock community in this run (the spike sample), we will filter for index-bleed. Note that in this toy example there is no index-bleed.

amptk filter -i miseq.otu_table.txt -f miseq.cluster.otus.fa -b spike -m mock2

-------------------------------------------------------
[03:56:53 PM]: OS: MacOSX 10.12.6, 8 cores, ~ 16 GB RAM. Python: 2.7.12
[03:56:54 PM]: AMPtk v1.0.0, USEARCH v9.2.64, VSEARCH v2.4.4
[03:56:54 PM]: Loading OTU table: miseq.otu_table.txt
[03:56:54 PM]: OTU table contains 32 OTUs and 429 read counts
[03:56:54 PM]: Mapping OTUs to Mock Community (USEARCH)
[03:56:54 PM]: Sorting OTU table naturally
[03:56:54 PM]: Removing OTUs according to --min_reads_otu: (OTUs with less than 2 reads from all samples)
[03:56:54 PM]: Normalizing OTU table to number of reads per sample
[03:56:54 PM]: Index bleed, mock into samples: 0.000000%.  Index bleed, samples into mock: 0.000000%.
[03:56:54 PM]: No spike-in mock (-b) or index-bleed (-p) specified, thus not running index-bleed filtering
[03:56:54 PM]: spike sample has 24 OTUS out of 26 expected; 0 mock variants; 0 mock chimeras; Error rate: 0.040%
[03:56:54 PM]: Filtering OTU table down to 8 OTUs and 136 read counts
[03:56:54 PM]: Filtering valid OTUs
-------------------------------------------------------
OTU Table filtering finished
-------------------------------------------------------
OTU Table Stats:      miseq.stats.txt
Sorted OTU table:     miseq.sorted.txt
Normalized/filter:    miseq.normalized.txt
Final Binary table:   miseq.final.binary.txt
Final OTU table:      miseq.final.txt
Filtered OTUs:        miseq.filtered.otus.fa
-------------------------------------------------------

Post clustering LULU

AMPtk can run the LULU post-clustering tool, which can identify potential errors in a dataset. This step is optional.

amptk lulu -i miseq.final.txt -f miseq.filtered.otus.fa -o miseq

-------------------------------------------------------
[Mar 07 11:47 AM]: OS: MacOSX 10.13.3, 8 cores, ~ 16 GB RAM. Python: 2.7.14
[Mar 07 11:47 AM]: AMPtk v1.1.0, USEARCH v9.2.64, VSEARCH v2.6.2
[Mar 07 11:47 AM]: R v3.4.1; LULU v0.1.0
[Mar 07 11:47 AM]: Loading 8 OTUs
[Mar 07 11:47 AM]: Generating pairwise percent identity between OTUs using VSEARCH at 84% identity
[Mar 07 11:47 AM]: Running LULU algorithm
[Mar 07 11:47 AM]: LULU has merged 1 OTUs, output data contains 7 OTUs
[Mar 07 11:47 AM]: LULU OTU table post processing finished
----------------------------------
OTU table:  miseq.lulu.otu_table.txt
OTU FASTA:  miseq.lulu.otus.fa
LULU map:   miseq.lulu.otu-map.txt
----------------------------------

Assign Taxonomy

We can now assign taxonomy to our OTUs and create the final BIOM output file.

amptk taxonomy -f miseq.lulu.otus.fa -i miseq.lulu.otu_table.txt -m miseq.mapping_file.txt -d ITS2 -o miseq

-------------------------------------------------------
[Mar 07 12:51 PM]: OS: MacOSX 10.13.3, 8 cores, ~ 16 GB RAM. Python: 2.7.14
[Mar 07 12:51 PM]: AMPtk v1.1.0, USEARCH v9.2.64, VSEARCH v2.6.2
[Mar 07 12:51 PM]: Loading FASTA Records
[Mar 07 12:51 PM]: 7 OTUs
[Mar 07 12:51 PM]: Global alignment OTUs with usearch_global (USEARCH)
[Mar 07 12:51 PM]: Classifying OTUs with UTAX (USEARCH)
[Mar 07 12:51 PM]: Classifying OTUs with SINTAX (USEARCH)
[Mar 07 12:52 PM]: Appending taxonomy to OTU table and OTUs
[Mar 07 12:52 PM]: Generating phylogenetic tree
[Mar 07 12:52 PM]: Taxonomy finished: miseq.taxonomy.txt
[Mar 07 12:52 PM]: Classic OTU table with taxonomy: miseq.otu_table.taxonomy.txt
[Mar 07 12:52 PM]: BIOM OTU table created: miseq.biom
[Mar 07 12:52 PM]: OTUs with taxonomy: miseq.otus.taxonomy.fa
[Mar 07 12:52 PM]: OTU phylogeny: miseq.tree.phy
-------------------------------------------------------

And we now have an OTU table complete with taxonomy:

#OTU ID     301-1   301-2   spike   Taxonomy
OTU1        0       57      0       GSL|100.0|KX857803;k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Hymenochaetales,f:Schizoporaceae
OTU10       0       0       16      GSL|100.0|KF169595;k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Polyporales,f:Fomitopsidaceae,g:Fomitopsis
OTU11       0       0       14      GS|100.0|KP055600;k:Fungi,p:Ascomycota,c:Leotiomycetes,o:Helotiales,f:Vibrisseaceae,g:Phialocephala,s:Phialocephala lagerbergii
OTU12       0       0       14      GS|100.0|KU668958;k:Fungi,p:Ascomycota
OTU13       0       0       10      GS|99.7|EU402583;k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Polyporales,f:Polyporaceae,g:Leptoporus,s:Leptoporus mollis
OTU14       0       0       10      GSL|100.0|KP135060;k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Polyporales
OTU15       0       0       10      GS|100.0|KU668956;k:Fungi,p:Ascomycota,c:Eurotiomycetes,o:Eurotiales,f:Trichocomaceae,g:Penicillium,s:Penicillium nothofagi
OTU16       0       0       13      GS|100.0|KM493837;k:Fungi
OTU17       0       0       10      GS|100.0|KU668960;k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Polyporales,f:Fomitopsidaceae,g:Laetiporus,s:Laetiporus caribensis
OTU18       0       0       11      GS|100.0|KY886708;k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Polyporales,f:Fomitopsidaceae,g:Laetiporus,s:Laetiporus cremeiporus
OTU19       0       0       10      GS|100.0|KU668959;k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Polyporales,f:Polyporaceae,g:Wolfiporia,s:Wolfiporia dilatohypha
OTU2        36      0       0       GSL|100.0|HQ222028;k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Agaricales,f:Strophariaceae,g:Pholiota
OTU20       7       3       0       GS|100.0|AY465463;k:Fungi,p:Ascomycota,c:Eurotiomycetes,o:Chaetothyriales,f:Herpotrichiellaceae,g:Phialophora
OTU21       0       0       10      GS|100.0|KU668966;k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Polyporales,f:Phanerochaetaceae,g:Antrodiella,s:Antrodiella semisupina
OTU22       0       0       9       GSL|100.0|AY340039;k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Hymenochaetales,f:Hymenochaetaceae,g:Phellinus,s:Phellinus cinereus
OTU23       0       0       9       GS|100.0|KM373239;k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Polyporales,f:Polyporaceae,g:Trametes,s:Trametes gibbosa
OTU24       0       0       6       GS|100.0|EU402560;k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Polyporales,f:Fomitopsidaceae,g:Laetiporus,s:Laetiporus sulphureus
OTU25       0       0       8       GS|100.0|KU668954;k:Fungi,p:Mortierellomycota
OTU26       0       0       9       GS|100.0|DQ398958;k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Corticiales,f:Punctulariaceae,g:Punctularia,s:Punctularia strigosozonata
OTU27       19      0       0       GS|100.0|DQ647503;k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Hymenochaetales,g:Peniophorella,s:Peniophorella pubera
OTU28       0       0       10      GS|100.0|KU668961;k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Polyporales,f:Fomitopsidaceae,g:Laetiporus,s:Laetiporus persicinus
OTU29       0       4       0       US|0.8150|LN827700;k:Fungi,p:Ascomycota,c:Sordariomycetes,o:Sordariales
OTU3        0       0       21      GS|100.0|KU668970;k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Polyporales,f:Meruliaceae,g:Bjerkandera,s:Bjerkandera adusta
OTU30       0       5       0       GSL|100.0|KF928438;k:Fungi,p:Ascomycota,c:Eurotiomycetes,o:Chaetothyriales,f:Herpotrichiellaceae,g:Exophiala,s:Exophiala cancerae
OTU31       2       0       0       GS|100.0|KX221389;k:Fungi
OTU32       1       2       0       GS|100.0|JX946684;k:Fungi,p:Ascomycota,c:Dothideomycetes,o:Venturiales,f:Venturiaceae,g:Protoventuria,s:Protoventuria alpina
OTU4        0       0       14      GS|100.0|KU668955;k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Hymenochaetales,f:Schizoporaceae,g:Schizopora
OTU5        0       0       15      GS|100.0|KU668973;k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Polyporales,f:Phanerochaetaceae,g:Phanerochaete,s:Phanerochaete laevis
OTU6        0       0       17      GS|100.0|KU668975;k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Polyporales,f:Polyporaceae,g:Leptoporus,s:Leptoporus mollis
OTU7        0       0       17      GS|100.0|KU668968;k:Fungi,p:Ascomycota,c:Leotiomycetes,o:Helotiales
OTU8        0       0       17      GSL|100.0|KP216946;k:Fungi,p:Ascomycota,c:Eurotiomycetes,o:Eurotiales,f:Trichocomaceae,g:Aspergillus
OTU9        0       0       13      GS|100.0|EU402553;k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Polyporales,f:Fomitopsidaceae,g:Laetiporus,s:Laetiporus gilbertsonii

We now also have a BIOM file complete with any metadata and taxonomy (miseq.biom). This file can be loaded into downstream processing software.

AMPtk file formats

NGS Sequencing Data

AMPtk handles FASTQ or gzipped FASTQ input. Platform-specific formats, i.e. SFF from 454 and unaligned BAM from Ion Torrent, are also supported by their respective scripts.

Mapping file (QIIME-like)

The pre-processing scripts in AMPtk will automatically produce a QIIME-like mapping file for you if one is not specified in the command line arguments. The format is similar to that used in QIIME, although it has fewer formatting rules. This file is intended to be used as a metadata file with the amptk taxonomy script. Below is an example of a mapping file for a dual indexed PE MiSeq run:

#SampleID   BarcodeSequence LinkerPrimerSequence    ReversePrimer   phinchID        Treatment
301-1       TCCGGAGA-CCTATCCT       GTGARTCATCGAATCTTTG     TCCTCCGCTTATTGATATGC    301-1   no_data
301-2       TCCGGAGA-GGCTCTGA       GTGARTCATCGAATCTTTG     TCCTCCGCTTATTGATATGC    301-2   no_data
spike       CGCTCATT-GGCTCTGA       GTGARTCATCGAATCTTTG     TCCTCCGCTTATTGATATGC    spike   no_data

A properly formatted AMPtk mapping file must contain at least the first 5 columns shown above. The pre-processing scripts will parse the mapping file and grab the primer sequences. Ion Torrent and 454 mapping files should have the following format:

#SampleID   BarcodeSequence LinkerPrimerSequence    ReversePrimer   phinchID        Treatment       Location
BC.1        CTAAGGTAAC      CCATCTCATCCCTGCGTGTCTCCGACTCAGCTAAGGTAACAGTGARTCATCGAATCTTTG    TCCTCCGCTTATTGATATGC    BC.1    Treatment1      Woods
BC.2        TAAGGAGAAC      CCATCTCATCCCTGCGTGTCTCCGACTCAGTAAGGAGAACAGTGARTCATCGAATCTTTG    TCCTCCGCTTATTGATATGC    BC.2    Treatment1      Woods
BC.3        AAGAGGATTC      CCATCTCATCCCTGCGTGTCTCCGACTCAGAAGAGGATTCAGTGARTCATCGAATCTTTG    TCCTCCGCTTATTGATATGC    BC.3    Treatment2      Field
BC.4        TACCAAGATC      CCATCTCATCCCTGCGTGTCTCCGACTCAGTACCAAGATCAGTGARTCATCGAATCTTTG    TCCTCCGCTTATTGATATGC    BC.4    Treatment1      Field
BC.5        CAGAAGGAAC      CCATCTCATCCCTGCGTGTCTCCGACTCAGCAGAAGGAACAGTGARTCATCGAATCTTTG    TCCTCCGCTTATTGATATGC    BC.5    Treatment2      Woods
BC.6        CTGCAAGTTC      CCATCTCATCCCTGCGTGTCTCCGACTCAGCTGCAAGTTCAGTGARTCATCGAATCTTTG    TCCTCCGCTTATTGATATGC    BC.6    Treatment2      Woods
BC.7        TTCGTGATTC      CCATCTCATCCCTGCGTGTCTCCGACTCAGTTCGTGATTCAGTGARTCATCGAATCTTTG    TCCTCCGCTTATTGATATGC    BC.7    Treatment2      Field

Note that the barcode is nested within the ‘LinkerPrimerSequence’ column. In this example, there are two metadata columns (Treatment and Location).

Barcode FASTA files

Barcode FASTA files for pre-processing should be in standard FASTA format:

>Sample1
CTAAGGTAAC
>Sample2
TAAGGAGAAC
>Sample3
AAGAGGATTC
>Sample4
TACCAAGATC
>Sample5
CAGAAGGAAC
>Mock
CTGCAAGTTC
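
A barcode file in this format is passed to the pre-processing scripts with --barcode_fasta; for example (the same flags appear in the Ion Torrent examples later in this document):

#demultiplex an Ion Torrent run using the barcode file above
amptk ion -i mydata.bam -o mydata -f fITS7 -r ITS4 --barcode_fasta barcodes.fa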

AMPtk Pre-Processing

One of the most critical steps in analyzing amplicon NGS data is the initial processing in a pipeline, i.e. finding barcodes/indexes, removing primers, removing phiX, trimming/padding reads to a desired length, etc. In AMPtk I refer to these as pre-processing steps, and the method differs slightly depending on the sequencing platform you used as well as the experimental design (primer design, sequencing primers, etc). By being thorough in the pre-processing steps you can ensure that only high quality data move through your pipeline, thereby reducing spurious and/or contaminating OTUs. Quality filtering in AMPtk is done at the clustering step and utilizes expected-errors trimming, thus pre-processing is mainly focused on removing sequencing artefacts from the amplicon reads, such as primer and barcode sequences. Additionally, since expected-errors quality trimming is used, the length of the amplicon sequences contributes significantly to this calculation, so care must be taken with amplicons of variable length, such as the internal transcribed spacer (ITS) region used for characterization of fungal communities.

Processing Ion Torrent and 454 Data

Roche 454 and Ion Torrent PGM machines are flowgram-based sequencers that have a similar experimental setup but differ in their nucleotide detection mechanisms. In terms of processing data from these platforms, the pre-processing steps are similar. Flowgram-based sequencers sequence the amplicon in one direction, thus generating single-end (SE) reads. Multiplexing on these instruments is done through incorporation of a unique barcode sequence in-between the system-specific adapter sequences and the amplicon-specific primer. Therefore, reads have the following structure:

5' - Barcode:For_Primer:amplicon:Rev_primer:adapter - 3'

Typically this read structure is produced via fusion primers that generally look like the following:

#Ion Torrent Forward primer (Primer A:Key Sequence (TCAG)):
5'-CCATCTCATCCCTGCGTGTCTCCGACTCAG-[Barcode]--template-specific-primer-3'

#Ion Torrent Reverse primer (Primer P1):
5'-CCTCTCTATGGGCAGTCGGTGAT--template-specific-primer-3'

This means that we need to pre-process our reads by identifying the unique barcode at the 5’ end of the read, which tells us which multiplexed sample the read belongs to. Then we need to remove the forward and reverse primers from the reads, keeping in mind that with very long amplicons we may not be able to get high quality sequence all the way to the 3’ end of the read. However, in a high quality read we will be able to identify the barcode and find the forward primer -> if we are unable to do this, then the read should be discarded.

The next step in many amplicon pipelines is to truncate the reads to a set length for downstream clustering. While this approach works when amplicon pools are identical or very close to the same length (such as bacterial 16S and/or LSU sequences), it becomes problematic for amplicons that are variable in length (such as the fungal ITS region). Metabarcoding schemes for fungi use either the ITS1 or ITS2 region, which are each on average ~250 bp, however sequence lengths for each region can vary from 125 bp up to more than 600 bp. Therefore, truncating reads at a set length causes one of two problems: i) reads shorter than the truncation length are discarded, resulting in loss of real biological sequences, or ii) truncated sequences lose their informative length as well as information useful for taxonomy assignment. To address this, AMPtk removes both forward and reverse primers and then truncates reads ONLY if they are longer than a set length (controlled by the -l, --trim_len argument). The default value is 300 bp, a length at which you can still get reliable read quality while maintaining as much information as possible for clustering/taxonomy.

Summary: for Ion Torrent and 454 platforms, AMPtk pre-processing runs the following steps:

  1. identify unique barcode (index) sequence
  2. rename sequence header with sample name from barcode
  3. find forward and reverse primers
  4. remove (trim) barcode and primer sequences
  5. if sequence is longer than --trim_len, truncate sequence

These steps are executed with the following commands:

#process Ion Torrent Data directly from unaligned BAM file
amptk ion -i mydata.bam -o mydata -f fITS7 -r ITS4 --barcode_fasta barcodes.fa

#process Ion Torrent Data from fastq.gz and trim to 250 bp, specify primer sequences
amptk ion -i mydata.fastq.gz -o mydata -l 250 -m my_mapping_file.txt \
    -f GTGARTCATCGAATCTTTG -r TCCTCCGCTTATTGATATGC

#process 454 data from SFF output
amptk 454 --sff MyRun.sff -o mydata -l 270 --barcode_fasta barcodes.fa

#process 454 data from fasta + qual, barcoded on both 5' and 3' ends
amptk 454 --fasta MyRun.fasta --qual MyRun.qual -o mydata \
    --barcode_fasta fwdbarcodes.fa --reverse_barcode revbarcodes.fa

Processing Illumina Data

In contrast to the SE reads generated from 454 and Ion Torrent, Illumina data has the added benefit of paired-end (PE) reads. Initially (ca 2013), Illumina data was limited to short read lengths (< 150 bp); however, starting with the Illumina MiSeq platform read lengths quickly increased, and currently Illumina can deliver 2 x 300 bp reads, which offers a combined read length longer than Ion Torrent (400-500 bp). One way to generate Illumina-compatible amplicons is through a two-step PCR reaction, where you use your primer of interest fused to a general Illumina adapter; a second barcoding reaction then attaches a barcode sequence and an additional adapter.

Forward Nested Primer Sequence
5'- ACACTCTTTCCCTACACGACGCTCTTCCGATCT-template-specific-primer -3'

Reverse Nested Primer Sequence
5'- GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT-template-specific-primer -3'

The second barcoding reaction then adds a unique barcode [i5] to the 5’ adapter and a barcode [i7] to the 3’ adapter sequence. Thus the final construct schematically looks like this:

5' - Adapter1-[i5]-Adapter2-For_Primer:amplicon:Rev_Primer-Adapter3-[i7]-Adapter4 - 3'

The machine then performs 4 different sequencing reactions: 1) Read 1 (sequences from Adapter2 for 300 bp), 2) Index Read 1 (sequences the i5 barcode), 3) Read 2 (sequences in the reverse direction from Adapter3), 4) Index Read 2 (sequences the i7 barcode). The software then strips the adapter sequences, and the reads will look like this:

5' -  For_Primer:amplicon:Rev_Primer - 3'

Illumina software then de-multiplexes the reads based on the index sequences and splits each sample into two files named as follows: <sample name>_<barcode>_L<lane number>_R<read number>_<set number>.fastq.gz

#Example naming after bcl2fastq
Sample1_ATCCTTG_L001_R1_001.fastq.gz
Sample1_TCTGGTA_L001_R2_001.fastq.gz

Unfortunately, sequencing centers do not distribute data to customers in the same format. I’ve seen data in at least 3 formats (there are probably more - let me know if your data is in a format that AMPtk cannot process), thus there are 3 Illumina-based commands in AMPtk: amptk illumina processes a folder of demultiplexed PE reads, amptk illumina2 processes data that contain an additional 5’ unique index sequence, and amptk illumina3 processes data as R1, R2, I1 (where index reads are in a separate file).

AMPtk pre-processing of Illumina runs the same steps as in 454/Ion Torrent with some additional steps needed, namely merging of PE reads and filtering phiX spike-in. The general workflow for Illumina reads is:

  1. Merge PE reads (use USEARCH or VSEARCH)
  2. rename sequence header with sample name
  3. filter reads that are phiX (USEARCH)
  4. find forward and reverse primers (pay attention to --require_primer argument)
  5. remove (trim) primer sequences
  6. if sequence is longer than --trim_len, truncate sequence

Some examples of how to issue these commands:

#simple folder of PE MiSeq data
amptk illumina -i miseq_folder/ -o mydata -f ITS1-F -r ITS2

#same folder, however, keep only full length sequences
amptk illumina -i miseq_folder/ -o mydata -f ITS1-F -r ITS2 --full_length

#process folder of data, however, custom sequencing primer used
amptk illumina -i miseq_folder/ -o mydata -f ITS1-F -r ITS2 --require_primer off

#data is in R1, R2, I1 format
amptk illumina3 --forward data_R1.fastq.gz --reverse data_R2.fastq.gz --index data_I1.fastq.gz \
    -m mapping_file.txt -o mydata --fwd_primer ITS1-F --rev_primer ITS2
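
If your reads instead carry an inline barcode at the 5’ end of R1, the equivalent call uses amptk illumina2 (a sketch built from the amptk illumina2 arguments listed later in this document; file names are placeholders):

#PE data with inline 5' barcodes in R1
amptk illumina2 -i mydata_R1.fastq.gz --reverse mydata_R2.fastq.gz \
    --barcode_fasta barcodes.fa -f fITS7 -r ITS4 -o mydata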

Processing SRA Data

Amplicon data from the NCBI Sequence Read Archive (SRA) is typically provided as a single FASTQ file per sample, i.e. an experiment with 48 samples will have 48 SRA archives, which may be composed of PE or SE reads depending on the platform.

For example, BioProject PRJNA400449 is composed of 24 samples. To download this project using SRApy you could run the following:

#download all SRA runs from the BioProject PRJNA400449
get-project-sras.py -e myemail@address.edu -d output_folder -p 400449 -F {name}.sra

#then convert SRA to FASTQ using sra-tools fastq-dump
cd output_folder; for file in *.sra; do fastq-dump -F --split-files $file; done; cd ..

Since these files are PE Illumina format, you can use the standard amptk illumina command:

amptk illumina -i output_folder -f ITS1-F -r ITS4 --require_primer off -o mydata

On the other hand, Roche 454 and Ion Torrent data from the SRA are contained in a single FASTQ file per sample -> notably different from the raw data, which is a single multiplexed file with barcode sequences intact. Because the unique barcode sequence is typically missing from SRA-downloaded files, you can get these data into AMPtk using the amptk SRA command, which takes a folder of single FASTQ files as input. For example:

#download all SRA runs from the BioProject PRJNA305905 (Ion Torrent Run)
get-project-sras.py -e myemail@address.edu -d output_folder -p 305905 -F {name}.sra

#then convert SRA to FASTQ using sra-tools fastq-dump
cd output_folder; for file in *.sra; do fastq-dump -F $file; done; cd ..

Now you can demultiplex using the amptk SRA command:

amptk SRA -i output_folder -f AGGGTGCTCCTCACAGCCCTGTG -r TGTCCCCGCRGCRMATTTCCTG -o mydata

AMPtk Clustering

Operational Taxonomic Unit (OTU) clustering has traditionally been done using a variety of methods; most simply, clusters are generated from some sort of alignment between sequences and are defined at a percent identity threshold, e.g. 97%. More recently, the DADA2 denoising algorithm provides a clustering-independent method that attempts to “correct” or “denoise” each sequence to an error-corrected sequence using statistical modeling of sequencing errors. The clustering method you choose is highly dependent on your research question and on which amplicons you are sequencing.

All clustering scripts in AMPtk take the output of pre-processing (amptk ion, amptk illumina, etc.), which is named output.demux.fq.gz. Once a run has been pre-processed, you can easily run several different clustering algorithms from the same input dataset. The first step of all AMPtk clustering scripts is to quality trim the reads using expected-errors filtering, where the expected errors of a read are the accumulation of the error probabilities of each base in the read. Thus, longer reads accumulate higher expected errors than shorter reads. The quality-trimmed reads are then pushed through each of the algorithms, which output a set of OTUs. In AMPtk, all OTU tables are generated by mapping the raw reads (no quality trimming done) to the OTUs (note you can change this parameter, but you probably shouldn’t).
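
For intuition, the expected errors (EE) of a read are the sum of its per-base error probabilities, where a base with Phred score Q has error probability 10^(-Q/10):

#EE = sum of per-base error probabilities, P = 10^(-Q/10)
#a 300 bp read at uniform Q30 (P = 0.001): EE = 300 x 0.001 = 0.3 -> passes the default 1.0 cutoff
#the same read at uniform Q20 (P = 0.01):  EE = 300 x 0.01  = 3.0 -> fails the default 1.0 cutoff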

UPARSE clustering

The most commonly used method in OTU generation pipelines is UPARSE. You can run UPARSE clustering in AMPtk as follows:

#run default settings
amptk cluster -i mydata.demux.fq.gz -o mydata

#run with increased expected errors filter, perhaps you have really long reads
amptk cluster -i mydata.demux.fq.gz -e 2.0 -o mydata

This script will generate a log file as well as OTUs named mydata.cluster.otus.fa and an OTU table named mydata.otu_table.txt.

DADA2 denoising

DADA2 can also be run from AMPtk; AMPtk utilizes a slightly modified version of DADA2 in order to retain the AMPtk data structure.

#run DADA2 on illumina dataset
amptk dada2 -i mydata.demux.fq.gz --platform illumina -o mydata

#run DADA2 on Ion Torrent dataset
amptk dada2 -i mydata.demux.fq.gz --platform ion -o mydata

You will notice that DADA2 produces 2 sets of output: the first is based on the inferred sequences (iSeqs), also called exact sequence variants (ESVs), and the second set of output comes from clustering the iSeqs into biological OTUs using UCLUST. Depending on your research question, one or both of these outputs may be useful.

UNOISE2 denoising

UNOISE2 is a denoising algorithm run in USEARCH9 that is similar in concept to DADA2. Likewise, the output from UNOISE2 has the same structure as that of DADA2 -> you get multi-fasta files and OTU tables for both the iSeqs and the clustered iSeqs.

#run UNOISE2
amptk unoise2 -i mydata.demux.fq.gz -o mydata

UNOISE3 denoising

UNOISE3 is the denoising algorithm run in USEARCH10 (apparently the successor of UNOISE2). Likewise, the output from UNOISE3 has the same structure as that of DADA2 and UNOISE2, although “it works better”… Note that for unoise3 you must have USEARCH10 installed.

#run UNOISE3
amptk unoise3 -i mydata.demux.fq.gz -o mydata

AMPtk OTU Table Filtering

An NGS sequencing artefact where reads are assigned to the wrong barcode sample has been reported several times in the literature. It’s been referred to as “index-hopping”, “barcode crossover”, “contamination”, etc, here I refer to this phenomenon as “index-bleed” -> where a small percentage of reads bleed into other samples. This phenomenon was first reported on Roche 454 platforms and more recently has been reported on Illumina. The mechanism of index-bleed has yet to be determined, however, it seems to happen during the amplification process of NGS sequencing (emulsion PCR on 454/Ion Torrent or cluster generation on Illumina). Regardless of the mechanism, the impacts of low level index-bleed for downstream community ecology statistics could be large, especially if presence/absence metrics are used. Coupled with countless examples in the literature that show that read abundances in NGS amplicon experiments do not represent biological abundances, it is important to come up with a solution to deal with index-bleed.

Using spike-in control mock communities is a way to measure the sequencing artefacts as well as the bioinformatic steps used in a pipeline. Spike-in controls allow you to see whether the number of OTUs generated from a run/software makes sense with what you put in. It is well known that PCR amplification biases sample abundances and is unpredictable, in the sense that the amplicons in a complex metabarcoding mixture do not all amplify with the same efficiency. AMPtk uses spike-in mock communities to measure the degree of index-bleed in a sequencing run and then conservatively applies that threshold to remove read counts that are within the range of index-bleed from an OTU table. The filtering is done on a per-OTU basis, meaning that low-abundance OTUs are not indiscriminately dropped solely because they did not PCR amplify or sequence well.

In AMPtk, this process is done using the amptk filter command, which takes an OTU table and OTUs in FASTA format (i.e. output from any of the amptk clustering commands).

Usage:       amptk filter <arguments>
version:     1.5.3

Description: Script filters OTU table generated from the `amptk cluster` command and should
            be run on all datasets to combat barcode-switching or index-bleed (as high as
            2% in MiSeq datasets, ~ 0.3% in Ion PGM datasets).  This script works best when
            a spike-in control sequence is used, e.g. Synthetic Mock, although a mock is not required.

Required:    -i, --otu_table     OTU table
            -f, --fasta         OTU fasta

Optional:    -o, --out           Base name for output files. Default: use input basename
            -b, --mock_barcode  Name of barcode of mock community (Recommended)
            -m, --mc            Mock community FASTA file. Required if -b passed. [synmock,mock1,mock2,mock3,other]
            -c, --calculate     Calculate index-bleed options. Default: all [in,all]
            -d, --drop          Sample(s) to drop from OTU table. (list, separate by space)
            --negatives         Negative sample names. (list, separate by space)
            --ignore            Ignore sample(s) during index-bleed calc (list, separate by space)

Filtering    -n, --normalize     Normalize reads to number of reads per sample [y,n]. Default: y
            -p, --index_bleed   Filter index bleed between samples (percent). Default: auto (calculated from -b,--mock_barcode)
            -t, --threshold     Number to use for establishing read count threshold. Default: max [max,sum,top5,top10,top25]
            -s, --subtract      Threshold to subtract from all OTUs (any number or auto). Default: 0
            --delimiter         Delimiter of OTU tables. Default: tsv  [csv, tsv]
            --min_reads_otu     Minimum number of reads for valid OTU from whole experiment. Default: 2
            --min_samples_otu   Minimum number of samples for valid OTU from whole experiment. Default: 1
            --col_order         Column order (separate by space). Default: sort naturally
            --keep_mock         Keep Spike-in mock community. Default: False
            --show_stats        Show OTU stats on STDOUT
            --debug             Keep intermediate files.
            -u, --usearch       USEARCH executable. Default: usearch9

The steps of amptk filter are:

  1. Maps OTU sequences to those provided from the mock community (-m, --mc argument)
  2. Parses the OTU table, normalizing the read counts for each sample (optional, but recommended)
  3. Next it calculates the number of reads that bleed into the mock community and the number of reads that bleed from the mock community into the rest of the dataset. The default setting -c all is designed for a synthetic mock; if you have a biological mock (i.e. real OTUs that might be in your samples) then you can pass the -c in option to only look at index-bleed into the mock community sample.
  4. Then the index-bleed threshold is calculated for each OTU separately based on the -t, --threshold value, and read counts less than the calculated threshold are set to 0.
  5. The final output then is the filtered OTU table containing actual read counts (normalization is only used for index-bleed filtering).

If you do not have a spike-in mock community in your run, you can still use amptk filter by providing an index-bleed percentage (-p, --index_bleed), which will override the automated calculation. A value of -p 0.005 (0.5%) is typically able to remove the effects of index-bleed in most Illumina MiSeq datasets.
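
For example (file names are placeholders; all arguments are described in the help above):

#no mock community: apply a fixed 0.5% index-bleed threshold
amptk filter -i mydata.otu_table.txt -f mydata.cluster.otus.fa -p 0.005

#biological mock community in sample "mock": only calculate bleed into the mock
amptk filter -i mydata.otu_table.txt -f mydata.cluster.otus.fa -b mock -m mock2 -c in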

AMPtk Taxonomy

AMPtk can assign taxonomy using BLAST, the RDP Classifier, Global Alignment, UTAX, and/or SINTAX. The default method in AMPtk is ‘hybrid’ taxonomy assignment, which calculates a consensus LCA (last common ancestor) taxonomy based on the results of Global Alignment (USEARCH/VSEARCH), UTAX, and SINTAX. This method is a conservative estimate of taxonomy because of the LCA algorithm; however, it is more robust than any of the searches used individually. The LCA method is applied twice: first to find the LCA of identical Global Alignment hits, and then again to compare the Global Alignment taxonomy to the Bayesian classifiers. If there is no conflict between the taxonomies from each method, the taxonomy string with the most information is retained.

The hybrid taxonomy algorithm works as follows:

  1. Global Alignment is performed on a reference database. The method captures all of the top hits from the database that have Percent Identity higher than --usearch_cutoff (Default: 0.7). The top hits that have the most taxonomy information (levels) are retained and LCA is run on these hits.
  2. UTAX Classifier is run generating a taxonomy string that scores better than --utax_cutoff (Default: 0.8). v1.5.1 and greater do not run UTAX
  3. SINTAX Classifier is run generating a taxonomy string that scores better than --sintax_cutoff (Default: 0.8).
  4. The Bayesian Classifier results are compared, and the method that produces the most levels of taxonomy above the threshold is retained.
  5. If the best Global Alignment result is less than 97% identical, then the final taxonomy string defaults to the best Bayesian Classifier result (UTAX or SINTAX).
  6. If the best Global Alignment result is greater than 97% identical then that hit is retained. A final LCA algorithm is applied to the Global Alignment hit and the Best Bayesian Classifier hit.
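
The cutoffs referenced in these steps can be adjusted on the command line; a sketch using the cutoff arguments named above (file names are placeholders):

#require higher scores from global alignment and the SINTAX classifier
amptk taxonomy -f input.cluster.otus.fa -i input.otu_table.txt -d ITS2 \
    --usearch_cutoff 0.9 --sintax_cutoff 0.9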

The resulting taxonomy strings have the following format:

#OTUID  taxonomy
OTU1    GSL|100.0|KP776995;k:Fungi,p:Ascomycota,c:Sordariomycetes,o:Coniochaetales,f:Coniochaetaceae,g:Coniochaeta
OTU2    US|0.9389|HF674734;k:Fungi,p:Ascomycota
OTU3    GS|100.0|AY465463;k:Fungi,p:Ascomycota,c:Eurotiomycetes,o:Chaetothyriales,f:Herpotrichiellaceae,g:Phialophora
OTU4    GS|98.0|KX222656;k:Fungi,p:Ascomycota,c:Sordariomycetes,o:Sordariales,f:Lasiosphaeriaceae
OTU5    US|1.0000|FM200433;k:Fungi
OTU6    GS|100.0|KF850373;k:Fungi,p:Ascomycota,c:Sordariomycetes,o:Sordariales,f:Chaetomiaceae,g:Trichocladium,s:Trichocladium opacum
OTU7    GDL|100.0|KF660573;k:Fungi,p:Ascomycota,c:Dothideomycetes,o:Pleosporales
OTU8    GS|100.0|KX611531;k:Fungi,p:Ascomycota,c:Leotiomycetes,o:Helotiales,f:Helotiaceae,g:Varicosporium,s:Varicosporium elodeae
OTU9    GSL|100.0|JN655624;k:Fungi,p:Ascomycota,c:Sordariomycetes,o:Hypocreales,g:Ilyonectria
OTU10   SS|0.9000|KT197416;k:Fungi,p:Ascomycota,c:Leotiomycetes,o:Helotiales

Taxonomy information follows the ; in the string. The pre-taxonomy identifiers are coded to mean the following:

#OTUID  taxonomy
OTU1    Method|Pident/Score|Ref_DB ID;

Coding for the different methods:

  • GS: Global Alignment (G), Bayesian and Global Alignment Taxonomy is the Same (S)
  • GSL: Global Alignment (G), Bayesian and Global Alignment Taxonomy is the Same (S), LCA algorithm used (L)
  • GDL: Global Alignment (G), Bayesian and Global Alignment Taxonomy Discrepancy (D), LCA algorithm used (L)
  • US: UTAX Classifier (U), Bayesian and Global Alignment Taxonomy is the Same (S)
  • SS: SINTAX Classifier (S), Bayesian and Global Alignment Taxonomy is the Same (S)

Assigning Taxonomy

A typical command to assign taxonomy in AMPtk looks like this:

amptk taxonomy -i input.otu_table.txt -f input.cluster.otus.fa -m input.mapping_file.txt -d ITS2

This command will run the default hybrid method and will use the ITS2 database (-d ITS2). The output of this command will be a tab-delimited taxonomy file input.taxonomy.txt, an OTU table containing taxonomy as the last column input.otu_table.taxonomy.txt, a multi-fasta file with taxonomy in the OTU headers input.otus.taxonomy.fa, and a BIOM file containing taxonomy and metadata input.biom.

Taxonomy Databases

AMPtk is packaged with 4 reference databases: fungal ITS, fungal LSU, bacterial 16S, and arthropod/chordate mtCOI. These pre-built databases are updated when their source reference databases release new versions, and they can be downloaded/installed as follows:

#install all databases
amptk install -i ITS 16S LSU COI

#install only ITS database
amptk install -i ITS

#update database
amptk install -i ITS 16S LSU COI --force

Users can also build their own custom databases, the largest obstacle being formatting the taxonomy headers of the reference sequences. Because AMPtk uses the UTAX/SINTAX Bayesian classifiers, it uses the same taxonomy header format, which contains the levels Kingdom (k), Phylum (p), Class (c), Order (o), Family (f), Genus (g), and Species (s):

>BOLD:ACI6695;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Coleoptera,f:Elateridae,g:Nipponoelater,s:Nipponoelater babai
>S004604051;tax=k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Hymenochaetales,f:Hymenochaetaceae,g:Inonotus,s:Sanghuangporus zonatus
>S004127186;tax=k:Fungi,p:Ascomycota
>S004061552;tax=k:Fungi,p:Ascomycota,c:Eurotiomycetes,s:Pyrenula sanguinea

Note that unknown levels of taxonomy can be left out, but placeholders like unclassified, unknown, or incertae_sedis should not be used -> these levels of taxonomy are not informative and will produce undesired results.

Taxonomy databases are built with the amptk database command. This command contains parsers for some known fasta header formats; however, generating custom databases will likely require some scripting to reformat the fasta headers. The pre-built databases for AMPtk were constructed as follows:
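
As a minimal sketch, assuming your FASTA headers already follow the k:,p:,... format shown above (the input file name is a placeholder):

#build and install a SINTAX database from pre-formatted custom references
amptk database -i my_custom_refs.fasta -o CUSTOM --format off \
    --skip_trimming --create_db sintax --install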

Fungal ITS DB

These databases were created from UNITE v8.0 by downloading the General FASTA releases and the full UNITE+INSD release from the UNITE website. For the general FASTA releases, the ‘developer’ fasta files are used. The taxonomy information is then reformatted and the databases produced as follows:

#Create full-length ITS USEARCH database: convert taxonomy and build the DB
amptk database -i UNITE_public_all_02.02.2019.fasta -f ITS1-F -r ITS4 \
    --primer_required none -o ITS --create_db usearch --install --source UNITE:8.0

#create SINTAX database
amptk database -i sh_general_release_dynamic_all_02.02.2019_dev.fasta \
    -o ITS_SINTAX --create_db sintax -f ITS1-F -r ITS4 --derep_fulllength \
    --install --source UNITE:8.0 --primer_required none

#Create UTAX Databases
amptk database -i sh_general_release_dynamic_all_02.02.2019_dev.fasta  \
    -o ITS_UTAX --create_db utax -f ITS1-F -r ITS4 \
    --derep_fulllength --install --source UNITE:8.0 --primer_required none

amptk database -i sh_general_release_dynamic_all_02.02.2019_dev.fasta \
    -o ITS1_UTAX -f ITS1-F -r ITS2 --primer_required rev --derep_fulllength \
    --create_db utax --install --subsample 65000 --source UNITE:8.0

amptk database -i sh_general_release_dynamic_all_02.02.2019_dev.fasta \
    -o ITS2_UTAX --create_db utax -f fITS7 -r ITS4 --derep_fulllength \
     --install --source UNITE:8.0 --primer_required for

Arthropod/Chordate mtCOI DB

These data were pulled from the BOLDv4 database. Since most studies using mtCOI regions are interested in identifying insects in the diets of animals, the BOLD database was queried as follows: all Chordata sequences were downloaded by querying the BIN database with the search term Chordata, and similarly the arthropods were downloaded by querying the BIN database with the search term Arthropoda. All data were then downloaded as TSV output.

The TSV output files (~ 6 GB) were then each formatted using the following method, which reformats the taxonomy information, pulls sequences that are annotated in BINs, and then clusters the sequences in each BIN at 99%.

Since it can literally take days to download the arthropod dataset, pre-downloaded copies of the chordate and arthropod data are available if you’d like to experiment.

#reformat taxonomy
bold2utax.py -i chordates.bold.02092019.txt -o chordates --cluster 99 --drop_suppressed
bold2utax.py -i arthropods.bold.02092019.txt -o arthropods --cluster 99 --drop_suppressed

#combine datasets for usearch
cat arthropods.bold-reformated.fa chordates.bold-reformated.fa > arth-chord.bold-reformated.fasta

#generate global alignment database
amptk database -i arth-chord.bold-reformated.fasta -f LCO1490 -r mlCOIintR --primer_required none \
    --derep_fulllength --format off --primer_mismatch 4 -o COI --min_len 200 --create_db usearch \
    --install --source BOLD:20190219

The second set of output files from bold2utax.py are named with a .BIN-consensus.fa suffix and are the result of the 99% clustering for each BIN. We combine these for the two datasets and then use them to generate the SINTAX and UTAX databases.

#combine datasets
cat arthropods.BIN-consensus.fa chordates.BIN-consensus.fa > arth-chord.bold.BIN-consensus.fasta

#generate SINTAX database
amptk database -i arth-chord.bold.BIN-consensus.fasta -f LCO1490 -r mlCOIintR --primer_required none \
    --derep_fulllength --format off --primer_mismatch 4 -o COI_SINTAX --min_len 200 --create_db sintax \
    --install --source BOLD:20190219

#generate UTAX database; subsample to avoid memory issues with 32-bit usearch, and require a rev primer match here
amptk database -i arth-chord.bold.BIN-consensus.fasta -f LCO1490 -r mlCOIintR --primer_required rev \
    --derep_fulllength --format off --subsample 00000 --primer_mismatch 4 -o COI_UTAX --min_len 200 \
    --create_db utax --install --source BOLD:20190219

LSU database

The fungal 28S database (LSU) was downloaded from RDP. The sequences were then converted into AMPtk databases as follows:

amptk database -i RDP_v8.0_fungi.fa -o LSU --format rdp2utax --primer_required none \
    --skip_trimming --create_db usearch --derep_fulllength --install --source RDP:8

amptk database -i RDP_v8.0_fungi.fa -o LSU_SINTAX --format rdp2utax --primer_required none \
    --skip_trimming --create_db sintax --derep_fulllength --install --source RDP:8

amptk database -i RDP_v8.0_fungi.fa -o LSU_UTAX --format rdp2utax --primer_required none \
    --skip_trimming --create_db utax --derep_fulllength --install --source RDP:8 --subsample 45000

To generate a training set for UTAX, the sequences were first dereplicated, and clustered at 97% to get representative sequences for training. This training set was then converted to a UTAX database:

amptk database -i fungi.trimmed.fa -o LSU_UTAX --format off \
    --skip_trimming --create_db utax --keep_all --install

16S database

This database is downloaded from R. Edgar’s website and then formatted for AMPtk. Note there is room for substantial improvement here, I just don’t typically work on 16S - so please let me know if you have suggestions on what to do here. Here I reformatted the “domain” taxonomy level to “kingdom” for simplicity (even though I know it is taxonomically incorrect).

amptk database -i rdp_16s_v16_sp.kingdom.fa -o 16S --format off --create_db usearch \
    --skip_trimming --install --primer_required none --derep_fulllength

amptk database -i rdp_16s_v16_sp.kingdom.fa -o 16S_SINTAX --format off --create_db sintax \
    -f 515FB -r 806RB --install --primer_required for --derep_fulllength

amptk database -i rdp_16s_v16_sp.kingdom.fa -o 16S_UTAX --format off --create_db utax \
    -f 515FB -r 806RB --install --primer_required for --derep_fulllength

Checking Installed Databases

A simple amptk info command will display the software versions as well as which taxonomy databases have been installed.

amptk info

------------------------------
Running AMPtk v 1.3.0
------------------------------
Taxonomy Databases Installed:
------------------------------
 DB_name         DB_type              FASTA originated from                Fwd Primer Rev Primer Records     Date
 ITS.udb         usearch                   UNITE_public_01.12.2017.fasta   ITS1-F      ITS4     532025  2018-05-01
 ITS1_UTAX.udb   utax  sh_general_release_dynamic_s_01.12.2017_dev.fasta   ITS1-F      ITS2      57293  2018-05-01
 ITS2_UTAX.udb   utax  sh_general_release_dynamic_s_01.12.2017_dev.fasta    fITS7      ITS4      55962  2018-05-01
 ITS_UTAX.udb    utax    sh_general_release_dynamic_01.12.2017_dev.fasta   ITS1-F      ITS4      30580  2018-05-01
------------------------------

AMPtk Commands

A description for all AMPtk commands.

AMPtk wrapper script

AMPtk is a series of Python scripts that are launched from a Python wrapper script. Each command has a help menu, which you can print to the terminal by issuing the command without any arguments; i.e. running amptk alone yields the following.

$ amptk

Usage:       amptk <command> <arguments>
version:     1.3.0

Description: AMPtk is a package of scripts to process NGS amplicon data.
             Dependencies:  USEARCH v9.1.13 and VSEARCH v2.2.0

Process:     ion         pre-process Ion Torrent data
             illumina    pre-process folder of de-multiplexed Illumina data
             illumina2   pre-process PE Illumina data from a single file
             illumina3   pre-process PE Illumina + index reads (i.e. EMP protocol)
             454         pre-process Roche 454 (pyrosequencing) data
             SRA         pre-process single FASTQ per sample data (i.e. SRA data)

Clustering:  cluster     cluster OTUs (using UPARSE algorithm)
             dada2       dada2 denoising algorithm (requires R, dada2, ShortRead)
             unoise2     UNOISE2 denoising algorithm
             unoise3     UNOISE3 denoising algorithm
             cluster_ref closed/open reference based clustering (EXPERIMENTAL)

Utilities:   filter      OTU table filtering
             lulu        LULU amplicon curation of OTU table
             taxonomy    Assign taxonomy to OTUs
             show        show number of reads per barcode from de-multiplexed data
             select      select reads (samples) from de-multiplexed data
             remove      remove reads (samples) from de-multiplexed data
             sample      sub-sample (rarify) de-multiplexed reads per sample
             drop        Drop OTUs from dataset
             stats       Hypothesis test and NMDS graphs (EXPERIMENTAL)
             summarize   Summarize Taxonomy (create OTU-like tables and/or stacked bar graphs)
             funguild    Run FUNGuild (annotate OTUs with ecological information)
             meta        pivot OTU table and append to meta data
             heatmap     Create heatmap from OTU table
             SRA-submit  De-multiplex data and create meta data for NCBI SRA submission

Setup:       install     Download/install pre-formatted taxonomy DB. Only need to run once.
             database    Format Reference Databases for Taxonomy
             info        List software version and installed databases
             primers     List primers hard-coded in AMPtk. Can use in pre-processing steps.
             version     List version
             citation    List citation

AMPtk Pre-Processing

amptk ion

Script that demultiplexes Ion Torrent data. Input can be either an unaligned BAM file or a FASTQ file. The IonXpress 1-96 barcodes are hard-coded into AMPtk and are the default setting for providing barcode sequences to the script. Alternatively, you can provide a --barcode_fasta file containing the barcodes used, or a QIIME-like mapping file. For file formats see the AMPtk file formats section above, and for more information see AMPtk Pre-Processing.

Usage:       amptk ion <arguments>
version:     1.3.0

Description: Script processes Ion Torrent PGM data for AMPtk clustering.  The input to this script
             should be a FASTQ file obtained from the Torrent Server analyzed with the
             `--disable-all-filters` flag to the BaseCaller.  This script does the following:
             1) finds Ion barcode sequences, 2) relabels headers with appropriate barcode name,
             3) removes primer sequences, 4) trim/pad reads to a set length.

Arguments:   -i, --fastq,--bam   Input BAM or FASTQ file (Required)
             -o, --out           Output base name. Default: out
             -m, --mapping_file  QIIME-like mapping file
             -f, --fwd_primer    Forward primer sequence. Default: fITS7
             -r, --rev_primer    Reverse primer sequence Default: ITS4
             -b, --barcodes      Barcodes used (list, e.g: 1,3,4,5,20). Default: all
             -n, --name_prefix   Prefix for re-naming reads. Default: R_
             -l, --trim_len      Length to trim/pad reads. Default: 300
             -p, --pad           Pad reads with Ns if shorter than --trim_len. Default: off [on,off]
             --min_len           Minimum length read to keep. Default: 100
             --full_length       Keep only full length sequences.
             --barcode_fasta     FASTA file containing barcodes. Default: pgm_barcodes.fa
             --barcode_mismatch   Number of mismatches in barcode to allow. Default: 0
             --primer_mismatch   Number of mismatches in primers to allow. Default: 2
             --cpus              Number of CPUs to use. Default: all
             --mult_samples      Combine multiple chip runs, name prefix for chip

amptk illumina

Script for demultiplexing Illumina PE data that has been delivered from the sequencing center as a folder of PE FASTQ files, one set for each sample. More information is in the Processing Illumina Data section above.

Usage:       amptk illumina <arguments>
version:     1.3.0

Description: Script takes a folder of Illumina MiSeq data that is already de-multiplexed
             and processes it for clustering using AMPtk.  The default behavior is to:
             1) merge the PE reads using USEARCH, 2) find and trim primers, 3) rename reads
             according to sample name, 4) trim/pad reads to a set length.

Arguments:   -i, --fastq         Input folder of FASTQ files (Required)
             -o, --out           Output folder name. Default: amptk-data
             -m, --mapping_file  QIIME-like mapping file
             -f, --fwd_primer    Forward primer sequence. Default: fITS7
             -r, --rev_primer    Reverse primer sequence Default: ITS4
             -l, --trim_len      Length to trim/pad reads. Default: 300
             -p, --pad           Pad reads with Ns if shorter than --trim_len. Default: off [on,off]
             --min_len           Minimum length read to keep. Default: 100
             --full_length       Keep only full length sequences.
             --reads             Paired-end or forward reads. Default: paired [paired, forward]
             --read_length       Illumina Read length (250 if 2 x 250 bp run). Default: auto detect
             --rescue_forward    Rescue Forward Reads if PE do not merge, e.g. long amplicons. Default: on [on,off]
             --require_primer    Require the Forward primer to be present. Default: on [on,off]
             --primer_mismatch   Number of mismatches in primers to allow. Default: 2
             --barcode_mismatch   Number of mismatches in barcode to allow. Default: 1
             --cpus              Number of CPUs to use. Default: all
             --cleanup           Remove intermediate files.
             --merge_method      Software to use for PE merging. Default: usearch [usearch,vsearch]
             -u, --usearch       USEARCH executable. Default: usearch9

amptk illumina2

This script is for demultiplexing Illumina data that is delivered as either a single FASTQ file or PE FASTQ files where the read layout contains unique barcode sequences at the 5’ or 3’ end of the amplicons. More information is in the Processing Illumina Data section above.

Usage:       amptk illumina2 <arguments>
version:     1.3.0

Description: Script takes Illumina data that is not de-multiplexed and has read structure
             similar to Ion/454 such that the reads are <barcode><fwd_primer>Read<rev_primer> for
             clustering using AMPtk.  The default behavior is to: 1) find barcodes/primers,
             2) relabel headers and trim barcodes/primers, 3) merge the PE reads,
             4) trim/pad reads to a set length.  This script can handle dual barcodes
             (3' barcodes using the --reverse_barcode option or mapping file).

Arguments:   -i, --fastq            Illumina R1 (PE forward) reads (Required)
             --reverse              Illumina R2 (PE reverse) reads.
             -o, --out              Output base name. Default: illumina2
             -m, --mapping_file     QIIME-like mapping file
             -f, --fwd_primer       Forward primer sequence. Default: fITS7
             -r, --rev_primer       Reverse primer sequence Default: ITS4
             -n, --name_prefix      Prefix for re-naming reads. Default: R_
             -l, --trim_len         Length to trim/pad reads. Default: 300
             -p, --pad              Pad reads with Ns if shorter than --trim_len. Default: off [on,off]
             --min_len              Minimum length read to keep. Default: 100
             --barcode_fasta        FASTA file containing barcodes.
             --reverse_barcode      FASTA file containing R2 barcodes.
             --barcode_mismatch     Number of mismatches in barcode to allow. Default: 0
             --barcode_not_anchored Barcodes are not anchored to start of read.
             --full_length          Keep only full length sequences.
             --primer_mismatch      Number of mismatches in primers to allow. Default: 2
             --merge_method         Software to use for PE merging. Default: usearch [usearch,vsearch]
             --cpus                 Number of CPUs to use. Default: all
             -u, --usearch          USEARCH executable. Default: usearch9
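
For example, a hypothetical run with 5' barcoded PE reads might be demultiplexed like this (file names are illustrative):

#demultiplex barcoded PE reads using a multi-FASTA file of barcodes
amptk illumina2 -i run1_R1.fastq.gz --reverse run1_R2.fastq.gz -o run1 -f fITS7 -r ITS4 --barcode_fasta barcodes.fa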

amptk illumina3

This script demultiplexes Illumina PE data that is delivered as 3 files: forward reads (R1), reverse reads (R2), and then index reads (I3). More information is here.

Usage:       amptk illumina3/emp <arguments>
version:     1.3.0

Description: Script takes PE Illumina reads, Index reads, mapping file and processes
             data for clustering/denoising in AMPtk.  The default behavior is to:
             1) find and trim primers, 2) merge the PE reads, 3) filter for Phix,
             4) rename reads according to sample name, 5) trim/pad reads.

Arguments:   -f, --forward       FASTQ R1 (forward) file (Required)
             -r, --reverse       FASTQ R2 (reverse) file (Required)
             -i, --index         FASTQ I3 (index) file (Required)
             -m, --mapping_file  QIIME-like mapping file.
             -l, --trim_len      Length to trim/pad reads. Default: 300
             -p, --pad           Pad reads with Ns if shorter than --trim_len. Default: off [on,off]
             -o, --out           Output folder name. Default: amptk-data
             --fwd_primer        Forward primer sequence
             --rev_primer        Reverse primer sequence
             --min_len           Minimum length read to keep. Default: 100
             --read_length       Illumina Read length (250 if 2 x 250 bp run). Default: auto detect
             --rescue_forward    Rescue Forward Reads if PE do not merge, e.g. long amplicons. Default: on [on,off]
             --barcode_fasta     Multi-FASTA file of barcodes.
             --primer_mismatch   Number of mismatches in primers to allow. Default: 2
             --barcode_mismatch  Number of mismatches in index (barcodes) to allow. Default: 2
             --barcode_rev_comp  Reverse complement barcode sequences in mapping file.
             --merge_method      Software to use for PE merging. Default: usearch [usearch,vsearch]
             --cpus              Number of CPUs to use. Default: all
             --cleanup           Remove intermediate files.
             -u, --usearch       USEARCH executable. Default: usearch9
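
For example, a 3-file run might be processed like this (file names are illustrative):

#demultiplex PE reads using the separate index read file and a mapping file
amptk illumina3 -f run_R1.fastq.gz -r run_R2.fastq.gz -i run_I1.fastq.gz -m mapping_file.txt --fwd_primer fITS7 --rev_primer ITS4 -o mydata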

amptk 454

Script for demultiplexing Roche 454 data. Input requirements are a 454 run in SFF, FASTQ, or FASTA+QUAL format as well as a multi-FASTA file containing barcodes used. More information is here.

Usage:       amptk 454 <arguments>
version:     1.3.0

Description: Script processes Roche 454 data for AMPtk clustering.  The input to this script
             should be either a SFF file, FASTA+QUAL files, or FASTQ file.  This script does
             the following: 1) finds barcode sequences, 2) relabels headers with appropriate
             barcode name, 3) removes primer sequences, 4) trim/pad reads to a set length.

Arguments:   -i, --sff, --fasta  Input file (SFF, FASTA, or FASTQ) (Required)
             -q, --qual          QUAL file (Required if -i is FASTA).
             -o, --out           Output base name. Default: out
             -m, --mapping_file  QIIME-like mapping file
             -f, --fwd_primer    Forward primer sequence. Default: fITS7
             -r, --rev_primer    Reverse primer sequence. Default: ITS4
             -n, --name_prefix   Prefix for re-naming reads. Default: R_
             -l, --trim_len      Length to trim/pad reads. Default: 250
             -p, --pad           Pad reads with Ns if shorter than --trim_len. Default: off [on,off]
             --min_len           Minimum length read to keep. Default: 50
             --barcode_fasta     FASTA file containing barcodes. (Required)
             --reverse_barcode   FASTA file containing 3' barcodes. Default: none
             --barcode_mismatch  Number of mismatches in barcode to allow. Default: 0
             --primer_mismatch   Number of mismatches in primers to allow. Default: 2
             --cpus              Number of CPUs to use. Default: all
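
For example, a 454 run in SFF format might be processed like this (file names are illustrative):

#demultiplex a 454 run using the barcodes from the run
amptk 454 -i run1.sff --barcode_fasta barcodes.fa -f fITS7 -r ITS4 -o run454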

amptk SRA

This script is useful for pre-processing data from the NCBI SRA or data that is located in a folder where each sample is contained in a single FASTQ file. Note if you have PE Illumina data that was downloaded from SRA, you can use the amptk illumina script. More information is here.

Usage:       amptk SRA <arguments>
version:     1.3.0

Description: Script takes a folder of FASTQ files in a format you would get from NCBI SRA, i.e.
             there is one FASTQ file for each sample.  Reads will be named according to sample name
             and workflow is 1) find and trim primers, 2) rename reads according to filename,
             and 3) trim/pad reads to a set length (optional).

Arguments:   -i, --fastq         Input folder of FASTQ files (Required)
             -o, --out           Output folder name. Default: amptk-data
             -m, --mapping_file  QIIME-like mapping file
             -f, --fwd_primer    Forward primer sequence. Default: fITS7
             -r, --rev_primer    Reverse primer sequence. Default: ITS4
             -l, --trim_len      Length to trim/pad reads. Default: 250
             -p, --pad           Pad reads with Ns if shorter than --trim_len. Default: off [on,off]
             --min_len           Minimum length read to keep. Default: 50
             --full_length       Keep only full length sequences.
             --require_primer    Require the Forward primer to be present. Default: on [on,off]
             --primer_mismatch   Number of mismatches in primers to allow. Default: 2
             --cpus              Number of CPUs to use. Default: all
             --cleanup           Remove intermediate files.
             -u, --usearch       USEARCH executable. Default: usearch9
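
For example, a folder of per-sample FASTQ files might be processed like this (names are illustrative):

#pre-process one-FASTQ-file-per-sample data, e.g. downloaded from NCBI SRA
amptk SRA -i sra_folder -o mydata -f fITS7 -r ITS4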

AMPtk Clustering

amptk cluster

UPARSE clustering in AMPtk is completed with this command. There is optional reference-based chimera filtering. More information is here.

Usage:       amptk cluster <arguments>
version:     1.3.0

Description: Script is a "wrapper" for the UPARSE algorithm. FASTQ quality trimming via expected
             errors and dereplication are run in vsearch if installed, otherwise AMPtk defaults
             to Python, which allows for the use of datasets larger than 4 GB.
             Chimera filtering and UNOISE are also options.

Arguments:   -i, --fastq         Input FASTQ file (Required)
             -o, --out           Output base name. Default: out
             -e, --maxee         Expected error quality trimming. Default: 1.0
             -p, --pct_otu       OTU Clustering Radius (percent). Default: 97
             -m, --minsize       Minimum size to keep (singleton filter). Default: 2
             --uchime_ref        Run Ref Chimera filtering. Default: off [ITS, LSU, COI, 16S, custom path]
             --map_filtered      Map quality filtered reads back to OTUs. Default: off
             --unoise            Run De-noising pre-clustering (UNOISE). Default: off
             --debug             Keep intermediate files.
             --cpus              Number of CPUs to use. Default: all
             -u, --usearch       USEARCH executable. Default: usearch9
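
For example, the demultiplexed output of pre-processing might be clustered like this (file name is illustrative):

#cluster into OTUs at 97% with reference-based chimera filtering against the ITS database
amptk cluster -i mydata.demux.fq.gz -o mydata --uchime_ref ITS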

amptk dada2

DADA2 infers exact sequence variants (ESVs or iSeqs) by using a statistical error model to correct sequencing errors. AMPtk employs a modified DADA2 workflow that also clusters the iSeqs into biologically meaningful OTUs. More information is here.

Usage:       amptk dada2 <arguments>
version:     1.3.0

Description: Script is a "wrapper" for the DADA2 pipeline.  It will "pick OTUs" based on denoising
             the data for each read predicting the original sequence.  This pipeline is sensitive to
             1 bp differences between sequences. Since most reference databases classify "species"
             at a 97% threshold, the inferred sequences (iSeqs) from DADA2 are then clustered at --pct_otu
             to create OTUs. Both results are saved.  Requires R packages: dada2, ShortRead

Arguments:   -i, --fastq         Input FASTQ file (Required)
             -o, --out           Output base name. Default: dada2
             -m, --min_reads     Minimum number of reads per sample. Default: 10
             -l, --length        Length to trim reads.
             -e, --maxee         Expected error quality trimming. Default: 1.0
             -p, --pct_otu       OTU Clustering Radius (percent). Default: 97
             --platform          Sequencing platform. [ion, illumina, 454]. Default: ion
             --pool              Pool all samples together for DADA2. Default: off
             --uchime_ref        Run Ref Chimera filtering. Default: off [ITS, LSU, COI, 16S, custom path]
             --cpus              Number of CPUs to use. Default: all
             --debug             Keep intermediate files.
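
For example, an Illumina dataset might be denoised like this (file name is illustrative):

#run DADA2 on Illumina data; iSeqs are also clustered into OTUs at --pct_otu
amptk dada2 -i mydata.demux.fq.gz -o mydata --platform illumina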

amptk unoise2

UNOISE2 is a denoising algorithm in USEARCH9 that was built to work in a similar fashion to DADA2, correcting reads instead of clustering them. More information is here.

Usage:       amptk unoise2 <arguments>
version:     1.3.0

Description: Script will run the UNOISE2 denoising algorithm followed by clustering with
             UCLUST to generate OTUs. OTU table is then constructed by mapping reads to
             the OTUs.  Requires USEARCH v9.0.232 or greater.

Arguments:   -i, --fastq         Input FASTQ file (Required)
             -o, --out           Output base name. Default: out
             -e, --maxee         Expected error quality trimming. Default: 1.0
             -m, --minsize       Minimum size to keep for denoising. Default: 8
             -p, --pct_otu       OTU Clustering Radius (percent). Default: 97
             -u, --usearch       Path to USEARCH9. Default: usearch9
             --uchime_ref        Run Ref Chimera filtering. Default: off [ITS, LSU, COI, 16S, custom path]
             --cpus              Number of CPUs to use. Default: all
             --debug             Keep intermediate files.
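
A typical invocation might look like this (file name is illustrative):

#denoise with UNOISE2, then cluster the corrected reads into OTUs
amptk unoise2 -i mydata.demux.fq.gz -o mydata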

amptk unoise3

UNOISE3 is the successor to UNOISE2 and is a denoising algorithm built for the Illumina platform. The author suggests that 454 and Ion Torrent data do not work well with this method. More information is here.

Usage:       amptk unoise3 <arguments>
version:     1.3.0

Description: Script will run the UNOISE3 denoising algorithm followed by clustering with
             UCLUST to generate OTUs. OTU table is then constructed by mapping reads to
             the OTUs.  Requires USEARCH v10.0.240 or greater.

Arguments:   -i, --fastq         Input FASTQ file (Required)
             -o, --out           Output base name. Default: out
             -e, --maxee         Expected error quality trimming. Default: 1.0
             -m, --minsize       Minimum size to keep for denoising. Default: 8
             -p, --pct_otu       OTU Clustering Radius (percent). Default: 97
             -u, --usearch       Path to USEARCH9. Default: usearch9
             --uchime_ref        Run Ref Chimera filtering. Default: off [ITS, LSU, COI, 16S, custom path]
             --cpus              Number of CPUs to use. Default: all
             --debug             Keep intermediate files.
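
A typical invocation might look like this (file name is illustrative):

#denoise Illumina data with UNOISE3 (requires USEARCH v10.0.240 or greater)
amptk unoise3 -i mydata.demux.fq.gz -o mydata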

amptk cluster_ref

This script runs reference-based clustering, or rather maps each unique sequence to a reference database using global alignment. Sequences with no match greater than --id are then classified using UTAX.

Usage:       amptk cluster_ref <arguments>
version:     1.3.0

Description: Script first quality filters reads, dereplicates, and then runs chimera
             filtering.  OTUs are then picked via reference-based (closed) clustering for
             those that are > --id.  The rest of the data can then be clustered via
             de novo UPARSE and then reference clustered using UTAX.  EXPERIMENTAL

Arguments:   -i, --fastq         Input FASTQ file (Required)
             -d, --db            Database [ITS,ITS1,ITS2,16S,LSU,COI,custom]. (Required)
             -o, --out           Output base name. Default: out
             -e, --maxee         Expected error quality trimming. Default: 1.0
             -p, --pct_otu       OTU Clustering Radius (percent). Default: 97
             -m, --minsize       Minimum size to keep (singleton filter). Default: 2
             --id                Percent ID for closed reference clustering. Default: 97
             --utax_db           UTAX formatted DB.
             --utax_level        UTAX Taxonomy level to keep. Default: k [k,p,c,o,f,g,s]
             --utax_cutoff       UTAX confidence value threshold. Default: 0.8 [0 to 0.9]
             --mock              Mock community fasta file
             --closed_ref_only   Run only closed reference clustering.
             --map_filtered      Map quality filtered reads back to OTUs. Default: off
             --debug             Keep intermediate files.
             --cpus              Number of CPUs to use. Default: all
             -u, --usearch       USEARCH executable. Default: usearch9
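
For example, closed-reference clustering against the ITS2 database might look like this (file name is illustrative):

#map unique sequences to the ITS2 database at 97% identity
amptk cluster_ref -i mydata.demux.fq.gz -d ITS2 --id 97 -o mydata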

AMPtk Utilities

amptk filter

Removing index-bleed or sample cross-over from datasets is important for downstream community ecology analysis. AMPtk utilizes a mock community as a reference point for calculating the rate of index-bleed between samples. It then uses that value to remove read counts from an OTU table that fall below the index-bleed threshold. Each OTU is calculated separately, so that low-abundance OTUs are not indiscriminately removed. More information can be found here.

Usage:       amptk filter <arguments>
version:     1.3.0

Description: Script filters OTU table generated from the `amptk cluster` command and should
             be run on all datasets to combat barcode-switching or index-bleed (as high as
             2% in MiSeq datasets, ~ 0.3% in Ion PGM datasets).  This script works best when
             a spike-in control sequence is used, e.g. Synthetic Mock, although a mock is not required.

Required:    -i, --otu_table     OTU table
             -f, --fasta         OTU fasta

Optional:    -o, --out           Base name for output files. Default: use input basename
             -b, --mock_barcode  Name of barcode of mock community (Recommended)
             -m, --mc            Mock community FASTA file. Required if -b passed. [synmock,mock1,mock2,mock3,other]
             -c, --calculate     Calculate index-bleed options. Default: all [in,all]
             -d, --drop          Sample(s) to drop from OTU table. (list, separate by space)
             --negatives         Negative sample names. (list, separate by space)
             --ignore            Ignore sample(s) during index-bleed calc (list, separate by space)

Filtering:   -n, --normalize     Normalize reads to number of reads per sample [y,n]. Default: y
             -p, --index_bleed   Filter index bleed between samples (percent). Default: 0.005
             -t, --threshold     Number to use for establishing read count threshold. Default: max [max,sum,top5,top10,top25]
             -s, --subtract      Threshold to subtract from all OTUs (any number or auto). Default: 0
             --delimiter         Delimiter of OTU tables. Default: tsv  [csv, tsv]
             --min_reads_otu     Minimum number of reads for valid OTU from whole experiment. Default: 2
             --min_samples_otu   Minimum number of samples for valid OTU from whole experiment. Default: 1
             --col_order         Column order (separate by space). Default: sort naturally
             --keep_mock         Keep Spike-in mock community. Default: False
             --show_stats        Show OTU stats on STDOUT
             --debug             Keep intermediate files.
             -u, --usearch       USEARCH executable. Default: usearch9
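
For example, filtering with a synthetic mock community loaded into a hypothetical sample named mock (file names are illustrative):

#calculate index-bleed from the mock sample and filter the OTU table
amptk filter -i mydata.otu_table.txt -f mydata.cluster.otus.fa -b mock -m synmock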

amptk lulu

Script runs LULU post-clustering OTU table filtering. See doi:10.1038/s41467-017-01312-x.

Usage:       amptk lulu <arguments>
version:     1.3.0

Description: Script is a wrapper for the LULU OTU table post-clustering curation of amplicon
             data. The script calculates pairwise identity between the OTUs and then filters
             the OTU table based on whether closely related OTUs that share the same/similar
             distributions in the data are "daughters" of the "parent" OTU. Requires R and the
             LULU R package. doi:10.1038/s41467-017-01312-x

Arguments:   -i, --otu_table            Input OTU table (Required)
             -f, --fasta                Input OTUs in FASTA format (Required)
             -o, --out                  Output base name. Default: input basename
             --min_ratio_type           Minimum ratio threshold. Default: min [min,avg]
             --min_ratio                Minimum ratio. Default: 1
             --min_match                Minimum match pident (%). Default: 84
             --min_relative_cooccurence Minimum relative co-occurrence (%). Default: 95
             --debug                    Keep intermediate files.
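
A typical invocation might look like this (file names are illustrative):

#run LULU post-clustering curation on an OTU table and its OTUs
amptk lulu -i mydata.otu_table.txt -f mydata.cluster.otus.fa -o mydata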

amptk taxonomy

This script assigns taxonomy to OTUs and an OTU table. A variety of methods are available, more details are located here.

Usage:       amptk taxonomy <arguments>
version:     1.3.0

Description: Script maps OTUs to taxonomy information and can append to an OTU table (optional).
             By default the script uses a hybrid approach, e.g. gets taxonomy information from
             SINTAX, UTAX, and global alignment hits from the larger UNITE-INSD database, and
             then parses results to extract the most taxonomy information that it can at 'trustable'
             levels. SINTAX/UTAX results are used if the BLAST-like search percent identity is less than 97%.
             If % identity is greater than 97%, the result with the most taxonomy levels is retained.
             Run amptk info to see taxonomy databases installed.

Arguments:   -f, --fasta         Input FASTA file (i.e. OTUs from amptk cluster) (Required)
             -i, --otu_table     Input OTU table file (i.e. otu_table from amptk cluster)
             -o, --out           Base name for output file. Default: amptk-taxonomy.<method>.txt
             -d, --db            Select Pre-installed database [ITS1, ITS2, ITS, 16S, LSU, COI]. Default: ITS2
             -m, --mapping_file  QIIME-like mapping file
             -t, --taxonomy      Taxonomy calculated elsewhere. 2 Column file.
             --method            Taxonomy method. Default: hybrid [utax, sintax, usearch, hybrid, rdp, blast]
             --add2db            Add FASTA files to DB on the fly.
             --fasta_db          Alternative database of FASTA sequences to use for global alignment.
             --utax_db           UTAX formatted database. Default: ITS2.udb [See configured DB's below]
             --utax_cutoff       UTAX confidence value threshold. Default: 0.8 [0 to 0.9]
             --usearch_db        USEARCH formatted database. Default: USEARCH.udb
             --usearch_cutoff    USEARCH threshold percent identity. Default: 0.7
             --sintax_cutoff     SINTAX confidence value threshold. Default: 0.8 [0 to 0.9]
             -r, --rdp           Path to RDP Classifier. Required if --method rdp
             --rdp_db            RDP Classifier DB set. [fungalits_unite, fungalits_warcup, fungallsu, 16srrna]
             --rdp_cutoff        RDP Classifier confidence value threshold. Default: 0.8 [0 to 1.0]
             --local_blast       Local Blast database (full path) Default: NCBI remote nt database
             --tax_filter        Remove OTUs from OTU table that do not match filter, i.e. Fungi to keep only fungi.
             -u, --usearch       USEARCH executable. Default: usearch9
             --cpus              Number of CPUs to use. Default: all
             --debug             Keep intermediate files
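
For example, the default hybrid method with the ITS2 database might be run like this (file names are illustrative):

#assign taxonomy, append it to the OTU table, and build a BIOM file using the mapping file
amptk taxonomy -f mydata.cluster.otus.fa -i mydata.otu_table.txt -m mydata.mapping_file.txt -d ITS2 -o mydata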

amptk show

This utility will count the number of reads for each sample in a demultiplexed FASTQ file. Additionally, it measures read length for the entire dataset and allows you to quality trim using expected errors. Note that quality trimming is slow in this script and isn't intended for normal amplicon dataset processing.

Usage:       amptk show <arguments>
version:     1.3.0

Description: Script takes de-multiplexed data (.demux.fq) as input and counts reads per barcode.

Required:    -i, --input     Input FASTQ file (.demux.fq)
             --quality_trim  Quality trim reads
             -e, --maxee     maxEE threshold for quality. Default: 1.0
             -l, --length    Truncation length for trimming. Default: 250
             -o, --out       Output FASTQ file name (--quality_trim only)
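
A typical invocation might look like this (file name is illustrative):

#count reads per sample in a demultiplexed file
amptk show -i mydata.demux.fq.gz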

amptk select

This script allows you to keep samples from a demultiplexed FASTQ file, useful for keeping only samples that have more than a --threshold number of reads.

Usage:       amptk select <arguments>
version:     1.0.0

Description: Script filters de-multiplexed data (.demux.fq) to select only reads from samples
             provided in a text file, one name per line, or pass a list to keep to --list.

Required:    -i, --input      Input FASTQ file (.demux.fq)
             -t, --threshold  Keep samples with read count greater than -t
             -l, --list       List of sample (barcode) names to keep, separate by space
             -f, --file       List of sample (barcode) names to keep in a file, one per line
             -o, --out        Output file name
             --format         File format for output file. Default: fastq [fastq, fasta]
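
For example, to keep only samples with more than 1000 reads (file name and threshold are illustrative):

#keep samples with read counts above the threshold
amptk select -i mydata.demux.fq.gz -t 1000 -o mydata.select.fq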

amptk remove

This script allows you to drop samples from a demultiplexed FASTQ file, useful for removing samples that have low read counts or potentially belong to a different project.

Usage:       amptk remove <arguments>
version:     1.3.0

Description: Script filters de-multiplexed data (.demux.fq) to remove reads from samples
             provided in a text file, one name per line, or pass a list to remove to --list.

Required:    -i, --input      Input FASTQ file (.demux.fq)
             -t, --threshold  Keep samples with read count greater than -t
             -l, --list       List of sample (barcode) names to remove, separate by space
             -f, --file       List of sample (barcode) names to remove in a file, one per line
             -o, --out        Output file name
             --format         File format for output file. Default: fastq [fastq, fasta]
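
For example, to drop two samples by name (file and sample names are illustrative):

#remove the listed samples from the demultiplexed file
amptk remove -i mydata.demux.fq.gz -l sample1 sample2 -o mydata.dropped.fq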

amptk sample

This script will sub-sample or pseudo-rarefy a dataset to an equal number of reads per sample. Note that this should not be used during standard amplicon community analysis; however, there are some fringe use cases where it is appropriate.

Usage:       amptk sample <arguments>
version:     1.3.0

Description: Script sub-samples (rarefies) de-multiplexed data to equal number of reads per
             sample. For community analysis, this might not be appropriate as you are ignoring
             a portion of your data, however, there might be some applications where it is useful.

Required:    -i, --input       Input FASTQ file
             -n, --num_reads   Number of reads to sub-sample to
             -o, --out         Output FASTQ file name
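
For example, to sub-sample every sample to 5000 reads (file name and read count are illustrative):

#pseudo-rarefy to an equal number of reads per sample
amptk sample -i mydata.demux.fq.gz -n 5000 -o mydata.subsample.fq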

amptk drop

This script allows you to drop OTUs from an OTU table. Usage example would be that you identify OTUs that are from contamination and you want to remove them from the OTU table.

Usage:       amptk drop <arguments>
version:     1.3.0

Description: Script drops OTUs from dataset and outputs new OTU table

Required:    -i, --input     Input OTU file (.cluster.otus.fa) (FASTA)
             -r, --reads     Demultiplexed reads (.demux.fq) (FASTQ)
             -l, --list      List of OTU names to remove, separate by space
             -f, --file      List of OTU names to remove in a file, one per line
             -o, --out       Output file name. Default: amptk-drop
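
For example, to remove two contaminant OTUs (file and OTU names are illustrative):

#drop the listed OTUs and rebuild the OTU table
amptk drop -i mydata.cluster.otus.fa -r mydata.demux.fq.gz -l OTU2 OTU4 -o mydata-drop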

amptk stats

This script is a wrapper for Vegan/Phyloseq and is meant as a first-pass overview of your community ecology data. The script takes a BIOM file containing an OTU table, taxonomy, and metadata (the output of amptk taxonomy). The script then loops through all metadata and returns a hypothesis test (Adonis and Betadisper), an NMDS graph of the data, and an alpha diversity graph. This script requires R, Vegan, and Phyloseq. The script is considered beta as it is new.

Usage:       amptk stats <arguments>
version:     1.3.0

Description: A wrapper script for Phyloseq and Vegan R packages that draws NMDS of all
             treatments in a BIOM file (output from amptk taxonomy). The script also runs
             hypothesis tests (Adonis and Betadispersion) for each treatment.

Arguments:   -i, --biom          Input BIOM file with taxonomy and metadata (Required)
             -t, --tree          Phylogeny of OTUs (from amptk taxonomy) (Required)
             -d, --distance      Distance metric. Default: raupcrick [raupcrick,jaccard,bray,unifrac,wunifrac]
             -o, --out           Output base name. Default: amptk_stats
             --ignore_otus       Drop OTUs from table before running stats

Example 1:

amptk stats -i test.biom -t test.tree.phy -o test_stats
-------------------------------------------------------
[Feb 06 09:08 PM]: OS: MacOSX 10.14.3, 8 cores, ~ 17 GB RAM. Python: 3.6.7
[Feb 06 09:08 PM]: R v3.5.1; Phyloseq v1.26.0
[Feb 06 09:08 PM]: Running hypothesis test using bray distance metric on all treatments, drawing NMDS for each.
[Feb 06 09:10 PM]: HTML output files were generated for each treatment: test_stats
-------------------------------------------------------

The script will loop through your treatments and generate an HTML page for each treatment displaying an NMDS ordination, alpha diversity, and some summary stats. Hover-over is enabled on the ordination so you can easily identify outliers.

_images/stats-page.png

amptk summarize

This script will traverse the taxonomy tree from an OTU table that is appended with taxonomy information, i.e. the output of amptk taxonomy. It can optionally produce stacked bar graphs for each level of taxonomy.

Usage:       amptk summarize <arguments>
version:     1.3.0

Description: Script traverses the taxonomy information and creates an OTU table for each
             level of taxonomy, i.e. Kingdom, Phylum, Class, etc.  Optionally, it will
             create a Stacked Bar Graph for each taxonomy level for each sample. Requires
             Matplotlib, numpy, and pandas.

Arguments:   -i, --table     OTU Table containing Taxonomy information (Required)
             -o, --out       Base name for output files. Default: amptk-summary
             --graphs        Create stacked Bar Graphs.
             --format        Image output format. Default: eps [eps, svg, png, pdf]
             --percent       Convert numbers to Percent for Graphs. Default: off
             --font_size     Adjust font size for X-axis sample labels. Default: 8

Example 1:

amptk summarize -i test.otu_table.taxonomy.txt --graphs -o test --font_size 6 --format pdf
_images/summarize.pdf

Example 2:

amptk summarize -i test.otu_table.taxonomy.txt --graphs -o test --font_size 6 --format pdf --percent
_images/summarize-percent.pdf

amptk funguild

FunGuild is a tool for assigning functional information to OTUs. You use this script by simply providing an OTU table that has been appended with taxonomy, i.e. the otu_table.taxonomy.txt from amptk taxonomy.

Usage:       amptk funguild <arguments>
version:     1.3.0

Description: Script takes an OTU table as input and runs FUNGuild to assign functional annotation to an OTU
             based on the Guilds database.  Guilds script written by Zewei Song (2015).

Options:     -i, --input        Input OTU table
             -d, --db           Database to use [fungi, nematode]. Default: fungi
             -o, --out          Output file basename.
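
A typical invocation might look like this (file name is illustrative):

#assign FUNGuild annotations to a taxonomy-appended OTU table
amptk funguild -i mydata.otu_table.taxonomy.txt -d fungi -o mydata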

amptk meta

This script is an alternative to using the BIOM file format for downstream processing. It takes a metadata file in CSV format with the first column having sample IDs that match sample IDs in an OTU table. The script then pivots the OTU table and appends it to the metadata, which can be imported into something like Vegan in R.

Usage:       amptk meta <arguments>
version:     1.3.0

Description: Script takes a meta data file in CSV format (e.g. from Excel) and an OTU table as input.
             The first column of the meta data file must match the OTU table sample headers exactly.
             It then pivots the OTU table and appends it to the meta data file.

Required:    -i, --input       Input OTU table
             -m, --meta        Meta data table (csv format)
             -o, --out         Output (meta data + pivoted OTU table)
             --split_taxonomy  Make separate tables for groups of taxonomy [k,p,c,o,f,g]
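
A typical invocation might look like this (file names are illustrative):

#append the pivoted OTU table to a CSV metadata file
amptk meta -i mydata.otu_table.txt -m metadata.csv -o mydata.meta.csv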

amptk heatmap

Transform your OTU table into a heatmap using Seaborn and Matplotlib.

Usage:       amptk heatmap <arguments>
version:     1.3.0

Description: Script creates a heatmap from an OTU table.  Several settings are customizable.
             Requires Seaborn, matplotlib, numpy, and pandas.

Arguments:   -i, --input         Input OTU table (Required)
             -o, --output        Output file (Required)
             -m, --method        Type of heatmap. Default: clustermap [clustermap,heatmap]
             -d, --delimiter     Delimiter of OTU table. Default: tsv [tsv,csv]
             -f, --format        Figure format. Default: pdf [pdf,jpg,svg,png]
             --font              Font set. Default: arial
             --color             Color Palette. Default: gist_gray_r
             --figsize           Figure size. Default: 2x8
             --annotate          Annotate heatmap with values.
             --distance_metric   Distance metric to use for clustermap. Default: braycurtis
             --cluster_columns   Cluster the columns (samples). Default: False [True,False]
             --cluster_method    Clustering method for clustermap. Default: single [single,complete,average,weighted]
             --scaling           Scale the data by row. Default: None [None, z_score, standard]
             --yaxis_fontsize    Y-Axis Font Size. Default: 6
             --xaxis_fontsize    X-Axis Font Size. Default: 6
             --normalize         Normalize data based on totals from a tsv file: ID<tab>count
             --normalize_counts  Value to normalize counts to, i.e. 100000
             --vmax              Maximum value for heatmap coloration.
             --debug             Print pandas table on import to terminal
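
For example, to draw a clustered heatmap as a PDF (file names are illustrative):

#create a clustermap of the OTU table
amptk heatmap -i mydata.otu_table.txt -o mydata.heatmap.pdf -m clustermap -f pdf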

amptk SRA-submit

Submitting your data to NCBI SRA can be a real pain; this script tries to make it easier. Data submitted to SRA needs to be split up by sample, yet it should also be minimally processed: Illumina data should be raw (e.g. the output of bcl2fastq) and 454/Ion Torrent data should be demultiplexed by sample but otherwise untrimmed. This is where amptk SRA-submit can help. The script takes the raw input and outputs gzipped FASTQ files that are minimally processed for SRA. Moreover, if you create a BioProject and BioSamples for each of your samples prior to running the script, you can pass the BioSample worksheet from NCBI to the script and it will automatically generate an SRA submission file. You can customize some of the text in that file, e.g. via the --description argument.

Usage:       amptk SRA-submit <arguments>
version:     1.3.0

Description: Script aids in submitting your data to NCBI Sequence Read Archive (SRA) by splitting
             FASTQ file from Ion, 454, or Illumina by barcode sequence into separate files for
             submission to SRA.  This ensures your data is minimally processed as only barcodes
             are removed.  However, you can assert that primers must be found in order for
             sequences to be processed.  Additionally, you can pass the --biosample argument
             with an NCBI biosample tab-delimited file and the script will auto-populate an
             SRA submission file.

Arguments:   -i, --input         Input FASTQ file or folder (Required)
             -o, --out           Output base name. Default: sra
             -m, --mapping_file  QIIME-like mapping file.
             -b, --barcode_fasta Multi-FASTA file containing barcodes used.
             -s, --biosample     BioSample worksheet from NCBI (from confirmation email)
             -p, --platform      Sequencing platform. Default: ion (ion, illumina, 454)
             -n, --names         CSV name mapping file, e.g. BC_1,NewName
             -d, --description   Paragraph description for SRA experimental design. Use quotes to wrap paragraph.
             -f, --fwd_primer    Forward primer sequence. Default: fITS7
             -r, --rev_primer    Reverse primer sequence. Default: ITS4
             -a, --append        Append a name to the output of all files in run, i.e. run1 -> Sample_run1
             --primer_mismatch   Number of mismatches allowed in primer search. Default: 2
             --barcode_mismatch  Number of mismatches in barcode to allow. Default: 0
             --require_primer    Require primer(s) to be present for output. Default: off [off,forward,both]
             --min_len           Minimum length read to keep after trimming barcodes. Default: 50
             --force             Overwrite directory with same name
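
For example, an Ion Torrent run might be prepared for submission like this (file names are illustrative):

#split the raw run by barcode and auto-populate an SRA submission file
amptk SRA-submit -i run1.fastq -o run1 -b barcodes.fa -p ion -s BioSampleObjects.txt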

AMPtk Setup

amptk install

This simple script will download and unpack the pre-built reference databases.

Usage:       amptk install <arguments>
version:     1.3.0

Description: Script downloads pre-formatted databases for use with the `amptk taxonomy`
             command. You can download databases for fungal ITS, bacterial 16S, fungal
             LSU, or arthropod/chordate COI amplicons.

Arguments:   -i            Install Databases. Choices: ITS, 16S, LSU, COI
             --force       Over-write existing databases
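
For example, to download the fungal ITS and bacterial 16S databases:

#download and unpack pre-built databases
amptk install -i ITS 16S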

amptk database

Building reference databases is done with amptk database. It has built-in parsers for UNITE and RDP FASTA headers; see the discussion about AMPtk taxonomy for more information on FASTA headers.

Usage:       amptk database <arguments>
version:     1.3.0

Description: Setup/Format reference database for amptk taxonomy command.

Arguments:   -i, --fasta         Input FASTA file
             -o, --out           Base name for output files, i.e. ITS2
             -f, --fwd_primer    Forward primer. Default: fITS7
             -r, --rev_primer    Reverse primer. Default: ITS4
             --format            Reformat FASTA headers to UTAX format. Default: unite2utax [unite2utax, rdp2utax, off]
             --drop_ns           Remove sequences that have > x N's. Default: 8
             --create_db         Create a DB. Default: usearch [utax, usearch]
             --skip_trimming     Keep full length sequences. Default: off
             --derep_fulllength  Remove identical sequences.
             --lca               Run LCA (last common ancestor) on taxonomy if dereplicating sequences.
             --min_len           Minimum length to keep. Default: 100
             --max_len           Maximum length to keep. Default: 1200
             --trunclen          Truncate records to length.
             --subsample         Random subsample reads.
             --primer_mismatch   Max Primer Mismatch. Default: 2
             --keep_all          Keep Sequence if forward primer not found.
             --utax_trainlevels  UTAX custom parameters. Default: kpcofgs
             --utax_splitlevels  UTAX custom parameters. Default: NVkpcofgs
             --cpus              Number of CPUs to use. Default: all
             --install           Install into AMPtk Database
             -u, --usearch       USEARCH executable. Default: usearch9
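
For example, building a custom ITS2 database from a UNITE release might look like this (file name is illustrative):

#extract ITS2 with the fITS7/ITS4 primers, reformat UNITE headers, and install the database
amptk database -i UNITE_public.fasta -o ITS2_custom -f fITS7 -r ITS4 --format unite2utax --create_db usearch --install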

amptk primers

This command lists the primers that are available by name. You can always input the actual primer sequence instead.

----------------------------------
Primers hard-coded into AMPtk:
----------------------------------
16S_V3       CCTACGGGNGGCWGCAG
16S_V4       GACTACHVGGGTATCTAATCC
515FB        GTGYCAGCMGCCGCGGTAA
806RB        GGACTACNVGGGTWTCTAAT
COI-F        GGTCAACAAATCATAAAGATATTGG
COI-R        GGWACTAATCAATTTCCAAATCC
ITS1         TCCGTAGGTGAACCTGCGG
ITS1-F       CTTGGTCATTTAGAGGAAGTAA
ITS2         GCTGCGTTCTTCATCGATGC
ITS3         GCATCGATGAAGAACGCAGC
ITS3_KYO2    GATGAAGAACGYAGYRAA
ITS4         TCCTCCGCTTATTGATATGC
ITS4-B       CAGGAGACTTGTACACGGTCCAG
ITS4-B21     CAGGAGACTTGTACACGGTCC
JH-LS-369rc  CTTCCCTTTCAACAATTTCAC
LR0R         ACCCGCTGAACTTAAGC
LR2R         AAGAACTTTGAAAAGAG
fITS7        GTGARTCATCGAATCTTTG

AMPtk Downstream Tools

AMPtk to Phyloseq

Importing a BIOM table from AMPtk into Phyloseq is relatively straightforward:

#load the phyloseq package
library(phyloseq)

#load in biom and tree file
physeq <- import_biom('mydata.biom', 'mydata.tree.phy')

#rename taxonomy headings
colnames(tax_table(physeq)) <- c("Kingdom", "Phylum", "Class",
  "Order", "Family", "Genus", "Species")

#print the sample variables
sample_variables(physeq)

AMPtk to QIIME

The BIOM file produced by amptk taxonomy is compatible directly with QIIME community ecology scripts, assuming that you have a QIIME compatible mapping file:

summarize_taxa_through_plots.py -i otu_table.biom -o taxa_summary -m mapping_file.txt

AMPtk to Online Tools

The BIOM output is also compatible with Phinch web-based visualization, as well as MetaCoMET.

AMPtk is a series of scripts to process NGS amplicon data using USEARCH and VSEARCH. It can be used to process any NGS amplicon data and includes databases set up for analysis of fungal ITS, fungal LSU, bacterial 16S, and insect COI amplicons. It can handle Ion Torrent, MiSeq, and 454 data. As of AMPtk v0.7.0, at least USEARCH v9.1.13 and VSEARCH v2.2.0 were required.

Update November 2021: As of AMPtk v1.5.1, USEARCH is no longer a dependency, as all of the processing can be done with VSEARCH 64-bit. This means that default/hybrid taxonomy assignment uses just global alignment and SINTAX; however, the added benefit is that we are not constrained by the 4 GB RAM limit of the 32-bit version of USEARCH.

Citation

Palmer JM, Jusino MA, Banik MT, Lindner DL. 2018. Non-biological synthetic spike-in controls
        and the AMPtk software pipeline improve mycobiome data. PeerJ 6:e4925;
        DOI 10.7717/peerj.4925. https://peerj.com/articles/4925/

Install

There are several ways to install AMPtk; the easiest and recommended way is with Conda:

#setup your conda env with bioconda, type the following in order to setup channels
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

#create amptk env (optional)
conda create -n amptk amptk

You can install the Python portion of AMPtk with pip, but you will then need to install the external dependencies such as USEARCH, VSEARCH, and DADA2; the amptk stats script will also need to install its R dependencies.

pip install amptk

Users can also install manually by downloading a release. You can also build the latest unreleased version from GitHub:

#clone the repository
git clone https://github.com/nextgenusfs/amptk.git

#then install, optionally add --prefix to control the install location
python setup.py install --prefix /User/Tools/amptk

Dependencies Requiring Manual Install for Older Versions of AMPtk (Deprecated)

  1. AMPtk utilizes USEARCH9, which must be installed manually from the developer here. Obtain the proper version of USEARCH v9.2.64 and softlink it into the PATH:
#make executable
sudo chmod +x /path/to/usearch9.2.64_i86osx32

#create softlink
sudo ln -s /path/to/usearch9.2.64_i86osx32 /usr/local/bin/usearch9

1b) (optional) One script also requires USEARCH10, so you can download usearch10 and put it into your path as follows:

#make executable
sudo chmod +x /path/to/usearch10.0.240_i86osx32

#create softlink
sudo ln -s /path/to/usearch10.0.240_i86osx32 /usr/local/bin/usearch10
  2. (optional) LULU post-clustering OTU table filtering via amptk lulu requires the R package LULU. Install requires devtools.
#install devtools if you don't have already
install.packages('devtools')
library('devtools')
install_github("tobiasgf/lulu")

#not listed as a dependency, but on my system LULU also requires dplyr
install.packages('dplyr')

#or perhaps all of tidyverse
install.packages('tidyverse')

#could also install tidyverse from conda
conda install r-tidyverse

Dependencies installed via package managers

You only need to worry about these dependencies if you installed manually; some will also be necessary if you used Homebrew for installation (for example, Homebrew doesn't install R packages).

  1. AMPtk requires VSEARCH, which you can install from here. Note, if you use the Homebrew recipe it will be installed automatically; you can also use conda.
#install vsearch with homebrew
brew install vsearch

#or with bioconda
conda install -c bioconda vsearch
  2. Several Python modules are also required; they can be installed with pip or conda:
#install with pip
pip install -U biopython natsort pandas numpy matplotlib seaborn edlib biom-format psutil

#install with conda
conda install biopython natsort pandas numpy matplotlib seaborn python-edlib biom-format psutil
  3. (optional) The DADA2 denoising algorithm requires installation of R and DADA2. Instructions are located here.
#install with conda/bioconda
conda install r-base bioconductor-dada2
  4. (optional) To run some preliminary community ecology stats via amptk stats you will also need the R package Phyloseq. One way to install is with conda:
#install with conda/bioconda
conda install r-base bioconductor-phyloseq

Run from Docker

There is a base installation of AMPtk on Docker at nextgenusfs/amptk-base and then one with taxonomy databases at nextgenusfs/amptk. I've written a wrapper script that will run the Docker image; simply download the script and ensure it is executable.

  1. Download the amptk-docker wrapper script and make it executable:
$ wget -O amptk-docker https://raw.githubusercontent.com/nextgenusfs/amptk/master/amptk-docker
$ chmod +x amptk-docker

More Information