ERVsearch Documentation¶
Introduction¶
ERVsearch is a pipeline for identification of endogenous retrovirus like regions in a host genome, based on sequence similarity to known retroviruses.
ERVsearch screens for endogenous retrovirus (ERV) like regions in any FASTA file using the Exonerate algorithm (Slater and Birney, 2005, doi:10.1186/1471-2105-6-31).
In the Screen section, open reading frames (ORFs) resembling retroviral gag, pol and env genes are identified based on their level of similarity to a database of known complete or partial retroviral ORFs.
In the Classify section, these ORFs are classified into groups based on a database of currently classified retroviruses and phylogenetic trees are built.
In the ERVRegions section, regions with ORFs resembling more than one retroviral gene are identified.
This is a updated and expanded version of the pipeline used to identify ERVs in Brown and Tarlinton 2017 (doi: 10.1111/mam.12079), Brown et al. 2014 (doi: 10.1128/JVI.00966-14), Brown et al. 2012 (doi: j.virol.2012.07.010) and Tarlinton et al. 2012 (doi: 10.1016/j.tvjl.2012.08.011). The original version is available here as a Perl pipeline and was written by Dr Richard Emes.
Prerequisites¶
The pipeline is currently available for Unix-based systems only.
The ERVsearch pipeline requires the following freely available software. All packages are available via pip and easy_install
Python 3.5+ with the following packages:
The following commonly used software needs to be installed and in your $PATH
The following software also needs to be installed
Installation¶
The latest release of ERVsearch is available via pip3.
pip3 install ERVsearch
Alternatively, if you prefer to install directly, the latest release can be downloaded from Github.
The latest (beta) version can also be cloned from github
git clone https://github.com/KatyBrown/ERVsearch.git
No compliation is required, just add the ERVsearch directory to your path or use the full path to ERVsearch/ERVsearch
. If installed using pip you should be able to call ERVsearch
directly.
Quick Start¶
After cloning the repository, the program can be used as is (with the above prerequisites installed).
Make a copy of the pipeline.ini file (ERVsearch/templates/pipeline.ini or here) in your working directory (the directory in which you would like to store the output).
Download a local copy of your genome (or other sequence) of interest as a single FASTA file.
Edit your copy of pipeline.ini to configure the pipeline for your computer:
Add the path to the genome you want to screen (the fasta file in step 2) to the genome section e.g. hg38.fa saved in /home/myname/genome/hg38.fa would require the following options:
[genome]
file_=/home/myname/genome/hg38.fa
Add the paths to usearch and exonerate to the paths section e.g.
[paths]
path_to_usearch=/home/myname/usearch/usearch11.0.667_i86linux32
path_to_exonerate=/home/myname/exonerate/bin/exonerate
Run the pipeline in your working directory as:
ERVsearch --target_tasks full -v5
Pipeline Description¶
The pipeline is designed to identify regions resembling retroviral gag, pol and env genes in a genome (or other set of sequences) and to perform various analyses on these regions.
It is divided into three sections:
Screen¶
Screens the genome for ERV like regions by comparing the genome to a set of known retroviral ORFs using Exonerate.
Confirms the Exonerate regions using UBLAST
Finds and confirms ORFs within these regions
Finds the most similar known retroviral ORF in the database to each of the newly identified ORFs
Classify¶
Classifies the newly identified ORFs into groups based on the most similar known ORF
Aligns the newly identified ORFs with reference sequences within these groups and builds a phylogenetic tree for each group.
Finds clusters of newly identified ORFs within these trees
Incorporates representative sequences from these clusters into a summary tree for each retroviral gene and genus (based on classification into gamma, beta, spuma, alpha, lenti, epsilon and delta retroviruses as definied by the ICTV (https://talk.ictvonline.org/taxonomy).
Input Files¶
Required Input Files¶
1: FASTA file to screen for ERVs
The main input file is sequence file in FASTA format containing DNA sequences from the genome which you wish to screen for ERV-like regions. This would usually be a reference or de novo assembled genome but can be any set of DNA sequences.
Reference genome sequences are available from Ensembl and UCSC (amongst others).
To be used as an input file, the reference genome needs to be contained in a single FASTA file.
For Ensembl genomes, this would usually be the GENOMEID.dna.toplevel.fa.gz
file from the “download DNA sequence” page for the appropriate organism, substituting GENOMEID
for the genome ID (e.g. GRCh38)
For UCSC genomes this would be GENOMEID.fa.gz
from the bigZips directory for this organism on the FTP server, substituting GENOMEID
for the genome ID (e.g. hg38)
It is possible to use a gzipped or zipped file, in which case the filename needs to end with .gz or .zip respectively.
2: pipeline.ini
file
This file is a configuration file in ini format containing the parmeters you wish to use. This file needs to be in your working directory - the folder in which you wish to run ERVsearch.
A template pipeline.ini
file should be used and edited - this file is available as ERVsearch/templates/pipeline.ini
or here
Options specified as !?
are required, all others have a default value.
Optional Input Files¶
Custom Databases¶
By default, ERVsearch will use the provided database of 774 ERV nucleotide sequences and corresponding amino acid sequences as a query against the provided genome. This database is designed to be representative of known retroviruses and to identify the majority of ERVs. However, a more specific custom database can also be provided and used for the initial screen.
To do this, the database_use_custom_db parameter in the pipeline.ini can be set to True
. The query sequences should be stored as FASTA files of amino acid sequences, with one file per retroviral gene. Only gag, pol and env genes are currently supported. Very short sequences (less than ~100 amino acids) should be avoided where possible.
The paths to these files are then specified in the database section of the pipeline.ini
e.g.
[database]
use_custom_db=True
gag=/home/katy/my_databases/gag_ervs.fasta
pol=/home/katy/my_databases/pol_ervs.fasta
env=/home/katy/my_databases/env_ervs.fasta
Currently, custom databases are only used for the initial Exonerate screen and UBLAST check, after this all classification based steps will use the default databases, as these sequences have been classified into subgroups for phylogenetic analysis.
Sequence List¶
keep_chroms.txt
A list of chromosome names to include. The names should the names in the fasta file cropped at the first space, e.g. “NW_006711271.1 Panthera tigris altaica isolate TaeGuk unplaced genomic scaffold” should just be listed as NW_006711271.1. The names should be listed with one name per line, are case sensitive and need to be identical to those in the fasta file. This file needs to be named keep_chroms.txt and in the working directory.
Databases¶
Several sets of reference sequences are provided as part of this package.
For each retroviral gene (gag, pol and env), a representative set of retroviral open reading frames has been selected from NCBI Genbank and various publications.
These files are provided as:
ERVsearch/ERV_db/gag.fasta
- gag gene amino acid sequences
ERVsearch/ERV_db/pol.fasta
- pol gene amino acid sequences
ERVsearch/ERV_db/env.fasta
- env gene amino acid sequences
ERVsearch/ERV_db/all_ERVs_nt.fasta
- all ORFs as nucleotide sequences
ERVsearch/ERV_db/all_ERVs_aa.fasta
- all ORFs as amino acid sequences
A number of subsets of sequences are also provided to use in phylogenetic analysis.
These are:
ERVsearch/phylogenies/group_phylogenies/*fasta
Small groups of nucleotide sequences from the ORF database which are closely related, selected manually as representatives of these groups based on prior knowledge, sequence similarity and phylogenetic analysis. Newly identified sequences are assigned to these groups where possible.
ERVsearch/phylogenies/summary_phylogenies/*fasta
Broader groups of nucleotide sequences from the ORF database for each gene (gag, pol and env) and each genus (gamma, beta, delta, alpha, epsilon, lenti and spuma). Newly identified sequences are incorporated into phylogenetic trees based on these sequences, plus more closely related sequences from the group_phylogenies
fasta files.
Two addtional files are also provided:
ERVsearch/ERV_db/convert.tsv
Table showing the group each reference sequence belongs to.
ERVsearch/phylogenies/outgroups.tsv
Table providing the name of an appropriate outgroup for each phylogeny.
If you want to see these files (ERVsearch will locate them automatically for internal use), then if you are using a clone of the git repository, database files can be found in ERVsearch/ERV_db
and ERVsearch/phylogenies
. If you installed using pip, they will be in the same location in the ERVsearch directory in your python site-packages directory.
Usage¶
Running the Pipeline¶
The pipeline is implemented using the pipeline development package ruffus, (Goodstadt 2010, doi:10.1093/bioinformatics/btq524).
To run the full pipeline, the following command is used:
ERVsearch --target_tasks full
If the pipeline stops or fails at any point, it will restart from after the previous fully completed step (based on the presence of the output files from this step). If the output of an earlier step in the pipeline has a more recent timestamp than the output of a later step, the later step will be rerun.
Sections of the pipeline can be run as follows:
ERVsearch --target_tasks Screen
Screen with Exonerate and check the results with UBLAST, find ORFs and find the most similar known retroviral ORF.
ERVsearch --target_tasks Classify
Run the Screen steps and then sort the sequences into subgroups, build phylogenetic trees for these groups and a summary tree for each gene and genus.
ERVsearch --target_tasks ERVRegions
Run the Screen steps and then find regions with ORFs resembling more than one retroviral gene in close proximity.
ERVsearch --target_tasks full
Run all the above sections.
You can also run the pipeline up until any specific function - any function name can be provided for the target_tasks
parameter.
For example to run up until the end of the function classifyWithExonerate
, use the following command.
ERVsearch --target_tasks classifyWithExonerate
All functions prior to this function will run if needed.
Parallelisation¶
The pipeline is paralellised to run jobs simultaneously where possible. To do this, set the parameter--jobs N
in the command line, where N is the number of CPUs available on your machine.
e.g.
ERVsearch --target_tasks full --jobs 8
This would run on 8 CPUs
If running on a high performance cluster, it is recommended to use a single node and set --jobs
to the number of cores available on that node.
Verbosity¶
Ruffus verbosity is set using the -v
parameter from 1 to 10. The recommended setting for ERvsearch is -v 5, however this needs to be specified to override the ruffus default of 1.
e.g.
ERVsearch --target_tasks full -v 5
Main Output Files¶
The main output files produced are as follows. Many additional output files are generated, these are described here.
Screen¶
screen_results.dir/results.tsv
Table showing the ORFs identified with Exonerate and verified with UBLAST which meet the requirements specified in the pipeline.ini
file.
Columns are as follows:
name
Name assigned to the region consisting of the ID, chromosome, start and end positionsmatch
The most similar reference ORF to this ORF identified in theERVsearch/ERV_db
database using the ungapped Exonerate algorithm.perc_identity
The percentage identity between this sequence and the most similar reference ORF, based on the UBLAST output.*alignment_length
The length of the alignment of this sequence and the most similar reference ORF, based on the UBLAST output.evalue
The UBLAST e-value of the alignment of this sequence and the most similar reference ORF.bit_score
The UBLAST bit score of the alignment of this sequence and the most similar reference ORF.ID
ID assigned to this ORFchrom
Chromosome (or scaffold, contig or sequence) on which this ORF was identified.start
Start position of this ORF on the chromosome.end
End position of this ORF on the chromosome.strand
Positive sense (+) or negative sense (+)group
Local group to which the most similar reference ORF belongs. If the reference ORF is not in a group, this is genus_gene.genus
Genus to which the most similar reference ORF belongs.length
ORF length in nucleotides.gene
Retroviral gene - gag, pol or env.
screen_results.dir/by_length.FMT
Histograms of ORF lengths (in nucleotides) based on the results.tsv table, for the gag, pol and env genes.
screen_results.dir/by_genus.FMT
Bar charts showing the number of ORFs identified for each genus and gene based on the results.tsv table.
screen_results.dir/by_group.FMT
Bar charts showing the number of ORFs identified in each small subgroup of reference sequences for each retroviral gene. ORFs assigned as genus_gene were related to a reference sequence which is not in a smaller subgroup, based on the results.tsv table.
screen_results.dir/by_gene.FMT
Bar chart showing the number of ORFs identified for each gene, based on the results.tsv table.
Classify¶
summary_trees.dir/*FMT
(can be png, jpg, svg or pdf depending on the plots_format parameter).
Image files of the phylogenetic trees for each retroviral gene and genus. Different sized circles are used to show the relative size of collapsed monophyletic groups. Newly identified ERVs are highlighted.
summary_trees.dir/*tre
Tree files in Newick format for each retroviral gene and genus combining reference and newly identified sequences. Monophyletic groups of newly identified ORF sequences are represented by a single sequence.
ERVRegions¶
erv_regions_results.dir/results.tsv
Table summarising regions identified containing retrovirus-like ORFs from more than one gene.
Columns:
name - the final ID of the ERV region - the genes found plus an integer e.g. gag_pol_12
chrom - chromosome
start - start position of the ERV region
end - end position of the ERV region
strand - strand of the ERv region
genus - genus of the ERV region, can be multiple genera delimted by “|” if different genes had different genera
for each gene screened for (usually gag, pol and env)
GENE_name - the names of the ORFs for this gene in this region
GENE_ID - the original IDs of the ORFs for this gene in this region
GENE_start - the start position of this gene in this region (genome co-ordinates)
GENE_relative_start - the start position of this gene in this region (relative to the start of the region)
GENE_end - the end position of this gene in this region (genome co-ordinates)
GENE_relative_end - the end position of this gene in this region (relative to the start of the region)
GENE_strand - the strand for this gene in this region
GENE_match - the closest reference retrovirus to this gene in this region
GENE_group - the group of the closest reference retrovirus to this gene in this region
GENE_genus - the genus of the closest reference retrovirus to this gene in this region
orig_name - the name of the region in the input table
Parameters¶
All parameters should be specified in the pipeline.ini
configuration file
The template file can be found in ERvsearch/templates/pipeline.ini
.
Make a copy of this file in your working directory (the directory where you want to run the program and store the output) and change the values of the parameters according to your system and your needs.
Required parameters¶
These parameters should always be set (there are no default options).
paths
2.1 path_to_usearch
2.2 path_to_exonerate
Optional Parameters¶
The following parameters are optional (as they have default values)
genomesplits
1.1 split
1.2 split_n
1.3 forceoutput
2.1 outfile_stemdatabase
3.1 use_custom_db
3.2 gag
3.3 pol
3.4 envusearch
5.1 min_id
5.2 min_hit_length
5.3 min_coverageplots
6.1 dpi
6.2 format
6.3 gag_colour
6.4 pol_colour
6.5 env_colourORFs
7.1 min_orf_len
7.2 translation_tabletrees
8.1 use_gene_colour
8.2 maincolour
8.3 highlightcolour
8.4 outgroupcolour
8.5 dpi
8.6 format
genomesplits¶
Parameters for dividing the input genome into batches
If the genome has more than ~100 contigs or scaffolds, it is recommended to batch these rather than running Exonerate on each contig individually, to avoid creating an excessive number of output files
The default is to split the input into 50 batches, this is a manageable number for most systems.
split¶
string
True
or False
Default: True
If split is True the contigs will be batched, if it is False Exonerate will run once for every gene-contig combination (usually 3x number of contigs). If there are less contigs than batches this will be ignored
force¶
string
True
or False
Default: False
The pipeline will error if running using these genome split settings will run Exonerate more than 500 times. If you want to force the pipeline to run despite this, change force to True.
output¶
Output file parameters
outfile_stem¶
string
Default: ERVsearch
Log files will have this as a prefix (e.g. ERVsearch_log.txt)
database¶
Parameters concerning the database of reference ERV sequences
string
True
or False
use_custom_db¶
Default: False
When screening using Exonerate, query sequences are used to identify ERV like regions of the genome. It is possible to use the default ERV database provided with the pipeline (recommended) or to use a custom database.
False - use the default database
True - use a custom database
gag¶
string
Default: None
Path to a custom fasta file of gag amino acid sequences. To skip this gene use None as this value (only if use_custom_db is True)
pol¶
string
Default: None
Path to a custom fasta file of pol amino acid sequences. To skip this gene use None as this value (only if use_custom_db is True).
env¶
string
Default: None
Path to a custom fasta file of env amino acid sequences. To skip this gene use None as this value (only if use_custom_db is True).
exonerate¶
Exonerate parameters
min_hit_length¶
integer
Default: 100
Minimum Exonerate hit length on the chromosome. Shorter hits are filtered out.
overlap¶
integer
Default: 30
Maximum distance between chromosome regions identified with Exonerate which are merged into single regions.
min_score¶
integer
Default: 100
Minimum score in the second exonerate pass (with ungapped algorithm).
usearch¶
USEARCH parameters
min_id¶
float
Default: 0.5
Percentage identity used by the UBLAST algorithm. Used to set the -id
UBLAST parameter
min_hit_length¶
integer
Default: 100
Minimum hit length for UBLAST. Used to set the -mincols
UBLAST parameter
min_coverage¶
float
Default: 0.5
Minimum proportion of the query sequence which should be covered using UBLAST. Used to set the -query_cov
UBLAST parameter
plots¶
Plotting parameters
gag_colour¶
string
(hex colour code with #)
Default: #f38918
Colour for gag gene in summary plots. Default is orange.
pol_colour¶
string
(hex colour code with #)
Default: #4876f9
Colour for pol gene in summary plots. Default is blue.
env_colour¶
string
(hex colour code with #)
Default: #d61f54
Colour for env gene in summary plots. Default is pink
other_colour¶
string
(hex colour code with #)
Default: #33b54d
Colour for anything which doesn’t relate to a specific gene in summary plots. Default is green.
match_axes¶
string
True
or False
Default: False
If True, when gag, pol and env are shown as subplots on the same figure they should all have the same axis limits. If they do some can be very small but they are more comparable.
ORFs¶
Parameters for ORF identification
translation_table¶
integer
Default 1
Translation table to use when identifying ORFs - listed here https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi. Usually table 1 is fine - default plus alternative initiation codons
trees¶
Phylogenetic tree parameters
use_gene_colour¶
string
True
or False
Default: True
Use the colours specified in the plots section for each gene to highlight newly identified ERVs. If this is True
, highlightcolour
will be ignored and the gene colours will be used instead.If it is False
, highlightcolour
is used to highlight newly identified ERVs.
maincolour¶
string
(hex colour code with #)
Default: #000000
Main colour for text in tree images. Default is black.
highlightcolour¶
string
(hex colour code with #)
Default : #382bfe
Text colour for leaves highlighted in tree images - newly identified ERV-like regions. This is ignored if use_gene_colour
above is True
. Default is blue.
outgroupcolour¶
string
(hex colour code with #)
Default: #0e954b
Text colour for outgroup in tree images. Default is green.
dpi¶
integer
Default: 300
Dots per inch for phylogenetic tree images.
format¶
string
png
, svg
, pdf
or jpg
Default: png
File format for phylogenetic tree images, can be svg, png, pdf or jpg
regions¶
Parameters for defining regions with ORFs from more than one retroviral gene
Functions¶
1.Pipeline Schematic
2.Screen
3.Classify
4.ERVRegions
Pipeline Schematic¶
large version
../images/pipeline.png
NB: Where
GENE
is specified in a file path one file will be created for each gene - gag, pol and env (unless otherwise specified by the user).
Screen¶
Screens the genome for ERV like regions by comparing the genome to a set of known retroviral ORFs using Exonerate.
Confirms the Exonerate regions using UBLAST
Finds and confirms ORFs within these regions
Finds the most similar known retroviral ORF in the database to each of the newly identified ORFs
initiate¶
Input Files
pipeline.ini
Output Files
init.txt
Parameters
[genome] file
[paths] path_to_usearch
[paths] path_to_exonerate
Initialises the pipeline and checks that the required parameters in the pipeline.ini are set and valid and that the required software is in your $PATH.
Checks that:
The input genome file exists.
The correct path to ERVsearch is provided.
samtools, bedtools, FastTree and mafft are in the $PATH
The correct paths to usearch and exonerate are provided.
init.txt
is a placeholder to show that this step has been completed.
genomeToChroms¶
Input Files
[genome] file
keep_chroms.txt
Output Files
host_chromosomes.dir/*fasta
Parameters
[genome] file
[genomesplits] split
[genomesplits] split_n
[genomesplits] force
Splits the host genome provided by the user into FASTA files of a suitable size to run Exonerate efficiently.
If genomesplits_split
in the pipeline.ini is False, the genome is split into one fasta file for each sequence - each chromosome, scaffold or contig.
If genomesplits_split
in the pipeline.ini is True, the genome is split into the number of batches specified by the genomesplits_splitn
parameter, unless the total number of sequences in the input file is less than this number.
The pipeline will fail if the number of sequences which would result from the genomesplits settings would result in >500 Exonerate runs, however it is possible to force the pipeline to run despite this by setting genomesplits_force
to True.
If the file keep_chroms.txt exists in the working directory only chromosomes listed in this file will be kept.
An unzipped copy of zipped and gzipped fasta files will be created or a link to the file if it is already unzipped. this will be named genome.fa
and be in the working directory.
This function generates a series of fasta files which are stored in the host_chromosomes.dir directory.
prepDBs¶
Input Files
None
Output Files
gene_databases.dir/GENE.fasta
Parameters
[database] use_custom_db
[database] gag
[database] pol
[database] env
Retrieves the gag, pol and env amino acid sequence database fasta files and puts a copy of each gene_databases.dir directory.
If custom databases are used they are retrieved and named as gag.fasta pol.fasta, env.fasta so the path doesn’t need to be changed every time.
runExonerate¶
Input Files
gene_databases.dir/GENE.fasta
host_chromosomes.dir/*fasta
Output Files
raw_exonerate_output.dir/GENE_*.tsv
Parameters
[paths] path_to_exonerate
Runs the protein2dna
algorithm in the Exonerate software package with the host chromosomes (or other regions) in host_chromosomes.dir
as target sequences and the FASTA files from prepDBs as the query sequences.
The raw output of Exonerate is stored in the raw_exonerate_output directory, one file is created for each combination of query and target sequences.
This step is carried out with low stringency as results are later filtered using UBLAST and Exonerate.
cleanExonerate¶
Input Files
raw_exonerate_output.dir/GENE_*.tsv
Output_Files
clean_exonerate_output.dir/GENE_*_unfiltered.tsv
clean_exonerate_output.dir/GENE_*_filtered.tsv
clean_exonerate_output.dir/GENE_*.bed
Parameters
[exonerate] min_hit_length
Filters and cleans up the Exonerate output.
Converts the raw Exonerate output files into dataframes - GENE_unfiltered.tsv
Filters out any regions containing introns (as defined by Exonerate)
Filters out regions less than
exonerate_min_hit_length
on the host sequence (in nucleotides).Outputs the filtered regions to GENE_filtered.tsv
Converts this to bed format and outputs this to GENE.bed
mergeOverlaps¶
Input Files
clean_exonerate_output.dir/GENE_*.bed
Output_Files
gene_bed_files.dir/GENE_all.bed
gene_bed_files.dir/GENE_merged.bed
Parameters
[exonerate] overlap
Merges the output bed files for individual sections of the input genome into a single bed file.
Overlapping regions or very close together regions of the genome detected by Exonerate with similarity to the same retroviral gene are then merged into single regions. This is performed using bedtools merge on the bed files output by cleanExonerate.
If there is a gap of less than exonerate_overlap
between the regions they will be merged.
makeFastas¶
Input Files
gene_bed_files.dir/GENE_merged.bed
genome.fa
Output Files
gene_fasta_files.dir/GENE_merged.fasta
Parameters
None
Fasta files are generated containing the sequences of the merged regions of the genome identified using mergeOverlaps. These are extracted from the host chromosomes using bedtools getfasta.
renameFastas¶
Input Files
gene_fasta_files.dir/GENE_merged.fasta
Output Files
gene_fasta_files.dir/GENE_merged_renamed.fasta
Parameters
None
Renames the sequences in the fasta files of ERV-like regions identified with Exonerate so each record has a numbered unique ID (gag1, gag2 etc). Also removes “:” from sequence names as this causes problems later.
makeUBLASTDb¶
Input Files
gene_databases.dir/GENE.fasta
Output Files
UBLAST_db.dir/GENE_db.udb
Parameters
[paths] path_to_usearch
USEARCH requires an indexed database of query sequences to run. This function generates this database for the three gene amino acid fasta files used to screen the genome.
runUBLASTCheck¶
Input Files
UBLAST_db.dir/GENE_db.udb
gene_fasta_files.dir/GENE_merged_renamed.fasta
Output Files
ublast.dir/GENE_UBLAST_alignments.txt
ublast.dir/GENE_UBLAST.tsv
ublast.dir/GENE_filtered_UBLAST.fasta
Parameters
[paths] path_to_usearch
[usearch] min_id
[usearch] min_hit_length
[usearch] min_coverage
ERV regions in the fasta files generated by makeFasta are compared to the ERV amino acid database files for a second time, this time using USEARCH (https://www.drive5.com/usearch/). Using both of these tools reduces the number of false positives.
This allows sequences with low similarity to known ERVs to be filtered out. Similarity thresholds can be set in the pipeline.ini file (usearch_min_id,
- minimum identity between query and target - usearch_min_hit_length
- minimum length of hit on target sequence - and usearch_min_coverage
- minimum proportion of the query sequence the hit should cover).
The raw output of running UBLAST against the target sequences is saved in GENE_UBLAST_alignments.txt (equivalent to the BLAST default output) and GENE_UBLAST.tsv (equivalent to the BLAST -outfmt 6 tabular output) this is already filtered by passing the appropriate parameters to UBLAST. The regions which passed the filtering and are therefore in these output files are then output to a FASTA file GENE_filtered_UBLAST.fasta.
classifyWithExonerate¶
Input Files
ublast.dir/GENE_filtered_UBLAST.fasta
ERVsearch/ERV_db/all_ERVs_nt.fasta
Output Files
exonerate_classification.dir/GENE_all_matches_exonerate.tsv
exonerate_classification.dir/GENE_best_matches_exonerate.tsv
exonerate_classification.dir/GENE_refiltered_matches_exonerate.fasta
Parameters
[paths] path_to_exonerate
[exonerate] min_score
Runs the Exonerate ungapped algorithm with each ERV region in the fasta files generated by makeFasta as queries and the all_ERVs_nt.fasta fasta file as a target, to detect which known retrovirus is most similar to each newly identified ERV region. Regions which don’t meet a minimum score threshold (exonerate_min_score
) are filtered out.
all_ERVs_nt.fasta contains nucleic acid sequences for many known endogenous and exogenous retroviruses with known classifications.
First all seqeunces are compared to the database and the raw output is saved as exonerate_classification.dirGENE_all_matches_exonerate.tsv. Results need a score greater than exonerate_min_score
against one of the genes of the same type (gag, pol or env) in the database. The highest scoring result which meets these critera for each sequence is then identified and output to exonerate_classification.dir/GENE_best_matches_exonerate.tsv. The sequences which meet these critera are also output to a FASTA file exonerate_classification.dir/GENE_refiltered_exonerate.fasta.
getORFs¶
Input Files
exonerate_classification.dir/GENE_refiltered_matches_exonerate.fasta
Output Files
ORFs.dir/GENE_orfs_raw.fasta
ORFs.dir/GENE_orfs_nt.fasta
ORFs.dir/GENE_orfs_aa.fasta
Parameters
[orfs] translation_table
[orfs] min_orf_len
Finds the longest open reading frame in each of the ERV regions in the filtered output table.
This analysis is performed using EMBOSS revseq and EMBOSS transeq.
The sequence is translated in all six frames using the user specified translation table. The longest ORF is then identified. ORFs shorter than orfs_min_orf_length are filtered out.
The positions of the ORFs are also convered so that they can be extracted directly from the input sequence file, rather than using the co-ordinates relative to the original Exonerate regions.
The raw transeq output, the nucleotide sequences of the ORFs and the amino acid sequences of the ORFs are written to the output FASTA files.
checkORFsUBLAST¶
Input Files
ORFs.dir/GENE_orfs_nt.fasta
UBLAST_dbs.dir/GENE_db.udb
Output Files
ublast_orfs.dir/GENE_UBLAST_alignments.txt
ublast_orfs.dir/GENE_UBLAST.tsv
ublast_orfs.dir/GENE_filtered_UBLAST.fasta
Parameters
[paths] path_to_usearch
[usearch] min_id
[usearch] min_hit_length
[usearch] min_coverage
ERV ORFs in the fasta files generated by the ORFs function are compared to the original ERV amino acid files using UBLAST. This allows any remaining sequences with poor similarity to known ERVs to be filtered out.
This allows ORFs with low similarity to known ERVs to be filtered out. Similarity thresholds can be set in the pipeline.ini file (usearch_min_id,
- minimum identity between query and target - usearch_min_hit_length
- minimum length of hit on target sequence - and usearch_min_coverage
- minimum proportion of the query sequence the hit should cover).
The raw output of running UBLAST against the target sequences is saved in GENE_UBLAST_alignments.txt (equivalent to the BLAST default output) and GENE_UBLAST.tsv (equivalent to the BLAST -outfmt 6 tabular output) this is already filtered by passing the appropriate parameters to UBLAST. The regions which passed the filtering and are therefore in these output files are then output to a FASTA file GENE_filtered_UBLAST.fasta.
assignGroups¶
Input Files
ublast_orfs.dir/GENE_UBLAST.tsv
ERVsearch/ERV_db/convert.tsv
Output Files
grouped.dir/GENE_groups.tsv
Parameters
None
Many of the retroviruses in the input database all_ERVs_nt.fasta have been classified into groups based on sequence similarity, prior knowledge and phylogenetic clustering. Some sequences don’t fall into any well defined group, in these cases they are just assigned to a genus, usually based on prior knowledge. The information about these groups is stored in the provided file ERVsearch/ERV_db/convert.tsv.
Each sequence in the filtered fasta file of newly identified ORFs is assigned to one of these groups based on the sequence identified as the most similar in the classifyWithExonerate step.
The output table is also tidied up to include the UBLAST output, chromosome, ORF start and end positions, genus and group.
summariseScreen¶
Input Files
gene_bed_files.dir/GENE_merged.bed
ublast.dir/GENE_UBLAST.tsv
ORFs.dir/GENE_orfs_aa.fasta
ublast_orfs.dir/GENE_UBLAST.tsv
Output Files
FMT can be png, svg, pdf, jpg depending on the plot_format
parameter
summary_tables.dir/exonerate_initial_summary.txt
summary_tables.dir/ublast_hits_initial_summary.txt
summary_tables.dir/orfs_initial_summary.txt
summary_tables.dir/ublast_orfs_initial_summary.txt
summary_plots.dir/exonerate_initial_lengths.FMT
summary_plots.dir/exonerate_initial_scores.FMT
summary_plots.dir/exonerate_initial_strands.FMT
summary_plots.dir/exonerate_initial_by_sequence.FMT
summary_plots.dir/exonerate_initial_counts_per_gene.FMT
summary_plots.dir/ublast_hits_alignment_length.FMT
summary_plots.dir/ublast_hits_perc_similarity.FMT
summary_plots.dir/ublast_hits_by_match.FMT
summary_plots.dir/ublast_hits_per_gene.FMT
summary_plots.dir/orfs_lengths.FMT
summary_plots.dir/orfs_strands.FMT
summary_plots.dir/orfs_by_gene.FMT
summary_plots.dir/ublast_orfs_alignment_length.FMT
summary_plots.dir/ublast_orfs_perc_similarity.FMT
summary_plots.dir/ublast_orfs_bit_score.FMT
summary_plots.dir/ublast_orfs_by_match.FMT
summary_plots.dir/ublast_orfs_per_gene.FMT
screen_results.dir/results.tsv
screen_results.dir/by_length.FMT
screen_results.dir/by_genus.FMT
screen_results.dir/by_group.FMT
screen_results.dir/by_gene.FMT
Parameters
[plots] dpi
[plots] format
[plots] gag_colour
[plots] pol_colour
[plots] env_colour
[plots] other_colour
[plots] match_axes
Generates a series of summary plots and tables showing the results of running the screening functions.
The major outputs of this function are stored in the screen_results.dir directory. Further details of these files are provided in the Main Output Files section.
The other files show the output of the intermediate steps.
Exonerate Initial
summary_tables.dir/exonerate_initial_summary.txt
Summary of the output of the initial Exonerate screening step. Note that these are unfiltered and many will not be true ERVs.summary_tables.dir/ublast_hits_initial_summary.txt
Summary of the results of running UBLAST on the initial Exonerate output.summary_tables.dir/orfs_initial_summary.txt
Summary of the results of the initial ORF identification.summary_tables.dir/ublast_orfs_initial_summary.txt
Summary of the results of running UBLAST on these ORFs.summary_plots.dir/exonerate_initial_lengths.FMT
Histogram showing the lengths of the initial Exonerate regions for each gene.summary_plots.dir/exonerate_initial_scores.FMT
Histogram showing the Exonerate score of the initial Exonerate regions for each gene.summary_plots.dir/exonerate_initial_strands.FMT
Bar chart showing the number of regions identified on each strand in the initial Exonerate screen.summary_plots.dir/exonerate_initial_by_sequence.FMT
Histogram showing the number of ERV-like regions identified on each sequence in the reference genome being screened.summary_plots.dir/exonerate_initial_counts_per_gene.FMT
Bar chart showing the number of ERV regions identified per gene in the initial Exonerate screen.
UBLAST
summary_plots.dir/ublast_hits_alignment_length.FMT
Histogram showing the lengths of the alignments of the UBLAST filtered Exonerate regions and the most similar reference ORF, based on the UBLAST output.summary_plots.dir/ublast_hits_perc_similarity.FMT
Histogram showing the percentage identity between the UBLAST filtered Exonerate regions and the most similar reference ORF, based on the UBLAST output.summary_plots.dir/ublast_hits_perc_similarity.FMT
Histogram showing the UBLAST bit score between the UBLAST filtered Exonerate regions and the most similar reference ORF, based on the UBLAST output.summary_plots.dir/ublast_hits_by_match.FMT
Bar chart showing the number of UBLAST filtered Exonerate regions most similar to each reference ORF in the ERVsearch/ERV_db database.summary_plots.dir/ublast_hits_per_gene.FMT
Bar chart showing the number of UBLAST filtered Exonerate regions identified per gene.
ORFs
summary_plots.dir/orfs_lengths.FMT
Histogram of the lengths of ORFs identified in the ERV regions.summary_plots.dir/orfs_strands.FMT
Bar chart of the strand (positive (+) or negative (-) sense) of the ORFs identified in the ERV regions.summary_plots.dir/orfs_by_gene.FMT
Bar chart of the number of ORFs identified for each gene.
UBLAST ORFs
summary_plots.dir/ublast_orfs_alignment_length.FMT
Histogram showing the lengths of the alignments of the ERV-like ORFs and the most similar reference ORF, based on the UBLAST output.summary_plots.dir/ublast_orfs_perc_similarity.FMT
Histogram showing the percentage identity between the ERV-like ORFs and the most similar reference ORF, based on the UBLAST output.summary_plots.dir/ublast_orfs_bit_score.FMT
Histogram showing the UBLAST bit score between the ERV-like ORFs and the most similar reference ORF, based on the UBLAST output.summary_plots.dir/ublast_orfs_by_match.FMT
Bar chart showing the number of ERV-like ORFs most similar to each reference ORF in the ERVsearch/ERV_db database.summary_plots.dir/ublast_orfs_per_gene.FMT
Bar chart showing the number of ERV-like ORFs identified per gene.
Screen¶
Input Files
None
Output Files
None
Parameters
None
Helper function to run all screening functions (all functions prior to this point).
Classify¶
Classifies the newly identified ORFs into groups based on the most similar known ORF
Aligns the newly identified ORFs with reference sequences within these groups and builds a phylogenetic tree for each group.
Finds clusters of newly identified ORFs within these trees
Incorporates representative sequences from these clusters into a summary tree for each retroviral gene and genus (based on classification into gamma, beta, spuma, alpha, lenti, epsilon and delta retroviruses as defined by the ICTV.
makeGroupFastas¶
Input Files
grouped.dir/GENE_groups.tsv
ERVsearch/phylogenies/group_phylogenies/*fasta
ERVsearch/phylogenies/summary_phylogenies/*fasta
ERVsearch/phylogenies/outgroups.tsv
Output Files
group_fastas.dir/GENE_(.*)_GENUS.fasta
group_fastas.dir/GENE_(.*)_GENUS_A.fasta
Parameters
None
Two sets of reference fasta files are available (files are stored in ERVsearch/phylogenies/group_phylogenies
and ERVsearch/phylogenies/summary_phylogenies
)
group_phylogenies - groups of closely related ERVs for fine classification of sequences
summary_phylogenies - groups of most distant ERVs for broad classification of sequences
Sequences have been assigned to groups based on the most similar sequence in the provided ERV database, based on the score using the Exonerate ungapped algorithm. Where the most similar sequence is not part of a a well defined group, it has been assigned to a genus.
Fasta files are generated containing all members of the group from the group_phylogenies file (plus an outgroup) where possible and using representative sequences from the same genus, using the summary_phylogenies file, where only a genus has been assigned, plus all the newly identified ERVs in the group. These files are saved as GENE_(group_name_)GENUS.fasta.
A “~” is added to all new sequence names so they can be searched for easily.
The files are aligned using the MAFFT fftns algorithm https://mafft.cbrc.jp/alignment/software/manual/manual.html to generate the GENE_(group_name_)GENUS_A.fasta aligned output files.
makeGroupTrees¶
Input Files
group_fastas.dir/GENE_(.*_)GENUS_A.fasta
Output Files
group_trees.dir/GENE_(.*_)GENUS.tre
Parameters
None
Builds a phylogenetic tree, using the FastTree2 algorithm (http://www.microbesonline.org/fasttree) with the default settings plus the GTR model, for the aligned group FASTA files generated by the makeGroupFastas function.
drawGroupTrees¶
Input Files
group_trees.dir/GENE_(.*_)GENUS.tre
Output Files
group_trees.dir/GENE_(.*_)GENUS.FMT
(png, svg, pdf or jpg)
Parameters
[plots] gag_colour
[plots] pol_colour
[plots] env_colour
[trees] use_gene_colour
[trees] maincolour
[trees] highlightcolour
[trees] outgroupcolour
[trees] dpi
[trees] format
Generates an image file for each file generated in the makeGroupTrees step, using ete3 (http://etetoolkit.org). Newly identified sequences are labelled as “~” and shown in a different colour.
By default, newly identified sequences are shown in the colours specified in plots_gag_colour
, plots_pol_colour
and plots_env_colour
- to do this then trees_use_gene_colour
should be set to True in the pipeline.ini
. Alternatively, a fixed colour can be used by setting trees_use_gene_colour
to False and settings trees_highlightcolour
. The text colour of the reference sequences (default black) can be set using trees_maincolour
and the outgroup using trees_outgroupcolour
.
The output file DPI can be specified using trees_dpi
and the format (which can be png, svg, pdf or jpg) using trees_format
.
makeSummaryFastas¶
Input Files
group_fastas.dir/GENE_(.*_)GENUS.fasta
group_trees.dir/GENE_(*_)GENUS.tre
ERVsearch/phylogenies/summary_phylogenies/GENE_GENUS.fasta
ERVsearch/phylogenies/group_phylogenies/(.*)_GENUS_GENE.fasta
Output Files
summary_fastas.dir/GENE_GENUS.fasta
summary_fastas.dir/GENE_GENUS.tre
Parameters
None
Based on the group phylogenetic trees generated in makeGroupTrees, monophyletic groups of newly idenified ERVs are identified. For each of these groups, a single sequence (the longest) is selected as representative. The representative sequences are combined with the FASTA files in ERVsearch/phylogenies/summary_phylogenies
, which contain representative sequences for each retroviral gene and genus. These are extended to include further reference sequences from the same small group as the newly identified sequences.
For example, if one MLV-like pol and one HERVF-like pol was identified in the gamma genus, the gamma_pol.fasta summary fasta would contain:
* The new MLV-like pol sequence
* The new HERVF-like pol sequence
* The reference sequences from ERVsearch/phylogenies/group_phylogenies/MLV-like_gamma_pol.fasta
- highly related sequences from the MLV-like group
* The reference sequences from ERVsearch/phylogenies/group_phylogenies/HERVF-like_gamma_pol.fasta
- highly related sequences from the HERVF-like group.
* The reference sequences from ERVsearch/phylogenies/summary_phylogenies/gamma_pol.fasta
- a less detailed but more diverse set of gammaretroviral pol ORFs.
* A epsilonretrovirus outgroup
This ensures sufficient detail in the groups of interest while avoiding excessive detail in groups where nothing new has been identified.
These FASTA files are saved as GENE_GENUS.fasta
The files are aligned using the MAFFT fftns algorithm https://mafft.cbrc.jp/alignment/software/manual/manual.html to generate the GENE_GENUS_A.fasta aligned output files.
makeSummaryTrees¶
Input Files
summary_fastas.dir/GENE_GENUS_A.fasta
Output Files
summary_trees.dir/GENE_GENUS.tre
Parameters
None
Builds a phylogenetic tree, using the FastTree2 algorithm (http://www.microbesonline.org/fasttree) with the default settings plus the GTR model, for the aligned group FASTA files generated by the makeSummaryFastas function.
drawSummaryTrees¶
Input Files
summary_trees.dir/GENE_GENUS.tre
Output Files
summary_trees.dir/GENE_GENUS.FMT
(FMT = png, svg, pdf or jpg)
Parameters
[plots] gag_colour
[plots] pol_colour
[plots] env_colour
[trees] use_gene_colour
[trees] maincolour
[trees] highlightcolour
[trees] outgroupcolour
[trees] dpi
[trees] format
Generates an image file for each file generated in the makeSummaryTrees step, using ete3 (http://etetoolkit.org). Newly identified sequences are labelled as “~” and shown in a different colour. Monophyletic groups of newly identified ERVs have been collapsed (by choosing a single representative sequence) and the number of sequences in the group is added to the label and represented by the size of the node tip.
By default, newly identified sequences are shown in the colours specified in plots_gag_colour
, plots_pol_colour
and plots_env_colour
- to do this then trees_use_gene_colour
should be set to True in the pipeline.ini
. Alternatively, a fixed colour can be used by setting trees_use_gene_colour
to False and settings trees_highlightcolour
. The text colour of the reference sequences (default black) can be set using trees_maincolour
and the outgroup using trees_outgroupcolour
.
The output file DPI can be specified using trees_dpi
and the format (which can be png, svg, pdf or jpg) using trees_format
.
summariseClassify¶
Input Files
summary_fastas.dir/GENE_GENUS.fasta
summary_trees.dir/GENE_GENUS.tre
Output Files
classify_results.dir/results.tsv
classify_results.dir/by_gene_genus.FMT
(png, svg, pdf or jpg)
Parameters
None
Combines the results of the classify steps to generate additional summary files. The results.tsv
output file lists the number of genes which have been collapsed into each group in the trees in the summary_trees.dir
directory. The by_gene_genus.FMT plot is a bar chart of the same information, organised by gene and genus.
Classify¶
Input Files None
Output Files None
Parameters None
Helper function to run all screening functions and classification functions (all functions prior to this point).
ERVRegions¶
Identifies regions of the genome containing ORFs resembling more than one different retroviral gene within a certain distance
makeCleanBeds¶
Input Files
grouped.dir/GENE_groups.tsv
Output Files
clean_beds.dir/GENE.bed
Parameters
None
Generates a bed file for each gene which contains the co-ordinates of the ORFs which have passed all filtering criteria in the Screen section.
makeCleanFastas¶
Input Files
clean_beds.dir/GENE.bed
genome.fa
Output Files
clean_fastas.dir/GENE.fasta
Parameters
None
Fasta files are generated containing the sequences of the regions listed by makeCleanBeds. These are extracted from the host chromosomes using bedtools getfasta (https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html).
findERVRegions¶
Input Files
clean_fastas.dir/*.fasta
Output Files
ERV_regions.dir/all_ORFs.bed
ERV_regions.dir/all_regions.bed
ERV_regions.dir/multi_gene_regions.bed
ERV_regions.dir/regions.fasta
Parameters
[regions] maxdist
Combines the files containng the ORF regions for the different retroviral genes and merges any regions which are within regions_maxdist
of each other to find larger regions containing multiple genes.
The all_ORFs.bed
output file is the concatenated and sorted bed files, all_regions.bed
contains the merged regions with any ORFs within regions_maxdist of each other (end to end) combined, plus all regions with a single ORF, generated from all_regions.bed using bedtools merge (https://bedtools.readthedocs.io/en/latest/content/tools/merge.html). The name, strand and score columns are concatenated for merged regions, delimited with a “,”.
multi_gene_regions.bed
contains only the regions which were found to contain multiple ORFs, regions.fasta
is the sequence of these regions in FASTA format. At this point this includes regions with multiple ORFs from the same gene (e.g. two pol ORFs).
makeRegionTables¶
Input Files
ERV_regions.dir/multi_gene_regions.bed
grouped.dir/*_groups.tsv
genome.fa
Output Files
ERV_regions.dir/ERV_regions_final.tsv
ERV_regions.dir/ERV_regions_final.bed
ERV_regions.dir/ERV_regions_final.fasta
Parameters
[regions] maxoverlap
Takes a merged bed file consisting of regions of the genome identified as having more than one ERV-like ORF, finds the regions within this file which contain more than one different gene (e.g. gag and pol instead of two gag ORFs) and outputs a formatted table of information about these regions.
The output table (ERV_regions_final.tsv
) will usually have 37 columns:
name
- the final ID of the ERV region - the genes found plus an integer e.g. gag_pol_12chrom
- chromosomestart
- start position of the ERV regionend
- end position of the ERV regionstrand
- strand of the ERv regiongenus
- genus of the ERV region, can be multiple genera delimted by “|” if different genes had different generafor each gene screened for (usually gag, pol and env)
GENE_name
- the names of the ORFs for this gene in this regionGENE_ID
- the original IDs of the ORFs for this gene in this regionGENE_start
- the start position of this gene in this region (genome co-ordinates)GENE_relative_start
- the start position of this gene in this region (relative to the start of the region)GENE_end
- the end position of this gene in this region (genome co-ordinates)GENE_relative_end
- the end position of this gene in this region (relative to the start of the region)GENE_strand
- the strand for this gene in this regionGENE_match
- the closest reference retrovirus to this gene in this regionGENE_group
- the group of the closest reference retrovirus to this gene in this regionGENE_genus
- the genus of the closest reference retrovirus to this gene in this region
orig_name
- the name of the region in the input table
If not all genes are screened for the table will not have the columns for this gene.
A bed file (ERV_regions_final.bed
) is generated with the co-ordinates of the identified regions and a FASTA file (ERV_regions_final.fasta
) containing their sequences.
plotERVRegions¶
Input_Files
ERV_regions.dir/ERV_regions_final.tsv
Output_Files
ERV_region_plots.dir/*FMT
Parameters
[plots] format
[plots] dpi
[plots] gag_colour
[plots] pol_colour
[plots] env_colour
For each region containing ORFs resembling more than one retroviral gene, a plot is generated showing how these ORFs are distributed on the genome relative to each other.
Each gene is shown on a different line on the y axis, the x axis is chromosome co-ordinates.
summariseERVRegions¶
Input Files
ERV_regions.dir/ERV_regions_final.tsv
Output Files
erv_regions_results.dir/results.tsv
[erv_regions_results.dir/erv_regions.FMT
](introduction.html#id3
Parameters
[plots] other_colour
Combines the results of the ERVregions steps to generate additional summary files.
The results.tsv
output file is a copy of the output of the makeRegionTables functions.
This will usually have 37 columns:
name
- the final ID of the ERV region - the genes found plus an integer e.g. gag_pol_12chrom
- chromosomestart
- start position of the ERV regionend
- end position of the ERV regionstrand
- strand of the ERv regiongenus
- genus of the ERV region, can be multiple genera delimted by “|” if different genes had different generafor each gene screened for (usually gag, pol and env)
GENE_name
- the names of the ORFs for this gene in this regionGENE_ID
- the original IDs of the ORFs for this gene in this regionGENE_start
- the start position of this gene in this region (genome co-ordinates)GENE_relative_start
- the start position of this gene in this region (relative to the start of the region)GENE_end
- the end position of this gene in this region (genome co-ordinates)GENE_relative_end
- the end position of this gene in this region (relative to the start of the region)GENE_strand
- the strand for this gene in this regionGENE_match
- the closest reference retrovirus to this gene in this regionGENE_group
- the group of the closest reference retrovirus to this gene in this regionGENE_genus
- the genus of the closest reference retrovirus to this gene in this region
orig_name
- the name of the region in the input table
A bar chart - erv_regions_results.dir/erv_regions.FMT
is also generated showing the number of ERV regions found with each combination of genes.
ERVRegions¶
Input Files None
Output Files None
Parameters None
Helper function to run all screening functions and ERVRegions functions.
Minor Output Files¶
1.Screen
2.Classify
3.ERVRegions
Screen¶
initiate¶
init.txt
Placeholder file to show initial checks have been run.
genomeToChroms¶
host_chromosomes.dir/*fasta
FASTA format files containing the input genome (or other sequence), divided into regions based on the genomesplits parameters.
prepDBs¶
gene_databases.dir/GENE.fasta
Copies of the fasta files of reference retroviral amino acid sequences.
runExonerate¶
raw_exonerate_output.dir/GENE_*.tsv
Raw “vulgar” output of Exonerate as described here
cleanExonerate¶
clean_exonerate_output.dir/GENE_*_unfiltered.tsv
Raw exonerate output converted into a table.
Columns:
query_id
Reference amino acid sequence ID for retroviral genequery_start
Start position of match within reference sequencequery_end
End position of match within reference sequencequery_strand
Strand of match relative to reference sequencetarget_id
Nucleotide sequence from input which matched the retroviral gene.target_start
Start position of match within input sequence.target_end
End position of match within output sequence.target_strand
Strand of match relative to input sequencescore
Exonerate score of matchdetails
Additional columns (columns 11+) from Exonerate vulgar output, delimited by “|”.length
Length of match on input sequence
clean_exonerate_output.dir/GENE_*_filtered.tsv
Output of Exonerate filtered to remove regions containing introns and regions which are shorter than exonerate_min_hit_length. Columns are the same as in the unfiltered table.
clean_exonerate_output.dir/GENE_*.bed
ERV-like regions from the table above in bed
mergeOverlaps¶
gene_bed_files.dir/GENE_all.bed
,
ERV-like regions from clean_exonerate_output.dir/GENE_*.bed
combined into a single bed file.
gene_bed_files.dir/GENE_merged.bed
ERV-like regions from the previous file merged using Bedtools merge
makeFastas¶
gene_fasta_files.dir/GENE_merged.fasta
Fasta files of the merged regions in gene_bed_files.dir/GENE_merged.bed
renameFastas¶
gene_fasta_files.dir/GENE_merged_renamed.fasta
Fasta files from the makeFastas
step with the ERV-like regions renamed with unique IDs.
makeUBLASTDb¶
UBLAST_db.dir/GENE_db.udb
UBLAST databases for the reference retroviral amino acid sequences.
runUBLASTCheck¶
ublast.dir/GENE_UBLAST_alignments.txt
Raw UBLAST output files for the Exonerate regions vs the retroviral amino acid databases. Equivalent to the BLAST pairwise output.
ublast.dir/GENE_UBLAST.tsv
UBLAST tabular output for Exonerate regions vs the retrovrial amino acid databases. Equivalent to the BLAST tabular output.
ublast.dir/GENE_filtered_UBLAST.fasta
Fasta file of the regions which passed the UBLAST filter.
classifyWithExonerate¶
exonerate_classification.dir/GENE_all_matches_exonerate.tsv
Raw output of ungapped Exonerate algorithm for the UBLAST verified regions against the ERV amino acid database.
Columns:
ID of newly identified ERV-like region
ID of reference ERV amino acid
Exonerate score
exonerate_classification.dir/GENE_best_matches_exonerate.tsv
The previous table filtered to list only the highest scoring hit for each ERV-like region.
exonerate_classification.dir/GENE_refiltered_matches_exonerate.fasta
Highest scoring hits from the previous table in FASTA format.
getORFs¶
ORFs.dir/GENE_orfs_raw.fasta
Raw output of running EMBOSS transeq -frame 6 on the regions output from classifyWithExonerate
.
ORFs.dir/GENE_orfs_nt.fasta
ORFs longer than ORFs_min_orf_length as nucleotide sequences, with IDs redefined to include the chromosome, start and end position and strand of the ORF.
ORFs.dir/GENE_orfs_aa.fasta
ORFs longer than ORFs_min_orf_length as amino acid sequences, with IDs redefined to include the chromosome, start and end position and strand of the ORF.
checkORFsUBLAST¶
ublast_orfs.dir/GENE_UBLAST_alignments.txt
Raw UBLAST output files for the newly identified ORFs vs the retroviral amino acid databases. Equivalent to the BLAST pairwise output.
ublast_orfs.dir/GENE_UBLAST.tsv
UBLAST tabular output for newly identified ORFs vs the retrovrial amino acid databases. Equivalent to the BLAST tabular output.
ublast_orfs.dir/GENE_filtered_UBLAST.fasta
Fasta file of the newly identified ORFs which passed the UBLAST filter.
assignGroups¶
grouped.dir/GENE_groups.tsv
Summarised output for the previous steps. This is identical to screen_results.dir/results.tsv
.
summariseScreen¶
FMT can be png, svg, pdf, jpg depending on the plot_format
parameter
The major outputs of this function are stored in the screen_results.dir directory. Further details of these files are provided in the Main Output Files section.
The other files show the output of the intermediate steps.
Exonerate Initial
summary_tables.dir/exonerate_initial_summary.txt
Summary of the output of the initial Exonerate screening step. Note that these are unfiltered and many will not be true ERVs.summary_tables.dir/ublast_hits_initial_summary.txt
Summary of the results of running UBLAST on the initial Exonerate output.summary_tables.dir/orfs_initial_summary.txt
Summary of the results of the initial ORF identification.summary_tables.dir/ublast_orfs_initial_summary.txt
Summary of the results of running UBLAST on these ORFs.summary_plots.dir/exonerate_initial_lengths.FMT
Histogram showing the lengths of the initial Exonerate regions for each gene.summary_plots.dir/exonerate_initial_scores.FMT
Histogram showing the Exonerate score of the initial Exonerate regions for each gene.summary_plots.dir/exonerate_initial_strands.FMT
Bar chart showing the number of regions identified on each strand in the initial Exonerate screen.summary_plots.dir/exonerate_initial_by_sequence.FMT
Histogram showing the number of ERV-like regions identified on each sequence in the reference genome being screened.summary_plots.dir/exonerate_initial_counts_per_gene.FMT
Bar chart showing the number of ERV regions identified per gene in the initial Exonerate screen.
UBLAST
summary_plots.dir/ublast_hits_alignment_length.FMT
Histogram showing the lengths of the alignments of the UBLAST filtered Exonerate regions and the most similar reference ORF, based on the UBLAST output.summary_plots.dir/ublast_hits_perc_similarity.FMT
Histogram showing the percentage identity between the UBLAST filtered Exonerate regions and the most similar reference ORF, based on the UBLAST output.summary_plots.dir/ublast_hits_perc_similarity.FMT
Histogram showing the UBLAST bit score between the UBLAST filtered Exonerate regions and the most similar reference ORF, based on the UBLAST output.summary_plots.dir/ublast_hits_by_match.FMT
Bar chart showing the number of UBLAST filtered Exonerate regions most similar to each reference ORF in the ERVsearch/ERV_db database.summary_plots.dir/ublast_hits_per_gene.FMT
Bar chart showing the number of UBLAST filtered Exonerate regions identified per gene.
ORFs
summary_plots.dir/orfs_lengths.FMT
Histogram of the lengths of ORFs identified in the ERV regions.summary_plots.dir/orfs_strands.FMT
Bar chart of the strand (positive (+) or negative (-) sense) of the ORFs identified in the ERV regions.summary_plots.dir/orfs_by_gene.FMT
Bar chart of the number of ORFs identified for each gene.
UBLAST ORFs
summary_plots.dir/ublast_orfs_alignment_length.FMT
Histogram showing the lengths of the alignments of the ERV-like ORFs and the most similar reference ORF, based on the UBLAST output.summary_plots.dir/ublast_orfs_perc_similarity.FMT
Histogram showing the percentage identity between the ERV-like ORFs and the most similar reference ORF, based on the UBLAST output.summary_plots.dir/ublast_orfs_bit_score.FMT
Histogram showing the UBLAST bit score between the ERV-like ORFs and the most similar reference ORF, based on the UBLAST output.summary_plots.dir/ublast_orfs_by_match.FMT
Bar chart showing the number of ERV-like ORFs most similar to each reference ORF in the ERVsearch/ERV_db database.summary_plots.dir/ublast_orfs_per_gene.FMT
Bar chart showing the number of ERV-like ORFs identified per gene.
Classify¶
makeGroupFastas¶
group_fastas.dir/GENE_(.*)_GENUS.fasta
Fasta files for each small subgroup of ERV-like ORFs and reference sequences.
group_fastas.dir/GENE_(.*)_GENUS_A.fasta
Aligned version of the above Fasta file generated using MAFFT
makeGroupTrees¶
`group_trees.dir/GENE_(.*)_GENUS.tre
Phylogenetic trees in Newick format for each small subgroup of ERV-like ORFs and reference sequences generated using FastTree
drawGroupTrees¶
group_trees.dir/GENE_(.*_)GENUS.FMT
(png, svg, pdf or jpg)
Image files of the phylogenetic trees for each small subgroup of ERV-like ORFs and reference sequences, with newly identified sequences highlighted.
makeSummaryFastas¶
summary_fastas.dir/GENE_GENUS.fasta
Fasta files for each retroviral gene and genus combining reference and newly identified sequences. Monophyletic groups of newly identified ORfs sequences are represented by a single sequence.
group_lists.dir/*tsv
Lists of sequences in each monophyletic group of newly identified ORFs which has been collated to be represented by a single sequnce.
makeSummaryTrees¶
summary_trees.dir/GENE_GENUS.tre
Tree files in Newick format for each retroviral gene and genus combining reference and newly identified sequences. Monophyletic groups of newly identified ORfs sequences are represented by a single sequence.
drawSummaryTrees¶
summary_trees.dir/GENE_GENUS.FMT
(FMT = png, svg, pdf or jpg)
Images files of the phylogenetic trees for each retroviral gene and genus. Different sized circles are used to show the relative size of collapsed monophyletic groups. Newly identified ERVs are highlighted.
summariseClassify¶
classify_results.dir/results.tsv
Table listing the number of genes which have been collapsed into each monophyletic group in the trees in the summary_trees.dir
directory.
Columns:
gene Retroviral gene for this group
genus Retroviral genus for this group
group Group ID
count Number of sequences in this group
classify_results.dir/by_gene_genus.png
Bar chart showing the number of genes which have been collapsed into each monophyletic group, organised by gene and genus.
ERVRegions¶
makeCleanBeds¶
clean_beds.dir/GENE.bed
Bed files containing the co-ordinates all the ERV-like ORFs output by the Screen section.
makeCleanFastas¶
clean_fastas.dir/GENE.fasta
FASTA files of the ERV-like ORFs output by the Screen section.
findERVRegions¶
ERV_regions.dir/all_ORFs.bed
Concatenated version of the bed files output by the makeCleanBeds
function with all three genes in a single file.
ERV_regions.dir/all_regions.bed
Previous bed file merged using Bedtools merge, with regions within regions_maxdist merged.
ERV_regions.dir/multi_gene_regions.bed
Filtered version of the previous bed file with only regions which were merged.
ERV_regions.dir/regions.fasta
Fasta file of the merged regions.
makeRegionTables¶
ERV_regions.dir/ERV_regions_final.tsv
Table showing the details of the combined regions containing ORFs resembling more than one retroviral gene. This table is identical to erv_region_results.dir/results.tsv
.<br<
ERV_regions.dir/ERV_regions_final.bed
Bed file with the co-ordinates of the identified regions.
ERV_regions.dir/ERV_regions_final.fasta
FASTA file of the regions in the bed file above.
plotERVRegions¶
ERV_region_plots.dir/*.FMT
Plots showing the distributions of ORFs resembling each retroviral gene on the genome. Each gene is shown on a different line on the y axis, the x axis is chromosome co-ordinates. One plot is generated for each multi-gene region.
summariseERVRegions¶
erv_region_results.dir/results.tsv
Table showing the overall results for regions with ORFs resembling multiple retroviral genes. This table is described in full in the Main Output Files section.
erv_region_results.dir/erv_regions.png
Bar chart showing the number of ERV regions identified with each combination of retroviral genes.