Parameters¶

All parameters should be specified in the pipeline.ini configuration file.

The template file can be found in ERVsearch/templates/pipeline.ini.

Make a copy of this file in your working directory (the directory where you want to run the program and store the output) and change the values of the parameters according to your system and your needs.

Required parameters¶

These parameters should always be set (there are no default options).

genome
1.1 file
paths
2.1 path_to_usearch
2.2 path_to_exonerate

genome¶

Input file parameters

file¶

string

Path to a single fasta file containing the genome or other sequence you would like to screen for ORFs

e.g. /home/katy/genomes/hg38.fasta

paths¶

Paths to software

path_to_usearch¶

string

Path to usearch executable

e.g. /usr/bin/usearch11.0.667_i86linux32

path_to_exonerate¶

string

Path to exonerate executable

e.g. /usr/bin/exonerate
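
As a rough illustration of the required parameters, the sketch below parses a minimal configuration with Python's standard configparser module and reports any required value that is missing. It assumes that pipeline.ini uses standard INI sections named after the headings above ([genome] and [paths]); the helper missing_required and the paths shown are hypothetical and are not part of the pipeline.

```python
import configparser

# Minimal pipeline.ini contents with only the required parameters set.
# The section names are assumed to match the headings in this document,
# and the paths are placeholders to replace with your own.
EXAMPLE_INI = """
[genome]
file = /home/katy/genomes/hg38.fasta

[paths]
path_to_usearch = /usr/bin/usearch
path_to_exonerate = /usr/bin/exonerate
"""

# (section, parameter) pairs that have no default value.
REQUIRED = [
    ("genome", "file"),
    ("paths", "path_to_usearch"),
    ("paths", "path_to_exonerate"),
]


def missing_required(config):
    """Return the required (section, parameter) pairs that are absent or empty."""
    return [
        (section, key)
        for section, key in REQUIRED
        if not config.get(section, key, fallback="").strip()
    ]


config = configparser.ConfigParser()
config.read_string(EXAMPLE_INI)
print(missing_required(config))  # prints [] when nothing is missing
```

Running a check like this before starting a long job is one way to catch a forgotten path early.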

Optional Parameters¶

The following parameters are optional, as they have default values.

genomesplits
1.1 split
1.2 split_n
1.3 force
output
2.1 outfile_stem
database
3.1 use_custom_db
3.2 gag
3.3 pol
3.4 env
exonerate
4.1 min_hit_length
4.2 overlap
4.3 min_score
usearch
5.1 min_id
5.2 min_hit_length
5.3 min_coverage
plots
6.1 dpi
6.2 format
6.3 gag_colour
6.4 pol_colour
6.5 env_colour
6.6 other_colour
6.7 match_axes
ORFs
7.1 min_orf_len
7.2 translation_table
trees
8.1 use_gene_colour
8.2 maincolour
8.3 highlightcolour
8.4 outgroupcolour
8.5 dpi
8.6 format
regions
9.1 maxdist
9.2 maxoverlap

genomesplits¶

Parameters for dividing the input genome into batches

If the genome has more than ~100 contigs or scaffolds, it is recommended to batch these rather than running Exonerate on each contig individually, to avoid creating an excessive number of output files.

The default is to split the input into 50 batches, which is a manageable number for most systems.

split¶

string True or False
Default: True

If split is True, the contigs will be batched. If it is False, Exonerate will run once for every gene-contig combination (usually 3x the number of contigs). If there are fewer contigs than batches, this setting is ignored.

split_n¶

integer
Default: 50

Number of batches to split the contigs into

force¶

string True or False
Default: False

The pipeline will stop with an error if the current genome split settings would cause Exonerate to be run more than 500 times. To force the pipeline to run despite this, change force to True.
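
The interaction between split, split_n and force can be sketched as follows. This is not the pipeline's own code: it assumes three genes (gag, pol and env) and that, with batching on, Exonerate runs once per gene per batch. The function names exonerate_runs and check_runs are hypothetical.

```python
N_GENES = 3      # gag, pol and env
MAX_RUNS = 500   # limit described above


def exonerate_runs(n_contigs, split=True, split_n=50, n_genes=N_GENES):
    """Estimate how many times Exonerate would be run."""
    if split and n_contigs > split_n:
        units = split_n      # contigs are grouped into split_n batches
    else:
        units = n_contigs    # one run per contig (batching off or ignored)
    return units * n_genes


def check_runs(n_contigs, split=True, split_n=50, force=False):
    """Raise an error if the run count exceeds the limit and force is False."""
    runs = exonerate_runs(n_contigs, split, split_n)
    if runs > MAX_RUNS and not force:
        raise RuntimeError(
            f"{runs} Exonerate runs exceeds {MAX_RUNS}; "
            "batch the contigs or set force to True"
        )
    return runs


print(check_runs(n_contigs=2000))                           # 150
print(check_runs(n_contigs=2000, split=False, force=True))  # 6000
```

With the defaults, a 2000-contig assembly stays well under the limit, whereas turning batching off would need force set to True.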

output¶

Output file parameters

outfile_stem¶

string
Default: ERVsearch

Log files will have this as a prefix (e.g. ERVsearch_log.txt).

database¶

Parameters concerning the database of reference ERV sequences

use_custom_db¶

string True or False
Default: False

When screening using Exonerate, query sequences are used to identify ERV-like regions of the genome. It is possible to use the default ERV database provided with the pipeline (recommended) or to use a custom database.

False - use the default database

True - use a custom database

gag¶

string
Default: None

Path to a custom fasta file of gag amino acid sequences. To skip this gene use None as this value (only if use_custom_db is True).

pol¶

string
Default: None

Path to a custom fasta file of pol amino acid sequences. To skip this gene use None as this value (only if use_custom_db is True).

env¶

string
Default: None

Path to a custom fasta file of env amino acid sequences. To skip this gene use None as this value (only if use_custom_db is True).
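
The sketch below shows one way the database settings could be interpreted, assuming a [database] section in pipeline.ini. The function custom_queries and the fasta paths are hypothetical; the pipeline's real logic may differ.

```python
import configparser

# Example settings: custom gag and env databases, pol skipped.
# The paths are placeholders, not files shipped with the pipeline.
EXAMPLE_INI = """
[database]
use_custom_db = True
gag = /path/to/my_gag.fasta
pol = None
env = /path/to/my_env.fasta
"""


def custom_queries(config):
    """Return {gene: path} for genes to screen with a custom database.

    Returns None when use_custom_db is False, meaning the default
    database provided with the pipeline should be used instead.
    """
    if not config.getboolean("database", "use_custom_db", fallback=False):
        return None
    queries = {}
    for gene in ("gag", "pol", "env"):
        path = config.get("database", gene, fallback="None")
        if path != "None":  # the literal value None means: skip this gene
            queries[gene] = path
    return queries


config = configparser.ConfigParser()
config.read_string(EXAMPLE_INI)
print(custom_queries(config))
```

Here pol is left out of the returned dictionary because its value is None, so only gag and env would be screened.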

exonerate¶

Exonerate parameters

min_hit_length¶

integer
Default: 100

Minimum Exonerate hit length on the chromosome. Shorter hits are filtered out.

overlap¶

integer
Default: 30

Maximum distance between chromosome regions identified with Exonerate for them to be merged into a single region.
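
To make the effect of overlap concrete, here is a minimal interval-merging sketch. It assumes regions are (start, end) pairs on a single chromosome and that a gap of exactly overlap nucleotides still counts as close enough to merge; neither detail is confirmed by the text above, and merge_regions is a hypothetical name.

```python
def merge_regions(regions, overlap=30):
    """Merge (start, end) regions lying within `overlap` of each other."""
    merged = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] <= overlap:
            # Close enough to the previous region: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


hits = [(100, 400), (420, 700), (900, 1200), (650, 720)]
print(merge_regions(hits))  # [(100, 720), (900, 1200)]
```

The first three sorted hits are separated by at most 20 nucleotides, so they collapse into one region, while the hit starting at 900 is 180 nucleotides away and stays separate.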

min_score¶

integer
Default: 100

Minimum score in the second Exonerate pass (with the ungapped algorithm).

usearch¶

USEARCH parameters

min_id¶

float
Default: 0.5

Minimum identity used by the UBLAST algorithm, given as a proportion between 0 and 1. Used to set the -id UBLAST parameter.

min_hit_length¶

integer
Default: 100

Minimum hit length for UBLAST. Used to set the -mincols UBLAST parameter.

min_coverage¶

float
Default: 0.5

Minimum proportion of the query sequence which should be covered using UBLAST. Used to set the -query_cov UBLAST parameter.
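
The mapping from these three parameters to UBLAST options can be sketched as below. Only the -id, -mincols and -query_cov options named above are built; the rest of the USEARCH command line (input, database and output options) is deliberately left out. The function name ublast_filter_options and the range checks are assumptions made for illustration.

```python
def ublast_filter_options(min_id=0.5, min_hit_length=100, min_coverage=0.5):
    """Return the UBLAST options set by the usearch parameters."""
    # min_id and min_coverage are proportions, so they should lie in [0, 1].
    if not 0 <= min_id <= 1:
        raise ValueError("min_id should be between 0 and 1")
    if not 0 <= min_coverage <= 1:
        raise ValueError("min_coverage should be between 0 and 1")
    return [
        "-id", str(min_id),
        "-mincols", str(min_hit_length),
        "-query_cov", str(min_coverage),
    ]


print(" ".join(ublast_filter_options()))  # -id 0.5 -mincols 100 -query_cov 0.5
```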

plots¶

Plotting parameters

dpi¶

integer
Default: 300

Dots per inch for output plots (from the summarise functions).

format¶

string png, svg, pdf or jpg

File format for summary plot files. Can be svg, png, pdf or jpg.

gag_colour¶

string (hex colour code with #)
Default: #f38918

Colour for gag gene in summary plots. Default is orange.

pol_colour¶

string (hex colour code with #)
Default: #4876f9

Colour for pol gene in summary plots. Default is blue.

env_colour¶

string (hex colour code with #)
Default: #d61f54

Colour for env gene in summary plots. Default is pink.

other_colour¶

string (hex colour code with #)
Default: #33b54d

Colour for anything which doesn’t relate to a specific gene in summary plots. Default is green.

match_axes¶

string True or False
Default: False

If True, when gag, pol and env are shown as subplots on the same figure they will all have the same axis limits. This makes the subplots directly comparable, although some of them may then appear very small.
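
Because the colour parameters must be hex codes including the leading #, a quick check such as the following can catch typos. It assumes a [plots] section and six-digit hex codes; invalid_colours is a hypothetical helper. With the default settings of Python's configparser, # only starts a comment at the beginning of a line, so a value such as #f38918 is read intact.

```python
import configparser
import re

# Assumed rule: '#' followed by exactly six hexadecimal digits.
HEX_COLOUR = re.compile(r"#[0-9a-fA-F]{6}")

EXAMPLE_INI = """
[plots]
gag_colour = #f38918
pol_colour = #4876f9
env_colour = #d61f54
other_colour = 33b54d
"""


def invalid_colours(config, section="plots"):
    """Return the names of *_colour options that are not valid hex codes."""
    return sorted(
        name
        for name, value in config.items(section)
        if name.endswith("_colour") and not HEX_COLOUR.fullmatch(value)
    )


config = configparser.ConfigParser()
config.read_string(EXAMPLE_INI)
print(invalid_colours(config))  # ['other_colour'] (missing its leading '#')
```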

ORFs¶

Parameters for ORF identification

min_orf_len¶

integer
Default: 100

Minimum length of ORFs to characterise.

translation_table¶

integer
Default: 1

Translation table to use when identifying ORFs. The available tables are listed at https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi. Table 1 (the default, plus alternative initiation codons) is usually suitable.

trees¶

Phylogenetic tree parameters

use_gene_colour¶

string True or False
Default: True

Use the colours specified in the plots section for each gene to highlight newly identified ERVs. If this is True, highlightcolour will be ignored and the gene colours will be used instead. If it is False, highlightcolour is used to highlight newly identified ERVs.

maincolour¶

string (hex colour code with #)
Default: #000000

Main colour for text in tree images. Default is black.

highlightcolour¶

string (hex colour code with #)
Default: #382bfe

Text colour for leaves highlighted in tree images - newly identified ERV-like regions. This is ignored if use_gene_colour above is True. Default is blue.

outgroupcolour¶

string (hex colour code with #)
Default: #0e954b

Text colour for outgroup in tree images. Default is green.

dpi¶

integer
Default: 300

Dots per inch for phylogenetic tree images.

format¶

string png, svg, pdf or jpg
Default: png

File format for phylogenetic tree images. Can be svg, png, pdf or jpg.

regions¶

Parameters for defining regions with ORFs from more than one retroviral gene

maxdist¶

integer
Default: 3000

Maximum distance (in nucleotides) between ORFs to be defined as part of the same ERV region.

maxoverlap¶

float
Default: 0.5

Maximum proportion of an ORF which can overlap with an ORF from another gene before they are filtered out.
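
The maxoverlap threshold can be illustrated with a small calculation. The sketch assumes ORFs are given as (start, end) coordinates on the same sequence with end greater than start, and that an ORF is flagged when the overlapping proportion is strictly greater than maxoverlap. Both function names are hypothetical, and the pipeline's exact rule may differ.

```python
def overlap_fraction(orf, other):
    """Proportion of `orf` (start, end) that is covered by `other`."""
    start, end = orf
    shared = min(end, other[1]) - max(start, other[0])
    return max(shared, 0) / (end - start)


def exceeds_maxoverlap(orf, other, maxoverlap=0.5):
    """True if more than `maxoverlap` of `orf` overlaps with `other`."""
    return overlap_fraction(orf, other) > maxoverlap


gag = (1000, 2000)
pol = (1200, 3000)   # covers 800 of the 1000 nucleotides of gag
env = (5000, 6500)   # does not overlap gag at all

print(exceeds_maxoverlap(gag, pol))  # True
print(exceeds_maxoverlap(gag, env))  # False
```

Note that the proportion is measured relative to the first ORF, so the result can differ depending on which of the two ORFs is being tested.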