Minor Output Files¶
1. Screen
2. Classify
3. ERVRegions
Screen¶
initiate¶
init.txt
Placeholder file to show initial checks have been run.
genomeToChroms¶
host_chromosomes.dir/*fasta
FASTA format files containing the input genome (or other sequence), divided into regions based on the genomesplits parameters.
prepDBs¶
gene_databases.dir/GENE.fasta
Copies of the fasta files of reference retroviral amino acid sequences.
runExonerate¶
raw_exonerate_output.dir/GENE_*.tsv
Raw “vulgar” output of Exonerate, as described in the Exonerate documentation.
cleanExonerate¶
clean_exonerate_output.dir/GENE_*_unfiltered.tsv
Raw exonerate output converted into a table.
Columns:
query_id
Reference amino acid sequence ID for retroviral gene.
query_start
Start position of match within reference sequence.
query_end
End position of match within reference sequence.
query_strand
Strand of match relative to reference sequence.
target_id
Nucleotide sequence from input which matched the retroviral gene.
target_start
Start position of match within input sequence.
target_end
End position of match within input sequence.
target_strand
Strand of match relative to input sequence.
score
Exonerate score of match.
details
Additional columns (columns 11+) from Exonerate vulgar output, delimited by “|”.
length
Length of match on input sequence.
clean_exonerate_output.dir/GENE_*_filtered.tsv
Output of Exonerate filtered to remove regions containing introns and regions which are shorter than exonerate_min_hit_length. Columns are the same as in the unfiltered table.
clean_exonerate_output.dir/GENE_*.bed
ERV-like regions from the table above in BED format.
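The filtering that produces GENE_*_filtered.tsv can be sketched roughly as follows. This is a minimal illustration, not the pipeline's own code. It assumes the table is tab-delimited with the columns in the order listed above, that the details column holds vulgar (label, query length, target length) triples starting at its first field, and that a hit is kept when its length is at least exonerate_min_hit_length. The function names and example rows are hypothetical.

```python
import csv
import io

# Column order as listed for GENE_*_unfiltered.tsv (assumed to match the file).
COLUMNS = ["query_id", "query_start", "query_end", "query_strand",
           "target_id", "target_start", "target_end", "target_strand",
           "score", "details", "length"]


def has_intron(details):
    """Return True if any vulgar label in the "|"-delimited details is "I" (intron)."""
    tokens = details.split("|")
    # Assumption: tokens come in (label, query_length, target_length) triples,
    # so every third token starting from the first is a label.
    return "I" in tokens[0::3]


def filter_hits(handle, min_hit_length):
    """Yield rows with no intron and a length of at least min_hit_length."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        if row["query_id"] == "query_id":
            continue  # skip a header row if one is present
        if has_intron(row["details"]):
            continue
        if int(row["length"]) >= min_hit_length:
            yield row


# Hypothetical example rows, not real pipeline output.
example = io.StringIO(
    "gag_ref1\t0\t100\t.\tchr1\t500\t800\t+\t420\tM|100|300\t300\n"
    "gag_ref2\t0\t50\t.\tchr1\t900\t1500\t+\t200\tM|25|75|I|0|450|M|25|75\t600\n"
    "gag_ref3\t0\t20\t.\tchr2\t10\t70\t-\t90\tM|20|60\t60\n"
)
kept = list(filter_hits(example, min_hit_length=100))
print([row["query_id"] for row in kept])
```

Only the first example row survives: the second contains an intron label and the third is shorter than the threshold.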
mergeOverlaps¶
gene_bed_files.dir/GENE_all.bed
ERV-like regions from clean_exonerate_output.dir/GENE_*.bed combined into a single bed file.
gene_bed_files.dir/GENE_merged.bed
ERV-like regions from the previous file merged using Bedtools merge
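To illustrate what merging does to the regions in GENE_all.bed, the sketch below merges overlapping or directly adjacent intervals, which is the default behaviour of Bedtools merge. It is a plain Python illustration, not a call to Bedtools, and it ignores strand and any extra options the pipeline may pass. The function name and example co-ordinates are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or directly adjacent (chrom, start, end) intervals."""
    merged = []
    # Sorting puts intervals on the same sequence next to each other, by start.
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            last = merged[-1]
            # Extend the previous interval rather than starting a new one.
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical regions, not real pipeline output.
regions = [
    ("chr1", 100, 250),
    ("chr1", 200, 400),
    ("chr1", 400, 450),
    ("chr2", 50, 90),
    ("chr1", 900, 950),
]
for chrom, start, end in merge_intervals(regions):
    print(chrom, start, end, sep="\t")
```

The first three chr1 intervals collapse into a single region, while the isolated chr1 and chr2 intervals are left unchanged.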
makeFastas¶
gene_fasta_files.dir/GENE_merged.fasta
Fasta files of the merged regions in gene_bed_files.dir/GENE_merged.bed
renameFastas¶
gene_fasta_files.dir/GENE_merged_renamed.fasta
Fasta files from the makeFastas step, with the ERV-like regions renamed with unique IDs.
makeUBLASTDb¶
UBLAST_db.dir/GENE_db.udb
UBLAST databases for the reference retroviral amino acid sequences.
runUBLASTCheck¶
ublast.dir/GENE_UBLAST_alignments.txt
Raw UBLAST output files for the Exonerate regions vs the retroviral amino acid databases. Equivalent to the BLAST pairwise output.
ublast.dir/GENE_UBLAST.tsv
UBLAST tabular output for Exonerate regions vs the retroviral amino acid databases. Equivalent to the BLAST tabular output.
ublast.dir/GENE_filtered_UBLAST.fasta
Fasta file of the regions which passed the UBLAST filter.
classifyWithExonerate¶
exonerate_classification.dir/GENE_all_matches_exonerate.tsv
Raw output of ungapped Exonerate algorithm for the UBLAST verified regions against the ERV amino acid database.
Columns:
ID of newly identified ERV-like region
ID of reference ERV amino acid
Exonerate score
exonerate_classification.dir/GENE_best_matches_exonerate.tsv
The previous table filtered to list only the highest scoring hit for each ERV-like region.
exonerate_classification.dir/GENE_refiltered_matches_exonerate.fasta
Highest scoring hits from the previous table in FASTA format.
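The step from GENE_all_matches_exonerate.tsv to GENE_best_matches_exonerate.tsv amounts to keeping the top-scoring reference for each ERV-like region. A minimal sketch is shown below, assuming the three columns listed above have already been parsed into tuples with an integer score. How the pipeline breaks ties is not stated here, so keeping the first hit seen is an arbitrary choice. All IDs are hypothetical.

```python
def best_matches(rows):
    """Return {region_id: (reference_id, score)}, keeping the top score per region."""
    best = {}
    for region_id, reference_id, score in rows:
        # Strict ">" keeps the first hit seen when scores tie (an arbitrary choice).
        if region_id not in best or score > best[region_id][1]:
            best[region_id] = (reference_id, score)
    return best


# Hypothetical rows: (ERV-like region ID, reference ID, Exonerate score).
all_matches = [
    ("gag_region_1", "ref_A", 310),
    ("gag_region_1", "ref_B", 455),
    ("gag_region_2", "ref_A", 120),
    ("gag_region_1", "ref_C", 455),
]
for region_id, (reference_id, score) in best_matches(all_matches).items():
    print(region_id, reference_id, score, sep="\t")
```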
getORFs¶
ORFs.dir/GENE_orfs_raw.fasta
Raw output of running EMBOSS transeq -frame 6 on the regions output from classifyWithExonerate.
ORFs.dir/GENE_orfs_nt.fasta
ORFs longer than ORFs_min_orf_length as nucleotide sequences, with IDs redefined to include the chromosome, start and end position and strand of the ORF.
ORFs.dir/GENE_orfs_aa.fasta
ORFs longer than ORFs_min_orf_length as amino acid sequences, with IDs redefined to include the chromosome, start and end position and strand of the ORF.
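As an illustration of the length filter, the sketch below takes one translated frame, in which stop codons appear as `*`, splits it into stop-free stretches and keeps those that reach a minimum length. It rests on several assumptions that should be checked against the pipeline: that an ORF is any stop-free stretch (no start codon required), that ORFs_min_orf_length is measured in amino acids, and that the threshold is inclusive. The function name and example sequence are hypothetical.

```python
def find_orfs(protein, min_length):
    """Return (start, end, sequence) for stop-free stretches of at least min_length residues.

    Positions are 0-based, end-exclusive offsets within the translated frame.
    """
    orfs = []
    start = 0
    for segment in protein.split("*"):
        end = start + len(segment)
        if len(segment) >= min_length:
            orfs.append((start, end, segment))
        start = end + 1  # step over the stop codon that ended this segment
    return orfs


# Hypothetical translated frame, not real pipeline output.
frame = "MKT*AGLLPRSTVW*QQ*MNPLKRRSA"
for start, end, sequence in find_orfs(frame, min_length=8):
    print(start, end, sequence, sep="\t")
```

Converting these amino acid offsets back to chromosome co-ordinates and strand, as used in the ORF IDs, depends on the frame and the position of the parent region and is not shown.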
checkORFsUBLAST¶
ublast_orfs.dir/GENE_UBLAST_alignments.txt
Raw UBLAST output files for the newly identified ORFs vs the retroviral amino acid databases. Equivalent to the BLAST pairwise output.
ublast_orfs.dir/GENE_UBLAST.tsv
UBLAST tabular output for newly identified ORFs vs the retroviral amino acid databases. Equivalent to the BLAST tabular output.
ublast_orfs.dir/GENE_filtered_UBLAST.fasta
Fasta file of the newly identified ORFs which passed the UBLAST filter.
assignGroups¶
grouped.dir/GENE_groups.tsv
Summarised output for the previous steps. This is identical to screen_results.dir/results.tsv.
summariseScreen¶
FMT can be png, svg, pdf or jpg, depending on the plot_format parameter.
The major outputs of this function are stored in the screen_results.dir directory. Further details of these files are provided in the Main Output Files section.
The other files show the output of the intermediate steps.
Exonerate Initial
summary_tables.dir/exonerate_initial_summary.txt
Summary of the output of the initial Exonerate screening step. Note that these are unfiltered and many will not be true ERVs.
summary_tables.dir/ublast_hits_initial_summary.txt
Summary of the results of running UBLAST on the initial Exonerate output.
summary_tables.dir/orfs_initial_summary.txt
Summary of the results of the initial ORF identification.
summary_tables.dir/ublast_orfs_initial_summary.txt
Summary of the results of running UBLAST on these ORFs.
summary_plots.dir/exonerate_initial_lengths.FMT
Histogram showing the lengths of the initial Exonerate regions for each gene.
summary_plots.dir/exonerate_initial_scores.FMT
Histogram showing the Exonerate score of the initial Exonerate regions for each gene.
summary_plots.dir/exonerate_initial_strands.FMT
Bar chart showing the number of regions identified on each strand in the initial Exonerate screen.
summary_plots.dir/exonerate_initial_by_sequence.FMT
Histogram showing the number of ERV-like regions identified on each sequence in the reference genome being screened.
summary_plots.dir/exonerate_initial_counts_per_gene.FMT
Bar chart showing the number of ERV regions identified per gene in the initial Exonerate screen.
UBLAST
summary_plots.dir/ublast_hits_alignment_length.FMT
Histogram showing the lengths of the alignments of the UBLAST filtered Exonerate regions and the most similar reference ORF, based on the UBLAST output.
summary_plots.dir/ublast_hits_perc_similarity.FMT
Histogram showing the percentage identity between the UBLAST filtered Exonerate regions and the most similar reference ORF, based on the UBLAST output.
summary_plots.dir/ublast_hits_bit_score.FMT
Histogram showing the UBLAST bit score between the UBLAST filtered Exonerate regions and the most similar reference ORF, based on the UBLAST output.
summary_plots.dir/ublast_hits_by_match.FMT
Bar chart showing the number of UBLAST filtered Exonerate regions most similar to each reference ORF in the ERVsearch/ERV_db database.
summary_plots.dir/ublast_hits_per_gene.FMT
Bar chart showing the number of UBLAST filtered Exonerate regions identified per gene.
ORFs
summary_plots.dir/orfs_lengths.FMT
Histogram of the lengths of ORFs identified in the ERV regions.
summary_plots.dir/orfs_strands.FMT
Bar chart of the strand (positive (+) or negative (-) sense) of the ORFs identified in the ERV regions.
summary_plots.dir/orfs_by_gene.FMT
Bar chart of the number of ORFs identified for each gene.
UBLAST ORFs
summary_plots.dir/ublast_orfs_alignment_length.FMT
Histogram showing the lengths of the alignments of the ERV-like ORFs and the most similar reference ORF, based on the UBLAST output.
summary_plots.dir/ublast_orfs_perc_similarity.FMT
Histogram showing the percentage identity between the ERV-like ORFs and the most similar reference ORF, based on the UBLAST output.
summary_plots.dir/ublast_orfs_bit_score.FMT
Histogram showing the UBLAST bit score between the ERV-like ORFs and the most similar reference ORF, based on the UBLAST output.
summary_plots.dir/ublast_orfs_by_match.FMT
Bar chart showing the number of ERV-like ORFs most similar to each reference ORF in the ERVsearch/ERV_db database.
summary_plots.dir/ublast_orfs_per_gene.FMT
Bar chart showing the number of ERV-like ORFs identified per gene.
Classify¶
makeGroupFastas¶
group_fastas.dir/GENE_(.*)_GENUS.fasta
Fasta files for each small subgroup of ERV-like ORFs and reference sequences.
group_fastas.dir/GENE_(.*)_GENUS_A.fasta
Aligned version of the above Fasta file generated using MAFFT
makeGroupTrees¶
group_trees.dir/GENE_(.*)_GENUS.tre
Phylogenetic trees in Newick format for each small subgroup of ERV-like ORFs and reference sequences generated using FastTree
drawGroupTrees¶
group_trees.dir/GENE_(.*)_GENUS.FMT
(FMT = png, svg, pdf or jpg)
Image files of the phylogenetic trees for each small subgroup of ERV-like ORFs and reference sequences, with newly identified sequences highlighted.
makeSummaryFastas¶
summary_fastas.dir/GENE_GENUS.fasta
Fasta files for each retroviral gene and genus combining reference and newly identified sequences. Monophyletic groups of newly identified ORF sequences are represented by a single sequence.
group_lists.dir/*tsv
Lists of sequences in each monophyletic group of newly identified ORFs which has been collated to be represented by a single sequence.
makeSummaryTrees¶
summary_trees.dir/GENE_GENUS.tre
Tree files in Newick format for each retroviral gene and genus combining reference and newly identified sequences. Monophyletic groups of newly identified ORF sequences are represented by a single sequence.
drawSummaryTrees¶
summary_trees.dir/GENE_GENUS.FMT
(FMT = png, svg, pdf or jpg)
Image files of the phylogenetic trees for each retroviral gene and genus. Different sized circles are used to show the relative size of collapsed monophyletic groups. Newly identified ERVs are highlighted.
summariseClassify¶
classify_results.dir/results.tsv
Table listing the number of genes which have been collapsed into each monophyletic group in the trees in the summary_trees.dir directory.
Columns:
gene Retroviral gene for this group
genus Retroviral genus for this group
group Group ID
count Number of sequences in this group
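One possible way to work with this table is to total the count column for each gene and genus pair. The sketch below does this with the standard library only. It assumes the file is tab-delimited with a header row using the column names listed above, which should be checked against a real results.tsv. The function name, genus labels and group IDs in the example are hypothetical.

```python
import csv
import io
from collections import Counter


def counts_by_gene_genus(handle):
    """Sum the count column for every (gene, genus) pair in the table."""
    reader = csv.DictReader(handle, delimiter="\t")
    totals = Counter()
    for row in reader:
        totals[(row["gene"], row["genus"])] += int(row["count"])
    return totals


# Hypothetical table contents, not real pipeline output.
example = io.StringIO(
    "gene\tgenus\tgroup\tcount\n"
    "gag\tgamma\tgroup_1\t12\n"
    "gag\tgamma\tgroup_2\t3\n"
    "pol\tbeta\tgroup_1\t7\n"
)
totals = counts_by_gene_genus(example)
for (gene, genus), total in sorted(totals.items()):
    print(gene, genus, total, sep="\t")
```

With a real file, replace the `io.StringIO` object with `open("classify_results.dir/results.tsv", newline="")`.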
classify_results.dir/by_gene_genus.png
Bar chart showing the number of genes which have been collapsed into each monophyletic group, organised by gene and genus.
ERVRegions¶
makeCleanBeds¶
clean_beds.dir/GENE.bed
Bed files containing the co-ordinates of all the ERV-like ORFs output by the Screen section.
makeCleanFastas¶
clean_fastas.dir/GENE.fasta
FASTA files of the ERV-like ORFs output by the Screen section.
findERVRegions¶
ERV_regions.dir/all_ORFs.bed
Concatenated version of the bed files output by the makeCleanBeds function with all three genes in a single file.
ERV_regions.dir/all_regions.bed
Previous bed file merged using Bedtools merge, with regions within regions_maxdist merged.
ERV_regions.dir/multi_gene_regions.bed
Filtered version of the previous bed file with only regions which were merged.
ERV_regions.dir/regions.fasta
Fasta file of the merged regions.
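The relationship between all_ORFs.bed, all_regions.bed and multi_gene_regions.bed can be sketched as a distance-tolerant merge that records which genes fall in each region. This is an illustration in plain Python, not the pipeline's Bedtools call. It assumes that ORFs are merged when the gap between them is no larger than regions_maxdist, and that "regions which were merged" means regions containing ORFs from more than one gene. The function names, gene labels and co-ordinates are hypothetical.

```python
def merge_with_genes(orfs, max_dist):
    """Merge (chrom, start, end, gene) ORFs separated by at most max_dist bases."""
    merged = []
    for chrom, start, end, gene in sorted(orfs):
        last = merged[-1] if merged else None
        if last and last["chrom"] == chrom and start - last["end"] <= max_dist:
            # Close enough to the previous region: extend it and record the gene.
            last["end"] = max(last["end"], end)
            last["genes"].add(gene)
        else:
            merged.append({"chrom": chrom, "start": start, "end": end, "genes": {gene}})
    return merged


def multi_gene_regions(regions):
    """Keep only regions containing ORFs from more than one gene (assumed meaning)."""
    return [region for region in regions if len(region["genes"]) > 1]


# Hypothetical ORFs, not real pipeline output.
orfs = [
    ("chr1", 1000, 2500, "gag"),
    ("chr1", 2600, 5000, "pol"),
    ("chr1", 5050, 6500, "env"),
    ("chr2", 300, 1200, "pol"),
    ("chr1", 90000, 91000, "gag"),
]
all_regions = merge_with_genes(orfs, max_dist=500)
for region in multi_gene_regions(all_regions):
    print(region["chrom"], region["start"], region["end"], sorted(region["genes"]))
```

In this example the three neighbouring chr1 ORFs form one multi-gene region, and the two isolated single-gene ORFs are dropped by the filter.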
makeRegionTables¶
ERV_regions.dir/ERV_regions_final.tsv
Table showing the details of the combined regions containing ORFs resembling more than one retroviral gene. This table is identical to erv_region_results.dir/results.tsv.
ERV_regions.dir/ERV_regions_final.bed
Bed file with the co-ordinates of the identified regions.
ERV_regions.dir/ERV_regions_final.fasta
FASTA file of the regions in the bed file above.
plotERVRegions¶
ERV_region_plots.dir/*.FMT
Plots showing the distributions of ORFs resembling each retroviral gene on the genome. Each gene is shown on a different line on the y axis, the x axis is chromosome co-ordinates. One plot is generated for each multi-gene region.
summariseERVRegions¶
erv_region_results.dir/results.tsv
Table showing the overall results for regions with ORFs resembling multiple retroviral genes. This table is described in full in the Main Output Files section.
erv_region_results.dir/erv_regions.png
Bar chart showing the number of ERV regions identified with each combination of retroviral genes.