Minor Output Files

1. Screen
2. Classify
3. ERVRegions

Screen

  1. initiate

  2. genomeToChroms

  3. prepDBs

  4. runExonerate

  5. cleanExonerate

  6. mergeOverlaps

  7. makeFastas

  8. renameFastas

  9. makeUBLASTDb

  10. runUBLASTCheck

  11. classifyWithExonerate

  12. getORFs

  13. checkORFsUBLAST

  14. assignGroups

  15. summariseScreen

  16. Screen

initiate

init.txt
Placeholder file to show initial checks have been run.

genomeToChroms

host_chromosomes.dir/*fasta
FASTA format files containing the input genome (or other sequence), divided into regions based on the genomesplits parameters.

prepDBs

gene_databases.dir/GENE.fasta
Copies of the fasta files of reference retroviral amino acid sequences.

runExonerate

raw_exonerate_output.dir/GENE_*.tsv
Raw “vulgar” output of Exonerate, as described in the Exonerate documentation.

cleanExonerate

clean_exonerate_output.dir/GENE_*_unfiltered.tsv
Raw Exonerate output converted into a table.
Columns:

  • query_id
    Reference amino acid sequence ID for retroviral gene

  • query_start
    Start position of match within reference sequence

  • query_end
    End position of match within reference sequence

  • query_strand
    Strand of match relative to reference sequence

  • target_id
    Nucleotide sequence from input which matched the retroviral gene.

  • target_start
    Start position of match within input sequence.

  • target_end
    End position of match within input sequence.

  • target_strand
    Strand of match relative to input sequence

  • score
    Exonerate score of match

  • details
    Additional columns (columns 11+) from Exonerate vulgar output, delimited by “|”.

  • length
    Length of match on input sequence

clean_exonerate_output.dir/GENE_*_filtered.tsv
Output of Exonerate filtered to remove regions containing introns and regions which are shorter than exonerate_min_hit_length. Columns are the same as in the unfiltered table.
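
As a rough illustration of this filtering step, the sketch below reads rows with the columns listed above and drops hits that contain an intron or fall below a minimum length. It assumes the table has no header row, that introns appear as the label "I" in the "|"-delimited details column (the standard vulgar label for an intron), and that the threshold is compared against the length column; the function name filter_hits and the example rows are hypothetical, not taken from ERVsearch.

```python
import csv
import io

# Column order as described for the cleaned Exonerate table.
COLUMNS = ["query_id", "query_start", "query_end", "query_strand",
           "target_id", "target_start", "target_end", "target_strand",
           "score", "details", "length"]


def filter_hits(rows, min_hit_length):
    """Keep hits with no intron label that are at least min_hit_length long."""
    kept = []
    for row in rows:
        labels = row["details"].split("|")
        has_intron = "I" in labels  # assumption: "I" marks an intron
        if not has_intron and int(row["length"]) >= min_hit_length:
            kept.append(row)
    return kept


# Invented example rows (tab-delimited, no header).
example = (
    "gag1\t0\t100\t.\tchr1\t500\t800\t+\t400\tM|100|300\t300\n"
    "gag1\t0\t100\t.\tchr1\t900\t1500\t+\t350\tM|50|150|I|0|300|M|50|150\t600\n"
    "gag2\t0\t20\t.\tchr2\t10\t70\t-\t90\tM|20|60\t60\n"
)
rows = list(csv.DictReader(io.StringIO(example),
                           fieldnames=COLUMNS, delimiter="\t"))
filtered = filter_hits(rows, min_hit_length=100)
print([r["target_start"] for r in filtered])  # ['500']
```

Only the first row survives: the second contains an intron label and the third is shorter than the threshold.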

clean_exonerate_output.dir/GENE_*.bed
ERV-like regions from the table above in BED format.

mergeOverlaps

gene_bed_files.dir/GENE_all.bed
ERV-like regions from clean_exonerate_output.dir/GENE_*.bed combined into a single bed file.

gene_bed_files.dir/GENE_merged.bed
ERV-like regions from the previous file merged using Bedtools merge
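
To show what merging does to the combined regions, here is a minimal pure-Python sketch of interval merging in the spirit of Bedtools merge with its default settings (overlapping and directly adjacent intervals on the same sequence are combined, strand is ignored). The pipeline itself uses Bedtools; the function merge_intervals and the co-ordinates are hypothetical and for illustration only.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent (chrom, start, end) intervals."""
    merged = []
    for chrom, start, end in sorted(intervals):
        # Extend the previous interval if this one starts inside or at its end.
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(m) for m in merged]


regions = [("chr1", 150, 300), ("chr1", 100, 200),
           ("chr1", 400, 500), ("chr2", 100, 200)]
print(merge_intervals(regions))
# [('chr1', 100, 300), ('chr1', 400, 500), ('chr2', 100, 200)]
```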

makeFastas

gene_fasta_files.dir/GENE_merged.fasta
Fasta files of the merged regions in gene_bed_files.dir/GENE_merged.bed

renameFastas

gene_fasta_files.dir/GENE_merged_renamed.fasta
Fasta files from the makeFastas step with the ERV-like regions renamed with unique IDs.

makeUBLASTDb

UBLAST_db.dir/GENE_db.udb
UBLAST databases for the reference retroviral amino acid sequences.

runUBLASTCheck

ublast.dir/GENE_UBLAST_alignments.txt
Raw UBLAST output files for the Exonerate regions vs the retroviral amino acid databases. Equivalent to the BLAST pairwise output.

ublast.dir/GENE_UBLAST.tsv
UBLAST tabular output for Exonerate regions vs the retroviral amino acid databases. Equivalent to the BLAST tabular output.

ublast.dir/GENE_filtered_UBLAST.fasta
Fasta file of the regions which passed the UBLAST filter.

classifyWithExonerate

exonerate_classification.dir/GENE_all_matches_exonerate.tsv
Raw output of ungapped Exonerate algorithm for the UBLAST verified regions against the ERV amino acid database.
Columns:

  • ID of newly identified ERV-like region

  • ID of reference ERV amino acid

  • Exonerate score

exonerate_classification.dir/GENE_best_matches_exonerate.tsv
The previous table filtered to list only the highest scoring hit for each ERV-like region.
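
Selecting the highest scoring hit per region can be sketched as follows. This is an illustrative reimplementation, not the pipeline's own code: it assumes each row is a (region ID, reference ID, score) triple in the column order given above, and it keeps the first hit seen when scores tie. The name best_matches and the IDs are hypothetical.

```python
def best_matches(rows):
    """Return {region_id: (reference_id, score)} with the top score per region."""
    best = {}
    for region_id, ref_id, score in rows:
        # Strict ">" keeps the earliest hit when two scores are equal.
        if region_id not in best or score > best[region_id][1]:
            best[region_id] = (ref_id, score)
    return best


hits = [
    ("region_1", "ref_A", 120),
    ("region_1", "ref_B", 340),
    ("region_2", "ref_A", 95),
    ("region_1", "ref_C", 200),
]
print(best_matches(hits))
# {'region_1': ('ref_B', 340), 'region_2': ('ref_A', 95)}
```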

exonerate_classification.dir/GENE_refiltered_matches_exonerate.fasta
Highest scoring hits from the previous table in FASTA format.

getORFs

ORFs.dir/GENE_orfs_raw.fasta
Raw output of running EMBOSS transeq -frame 6 on the regions output from classifyWithExonerate.

ORFs.dir/GENE_orfs_nt.fasta
ORFs longer than ORFs_min_orf_length as nucleotide sequences, with IDs redefined to include the chromosome, start and end position and strand of the ORF.

ORFs.dir/GENE_orfs_aa.fasta
ORFs longer than ORFs_min_orf_length as amino acid sequences, with IDs redefined to include the chromosome, start and end position and strand of the ORF.
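
The length filter on ORFs can be pictured with the sketch below, which splits one translated frame at stop codons ("*", as written by transeq) and keeps the stretches that reach a minimum length. Two assumptions are made here that the text above does not confirm: that an ORF is taken as the stretch between two stops, and that the threshold is counted in amino acids. The function find_orfs and the example sequence are hypothetical.

```python
def find_orfs(protein, min_length):
    """Return (start, end, sequence) for stop-free stretches >= min_length.

    Positions are 0-based amino acid offsets within the translated frame,
    with the end position exclusive.
    """
    orfs = []
    start = 0
    for segment in protein.split("*"):
        if len(segment) >= min_length:
            orfs.append((start, start + len(segment), segment))
        start += len(segment) + 1  # skip over the stop codon
    return orfs


frame = "MKV*AAAAAA*LL"
print(find_orfs(frame, 3))
# [(0, 3, 'MKV'), (4, 10, 'AAAAAA')]
```

For a forward frame with offset f (0, 1 or 2), an amino acid position p corresponds to nucleotide position f + 3 * p, which is how co-ordinates like those stored in the ORF IDs could be recovered.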

checkORFsUBLAST

ublast_orfs.dir/GENE_UBLAST_alignments.txt
Raw UBLAST output files for the newly identified ORFs vs the retroviral amino acid databases. Equivalent to the BLAST pairwise output.

ublast_orfs.dir/GENE_UBLAST.tsv
UBLAST tabular output for newly identified ORFs vs the retroviral amino acid databases. Equivalent to the BLAST tabular output.

ublast_orfs.dir/GENE_filtered_UBLAST.fasta
Fasta file of the newly identified ORFs which passed the UBLAST filter.

assignGroups

grouped.dir/GENE_groups.tsv
Summarised output for the previous steps. This is identical to screen_results.dir/results.tsv.

summariseScreen

FMT can be png, svg, pdf or jpg, depending on the plot_format parameter.

The major outputs of this function are stored in the screen_results.dir directory. Further details of these files are provided in the Main Output Files section.

The other files show the output of the intermediate steps.

Exonerate Initial

  • summary_tables.dir/exonerate_initial_summary.txt
    Summary of the output of the initial Exonerate screening step. Note that these are unfiltered and many will not be true ERVs.

  • summary_tables.dir/ublast_hits_initial_summary.txt
    Summary of the results of running UBLAST on the initial Exonerate output.

  • summary_tables.dir/orfs_initial_summary.txt
    Summary of the results of the initial ORF identification.

  • summary_tables.dir/ublast_orfs_initial_summary.txt
    Summary of the results of running UBLAST on these ORFs.

  • summary_plots.dir/exonerate_initial_lengths.FMT
    Histogram showing the lengths of the initial Exonerate regions for each gene.

  • summary_plots.dir/exonerate_initial_scores.FMT
    Histogram showing the Exonerate score of the initial Exonerate regions for each gene.

  • summary_plots.dir/exonerate_initial_strands.FMT
    Bar chart showing the number of regions identified on each strand in the initial Exonerate screen.

  • summary_plots.dir/exonerate_initial_by_sequence.FMT
    Histogram showing the number of ERV-like regions identified on each sequence in the reference genome being screened.

  • summary_plots.dir/exonerate_initial_counts_per_gene.FMT
    Bar chart showing the number of ERV regions identified per gene in the initial Exonerate screen.

UBLAST

  • summary_plots.dir/ublast_hits_alignment_length.FMT
    Histogram showing the lengths of the alignments of the UBLAST filtered Exonerate regions and the most similar reference ORF, based on the UBLAST output.

  • summary_plots.dir/ublast_hits_perc_similarity.FMT
    Histogram showing the percentage identity between the UBLAST filtered Exonerate regions and the most similar reference ORF, based on the UBLAST output.

  • summary_plots.dir/ublast_hits_bit_score.FMT
    Histogram showing the UBLAST bit score between the UBLAST filtered Exonerate regions and the most similar reference ORF, based on the UBLAST output.

  • summary_plots.dir/ublast_hits_by_match.FMT
    Bar chart showing the number of UBLAST filtered Exonerate regions most similar to each reference ORF in the ERVsearch/ERV_db database.

  • summary_plots.dir/ublast_hits_per_gene.FMT
    Bar chart showing the number of UBLAST filtered Exonerate regions identified per gene.

ORFs

  • summary_plots.dir/orfs_lengths.FMT
    Histogram of the lengths of ORFs identified in the ERV regions.

  • summary_plots.dir/orfs_strands.FMT
    Bar chart of the strand (positive (+) or negative (-) sense) of the ORFs identified in the ERV regions.

  • summary_plots.dir/orfs_by_gene.FMT
    Bar chart of the number of ORFs identified for each gene.

UBLAST ORFs

  • summary_plots.dir/ublast_orfs_alignment_length.FMT
    Histogram showing the lengths of the alignments of the ERV-like ORFs and the most similar reference ORF, based on the UBLAST output.

  • summary_plots.dir/ublast_orfs_perc_similarity.FMT
    Histogram showing the percentage identity between the ERV-like ORFs and the most similar reference ORF, based on the UBLAST output.

  • summary_plots.dir/ublast_orfs_bit_score.FMT
    Histogram showing the UBLAST bit score between the ERV-like ORFs and the most similar reference ORF, based on the UBLAST output.

  • summary_plots.dir/ublast_orfs_by_match.FMT
    Bar chart showing the number of ERV-like ORFs most similar to each reference ORF in the ERVsearch/ERV_db database.

  • summary_plots.dir/ublast_orfs_per_gene.FMT
    Bar chart showing the number of ERV-like ORFs identified per gene.

Classify

  1. makeGroupFastas

  2. makeGroupTrees

  3. drawGroupTrees

  4. makeSummaryFastas

  5. makeSummaryTrees

  6. drawSummaryTrees

  7. summariseClassify

  8. Classify

makeGroupFastas

group_fastas.dir/GENE_(.*)_GENUS.fasta
Fasta files for each small subgroup of ERV-like ORFs and reference sequences.

group_fastas.dir/GENE_(.*)_GENUS_A.fasta
Aligned version of the above Fasta file generated using MAFFT

makeGroupTrees

group_trees.dir/GENE_(.*)_GENUS.tre
Phylogenetic trees in Newick format for each small subgroup of ERV-like ORFs and reference sequences generated using FastTree

drawGroupTrees

group_trees.dir/GENE_(.*)_GENUS.FMT (FMT = png, svg, pdf or jpg)
Image files of the phylogenetic trees for each small subgroup of ERV-like ORFs and reference sequences, with newly identified sequences highlighted.

makeSummaryFastas

summary_fastas.dir/GENE_GENUS.fasta
Fasta files for each retroviral gene and genus combining reference and newly identified sequences. Monophyletic groups of newly identified ORF sequences are represented by a single sequence.

group_lists.dir/*tsv
Lists of sequences in each monophyletic group of newly identified ORFs which has been collated to be represented by a single sequence.

makeSummaryTrees

summary_trees.dir/GENE_GENUS.tre
Tree files in Newick format for each retroviral gene and genus combining reference and newly identified sequences. Monophyletic groups of newly identified ORF sequences are represented by a single sequence.

drawSummaryTrees

summary_trees.dir/GENE_GENUS.FMT (FMT = png, svg, pdf or jpg)
Image files of the phylogenetic trees for each retroviral gene and genus. Different sized circles are used to show the relative size of collapsed monophyletic groups. Newly identified ERVs are highlighted.

summariseClassify

classify_results.dir/results.tsv
Table listing the number of genes which have been collapsed into each monophyletic group in the trees in the summary_trees.dir directory.
Columns:

  • gene Retroviral gene for this group

  • genus Retroviral genus for this group

  • group Group ID

  • count Number of sequences in this group

classify_results.dir/by_gene_genus.png
Bar chart showing the number of genes which have been collapsed into each monophyletic group, organised by gene and genus.
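
The kind of per-gene, per-genus totals shown in the bar chart can be computed from classify_results.dir/results.tsv along these lines. The sketch assumes the file is tab-delimited with a header row naming the four columns listed above; the function counts_by_gene_genus and the example values are hypothetical.

```python
import csv
import io
from collections import Counter


def counts_by_gene_genus(handle):
    """Sum the count column for each (gene, genus) pair."""
    totals = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        totals[(row["gene"], row["genus"])] += int(row["count"])
    return totals


# Invented table contents with the documented column names.
example_table = io.StringIO(
    "gene\tgenus\tgroup\tcount\n"
    "gag\tgamma\tgroup_1\t4\n"
    "gag\tgamma\tgroup_2\t1\n"
    "pol\tbeta\tgroup_3\t7\n"
)
totals = counts_by_gene_genus(example_table)
print(totals[("gag", "gamma")])  # 5
```

With a real file, the same function could be called on an open file handle, for example `open("classify_results.dir/results.tsv", newline="")`.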

ERVRegions

  1. makeCleanBeds

  2. makeCleanFastas

  3. findERVRegions

  4. makeRegionTables

  5. summariseERVRegions

  6. plotERVRegions

  7. ERVRegions

  8. Full

makeCleanBeds

clean_beds.dir/GENE.bed
Bed files containing the co-ordinates of all the ERV-like ORFs output by the Screen section.

makeCleanFastas

clean_fastas.dir/GENE.fasta
FASTA files of the ERV-like ORFs output by the Screen section.

findERVRegions

ERV_regions.dir/all_ORFs.bed
Concatenated version of the bed files output by the makeCleanBeds function with all three genes in a single file.

ERV_regions.dir/all_regions.bed
Previous bed file merged using Bedtools merge, with regions within regions_maxdist of each other merged.

ERV_regions.dir/multi_gene_regions.bed
Filtered version of the previous bed file with only regions which were merged.

ERV_regions.dir/regions.fasta
Fasta file of the merged regions.
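
The logic of this step can be sketched in plain Python: ORFs on the same sequence that lie within a maximum distance of each other are grouped into one region, and only regions containing ORFs from more than one gene are kept. This is an illustration rather than the pipeline's implementation, which uses Bedtools; the names find_regions and multi_gene, the gene labels and the co-ordinates are hypothetical, and treating "merged" as "contains more than one gene" is an assumption based on the description of the final region tables.

```python
def find_regions(orfs, max_dist):
    """Group (chrom, start, end, gene) ORFs lying within max_dist of each other."""
    regions = []
    for chrom, start, end, gene in sorted(orfs):
        last = regions[-1] if regions else None
        if last and last["chrom"] == chrom and start <= last["end"] + max_dist:
            # Close enough to the previous region: extend it.
            last["end"] = max(last["end"], end)
            last["genes"].add(gene)
        else:
            regions.append({"chrom": chrom, "start": start,
                            "end": end, "genes": {gene}})
    return regions


def multi_gene(regions):
    """Keep only regions with ORFs resembling more than one gene."""
    return [r for r in regions if len(r["genes"]) > 1]


orfs = [("chr1", 100, 400, "gag"), ("chr1", 450, 900, "pol"),
        ("chr1", 5000, 5300, "env"), ("chr2", 10, 200, "gag")]
all_regions = find_regions(orfs, max_dist=100)
kept = multi_gene(all_regions)
print(len(all_regions), len(kept))  # 3 1
print(kept[0]["start"], kept[0]["end"], sorted(kept[0]["genes"]))
# 100 900 ['gag', 'pol']
```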

makeRegionTables

ERV_regions.dir/ERV_regions_final.tsv
Table showing the details of the combined regions containing ORFs resembling more than one retroviral gene. This table is identical to erv_region_results.dir/results.tsv.

ERV_regions.dir/ERV_regions_final.bed
Bed file with the co-ordinates of the identified regions.

ERV_regions.dir/ERV_regions_final.fasta
FASTA file of the regions in the bed file above.

plotERVRegions

ERV_region_plots.dir/*.FMT
Plots showing the distributions of ORFs resembling each retroviral gene on the genome. Each gene is shown on a different line on the y axis, the x axis is chromosome co-ordinates. One plot is generated for each multi-gene region.

summariseERVRegions

erv_region_results.dir/results.tsv
Table showing the overall results for regions with ORFs resembling multiple retroviral genes. This table is described in full in the Main Output Files section.

erv_region_results.dir/erv_regions.png
Bar chart showing the number of ERV regions identified with each combination of retroviral genes.