This manual provides the usage details for GET_PHYLOMARKERS, a software package primarily designed to select “well-behaved” phylogenetic markers to estimate a maximum likelihoood (ML) species tree from the supermatrix of concatenated, top-scoring alignments. These are identified through a series of sequential filters that operate on orthologous gene/transcript/protein clusters computed by GET_HOMOLOGUES to exclude:
However, GET_PHYLOMARKERS can also estimate maximum likelihood and parsimony trees from pan-genome matrices, as well as computing basic population genetic statistics and neutrality tests, as shown in the associated publication (Vinuesa et. al, 2018) and explained below.
Figure 1 provides a graphical overview of the GET_PHYLOMARKERS pipeline. The Manual will describe in detail each of these steps along with the options available to the user to control the pipeline’s behaviour, the stringency of the filters, as well as the number of substitution models evaluated and tree-search thoroughness. In addition, the script estimate_pangenome_phylogenies.sh can search for ML and parsimony pan-genome phylogenies using the pan-genome matrix computed by compare_clusters.pl from the GET_HOMOLOGUES suite, as shown in the pipeline’s flowchart below.
The pipeline has an associated publication: Pablo Vinuesa, Luz-Edith Ochoa-Sanchez and Bruno Contreras-Moreira (2018). GET_PHYLOMARKERS, a software package to select optimal orthologous clusters for phylogenomics and inferring pan-genome phylogenies, used for a critical geno-taxonomic revision of the genus Stenotrophomonas. Front. Microbiol. | doi: 10.3389/fmicb.2018.00771
The GET_PHYLOMARKERS package is distributed in three formats:
For detailed instructions on installing the external dependencies please check INSTALL.md.
We highly recommend installing the GET_HOMOLOGUES+GET_PHYLOMARKERS Docker image or GET_PHYLOMARKERS Docker image to avoid potential problems with the installation of dependencies and guarantee access to the full functionality of both packages with minimal effort. Running containerized bioinformatics applications and pipelines provides many advantages, including reproducibility Gruening et. al 2021. If you have not set a Docker engine+client on your machine, please see the instructions provided in the INSTALL.md document. It explains how to install Docker on different platforms and how to download/upgrade the GET_HOMOLOGUES+GET_PHYLOMARKERS and GET_PHYLOMARKERS from the Docker Hub image registry.
GET_PHYLOMARKERS (Vinuesa et. al, 2018) implements a series of sequential filters (Fig. 1 and explained below) to selects markers from the homologous gene clusters produced by GET_HOMOLOGUES with optimal attributes for phylogenomic or population genetic inference. It estimates gene-trees and species-trees under the maximum likelihood (ML) optimality criterion using state-of-the-art fast ML tree searching algorithms. The species tree is estimated from the supermatrix of concatenated, top-scoring alignments that passed the quality filters (Fig. 1). The stringency of these filters and the thoroughness of the ML tree searches can be controlled by the user, although sensible defaults are provided, making it an easy-to-use, user-friendly software.
GET_HOMOLOGUES is a genome-analysis software package for microbial pan-genomics and comparative genomics originally described in the following publications:
More recently we have developed GET_HOMOLOGUES-EST, which can be used to cluster eukaryotic genes and transcripts, as described in Contreras-Moreira et al, Front. Plant Sci. 2017.
If GET_HOMOLOGUES_EST is fed both .fna and .faa files of CDS sequences it will produce identical output to that of GET_HOMOLOGUES and thus can be analyzed with GET_PHYLOMARKERS all the same.
GET_PHYLOMARKERS is primarily tailored towards selecting CDSs (gene markers) from multiple genome sequences to infer phylogenies of different species of the same genus or family. It can also select optimal markers for population genetics, when the source genomes belong to the same species. For more divergent genome sequences, classified in different genera, families, orders or higher taxa, the pipeline should be run using proteins instead of DNA sequences.
# 1. Default run: launches the IQT-based filtering pipeline, selecting best models for gene
# trees and fitting
# GTR+F+RATE models for the species tree search based on the supermatrix of concatenated,
# top-scoring alignments.
run_get_phylomarkers_pipeline.sh -R 1 -t DNA
# 2. Same as above, but adding molecular-clock analysis, assuming a HKY85+G substitution
# model and imposing a stronger filtering of input alignments based on kdetree -k 1
# and minimum average SH-like gene-tree support values of 0.8 (see manual for the details)
run_get_phylomarkers_pipeline.sh -R 1 -t DNA -K -M HKY -m 0.8 -k 1.0
# population-genetics mode
run_get_phylomarkers_pipeline.sh -R 2 -t DNA
# 3. Protein alignments, user-defined kdetrees & mean branch support cut-off values
run_get_phylomarkers_pipeline.sh -R 1 -t PROT -k 1.2 -m 0.7
# 4. Estimate a ML pan-genome tree from the pan-genome matrix, using 5 independent IQT runs and UFBoot
estimate_pangenome_phylogenies.sh -f pangenome_matrix_t0.fasta -r 5 -S UFBoot
# 5. Estimate a PARS pan-genome tree with bootstrapping; 100 bootstrap replicates divided on 10 core (10 reps / core)
estimate_pangenome_phylogenies.sh -c PARS -R 3 -i pangenome_matrix_t0.phylip -n 10 -b 10 -j 1 -t 1
## To run the pipeline on a remote server, we recommend using the nohup command upfront,
# as shown below:
# in this case, calling also IQ-TREE, which will select among the (HKY,TN,TVM,TIM,SYM,
# GTR)+RATE models and do 5 independent tree searches under the best-fit model,
# computing ultrafast bootstrapp and aproximate Bayes branch support values
nohup run_get_phylomarkers_pipeline.sh -R 1 -t DNA -S 'HKY,TN,TVM,TIM,SYM,GTR' \
-k 1.0 -m 0.7 -T high -N 5 &> log.get_phylomarkers_pipeline &
# use tail -f log.get_phylomarkers_pipeline to follow the output that the script logs to the logfile
tail -f log.get_phylomarkers_pipeline
NOTE: both .faa and .fna files are required to generate codon alignments from DNA fasta files. This means that two runs of compare_clusters.pl (from the GET_HOMOLOGUES package) are required, one of them using the -n flag GET_PHYLOMARKERS tutorial
run_get_phylomarkers_pipeline.sh is intended to run on a collection of single-copy sequence clusters from different species or strains.
NOTES: an absolute minimum of 4 distinct haplotypes per cluster are required for the cluster to be evaluated. Clusters with < 4 distinct sequences are automatically discarded, printing a warning message to screen. This means that at least 4 distinct genomes should be used as input.
However, the power of the pipeline for selecting optimal genome loci
for phylogenomics improves when a larger number of non-redundant genome
sequences are available for analysis. Reasonable numbers lie in the range of
10 to 200 distinct genomes from multiple species of a genus, family, order or
phylum.
The pipeline may not perform satisfactorily with too distant genome sequences,
particularly when sequences with significantly distinct nucleotide or aminoacid
compositions are used. This type of sequence heterogeneity is well known to
cause systematic bias in phylogenetic inference.
run_get_phylomarkers_pipeline.sh uses a hierarchical filtering scheme, summarized in Fig. 1 and detailed below:
Codon or protein alignments (depending on runmode) are first screened with Phi-test (Bruen et al. 2006) for the presence of potential recombinant sequences (Fig. 1). It is a well established fact that recombinant sequences negatively impact phylogenetic inference when using algorithms that do not account for the effects of this evolutionary force. The permutation test with 1000 resamplings is used to compute the p-values. These are considered significant if p < 0.05. Some loci may not contain sufficient polymorphisms for the test to work. In that case, the main script assumes that the locus does not contain recombinant sites, but prints a warning message to screen and to the logfile.
The next filtering step (Fig. 1) is provided by the kdetrees-test, which checks the distribution of topologies, tree lengths and branch lengths. kdetrees is a non-parametric method for estimating distributions of phylogenetic trees (Weyenberg et al. 2014), with the goal of identifying trees that are significantly different from the rest of the trees in the sample, based on the analysis of topology and branch length distributions. Such “outlier trees” may arise for example from horizontal gene transfers or gene duplication (and subsequent neofunctionalization), followed by differential loss of paralogues among lineages. Such processes will cause the affected genes to exhibit a significantly deviating phylogeny from that displayed by the bulk of genes, which are expected to be generated by the (multispecies) coalescent as species or populations diverge. Alignments producing significantly deviating trees in the kdetree test are identified and saved in the kde_outliers/ directory. The corresponding alignments are not used in downstream analyses.
* Parameter for controlling kdetrees stingency:
-k <real> kde stringency (0.7-1.6 are reasonable values; less is more stringent)
[default: 1.5]
The alignments passing the two previous filters Fig. 1 are subjected to maximum likelihood (ML) tree searches with FastTree or IQ-TREE (default) to infer the corresponding ML gene trees. Their phylogenetic signal is computed from the Shimodaria-Hasegawa-like likelihood ratio test branch support values, which vary between 0-1, as reported previously (Vinuesa et al. 2008). The support values of each internal branch or bipartition are parsed to compute the mean support value for each tree. Alignments/Trees with a mean support value below a cutoff threshold are discarded.
* Parameters controlling filtering based on mean support values.
-m <real> min. average support value (0.7-0.8 are reasonable values)
for trees/loci to be selected as informative [default: 0.75]
run_get_phylomarkers_pipeline.sh calls the auxiliary script run_parallel_molecClock_test_with_paup.sh to evaluate the global molecular clock hypothesis on the top markers, selected according to the criteria explained in the three previous sections (Fig. 1). The script calls paup* to evaluate the free-rates and clock hypothesis using likelihood ratio tests computed with R. Currently this is only performed on codon alignments.
As of version 1.9.9.0_22Dec17, GET_PHYLOMERKERS implements the -A I option, which calls IQ-TREE for ML tree searching Nguyen et. al (2015). This is the most recent fast ML software on the scene, which was developed with the aim of escaping from early local maxima encountered during “hill-climbing” by generating multiple seed trees to initiate tree searches and maintaining a pool of candidate seed trees during the entire run. Overall, it was the best-performing ML tree search algorithm among those evaluated by Zhou et al. (2017) in their above-mentioned benchmarking paper, at least for datasets < 200 taxa. For larger datasets (several hudreds to thousands of sequences), current implementations of RAxML and ExaML, which make heavy use of SPR-moves, may outperform IQ-TREE, which makes more intensive use of local NNI-moves.
Our benchmark analyses have shown that IQ-TREE (v1.6.1) Nguyen et. al (2015) runs quickly enough when the -fast flag is passed to make it feasible to include a model selection step using ModelFinder (Kalyaanamoorthy et al. 2017) when estimating gene trees withouth incurring in excessively long computation times.
Combined with its superior tree-searching algorithm allows IQT to rank as the clear winner in our benchmarks. Therefore, from version 2.0 onwards, GET_PHYLOMARKERS uses IQT as its default tree searching algorithm, both for the estimation of gene-trees and the species-tree (from the concatenated, top-scoring alignments).
All gene tree searches are run in parallel under the modelset with the following parameters: -mset XXX -m MFP -nt 1 -alrt 1000 -fast
To keep computation times within reasonable bounds, the number of models evaluated by ModelFinder (integrated in IQ-TREE) differ for the gene-tree and species-tree search phases, as detailed below:
# IQT gene tree searches (hard-coded): -T <high|medium|low|lowest> [default: medium]
- high: -m MFP -nt 1 -alrt 1000 -fast
- medium: -mset K2P,HKY,TN,TNe,TIM,TIMe,TIM2,TIM2e,TIM3,TIM3e,TVM,TVMe,GTR
- low: -mset K2P,HKY,TN,TNe,TVM,TVMe,TIM,TIMe,GTR
- lowest: -mset K2P,HKY,TN,TNe,TVM,TIM,GTR
# IQT species-tree searches on the supermatrix:
-S <string> quoted 'comma-separated list' of base models to be evaluated by IQ-TREE
when estimating the species tree from the concatenated supermatrix.
# If no -S is passed, then sinlge default models are used, as shown below:
- for DNA alignments [default: GTR]
<'JC,F81,K2P,HKY,TrN,TNe,K3P,K81u,TPM2,TPM2u,TPM3,TPM3u,TIM,TIMe,TIM2,TIM2e,TIM3,TIM3e,
TVM,TVMe,SYM,GTR'>
- for PROT alignments [default: LG]
<'BLOSUM62,cpREV,Dayhoff,DCMut,FLU,HIVb,HIVw,JTT,JTTDCMut,LG,mtART,mtMAM,mtREV,mtZOA,
Poisson,PMB,rtREV,VT,WAG'>
In addition, if -T high, run_get_phylomarkers_pipeline.sh will launch -N
After selecting the best substitution model, which includes taking care of among-site rate variation, IQ-TREE will search for the best tree, including bootstrapping with the UFBoot2 ultrafast bootstrapping algorithm (Hoang et al. 2017) and estimation of approximate Bayes branch support values.
Note that as of version 1.9.10 (January 1st, 2018), the script calls IQ-TREE 1.6.1 with the -fast flag enabled for maximal speed.
# 1. Default IQ-TREEE run (-T medium), evaluating the corresponding set models during
# gene-tree searches and evaluating the base GTR model on the concatenated DNA
# supermatrix, using a single search
run_get_phylomarkers_pipeline.sh -R 1 -t DNA
# 2. Run 10 independent IQ-TREEE runs on a concatenated DNA supermatrix, evaluating the
# TN,TIM,TVM,GTR base models
run_get_phylomarkers_pipeline.sh -R 1 -t DNA -S 'TN,TIM,TVM,GTR' -k 0.9 -m 0.8 -T high \
-N 10 &> /dev/null &
# 3. Run 5 independent IQ-TREEE runs on a concatenated PROT supermatrix,
# evaluating the LG,WAG,JTT matrices
run_get_phylomarkers_pipeline.sh -R 1 -t PROT -A I -S 'LG,WAG,JTT,VT' -k 1.0 -m 0.7 \
-T high -N 5 &> /dev/null &
# 4. To run the pipeline on a remote server, we recommend using the nohup command upfront,
# as shown below:
nohup run_get_phylomarkers_pipeline.sh -R 1 -t DNA -S 'HKY,TN,TVM,TIM,SYM,GTR' -k 1.0 \
-m 0.7 -T high -N 5 &> /dev/null &
-nt -gtr -gamma -bionj -slownni -mlacc 3 -spr 4 -sprlength 10
high: -nt -gtr -gamma -bionj -slow -slownni -mlacc 3 -spr 8 -sprlength 10
medium: -nt -gtr -gamma -bionj -slownni -mlacc 3 -spr 4 -sprlength 10
low: -nt -gtr -gamma -bionj -slownni -spr 4 -sprlength 10
lowest: -nt -gtr -gamma -mlnni 4
where ‘-s spr’ and ‘-l spr_length’ can be set by the user. The lines above show their default values.
high: -lg -gamma -bionj -slow -slownni -mlacc 3 -spr 4 -sprlength 10
# FastTree searching on a huge protein dataset for fast inspection
run_get_phylomarkers_pipeline.sh -R 1 -A F -t PROT -m 0.6 -k 1.0 -T lowest
# FastTree searching on DNA dataset using the most through possible search and extensive
SPR rounds
run_get_phylomarkers_pipeline.sh -R 1 -A F -t DNA -m 0.7 -k 1.0 -T high -s 20 -l 12
GET_PHYLOMARKERS is designed, created and maintained at the Center for Genomic Sciences of Universidad Nacional Autónoma de México (CCG/UNAM) and the Laboratory of Computational Biology at Estación Experimental de Aula Dei/CSIC in Zaragoza (Spain).
The pipeline code and accessory scripts were written Pablo Vinuesa and Bruno Contreras-Moreira, making use of the third party code libraries and binaries listed below. Please refer to INSTALL.md for installation instructions.
From the BioPerl suite:
The GET_PHYLOMARKERS distribution provides a test_sequences/ directory which holds the subdirectories core_genome/, pan_genome/ and pIncAC/. The first one contains *.fna and *.faa FASTA files corresponding to consensus (BDBH, COGtriangles and OMCL) core-genome clusters computed with GET_HOMOLOGUES from a set of 12 GenBank-formatted pIncA/C plasmids, provided in the pIncAC/ directory. The second one holds the pan-genome matrix computed by compare_clusters.pl from the GET_HOMOLOGUES suite in tabular (*.tab), FASTA (*.fasta) and phylip (*.phy) formats. The pIncAC/ directory holds the source *.gbk GenBank files. This directory has a README.txt file that briefly describes the GenBank files.
These directories allow you to:
Easily and rapidly test the GET_PHYLOMARKERS suite working on core-genome orthologous clusters (core_genome/)
Test the GET_PHYLOMARKERS suite working on the pan-genome matrix (pan_genome/)
Run the comlete GET_HOMOLOGUES + GET_PHYLOMARKERS pipelines from scratch (pIncAC/)
The first section below will first show how to share data between the host (your) machine and the GET_HOMOLOGUES + GET_PHYLOMARKERS Docker container.
The following ones provide code examples on how to run the full GET_HOMOLOGUES + GET_PHYLOMARKERS pipelines using the test sequences.
GET_HOMOLOGUES very conveniently allows users to run different clustering algorithms to define homologous gene families. Here we will show hot to do it running either a Docker container or directly from your host machine.
If you are not running in the Docker container, go (cd) into the distribution directory and cd into the subidrectory test_sequences/ and issue the commands shown below.
If you launched a Docker container, and assuming that you followed the Docker tutorial presented in the previous section, just make sure that you are in ~/get_homPhy/.
# 1. cd into the directory holding the test_sequences
cd test_sequences/
ls # will list the following directories: core_genome/, pan_genome/ and pIncAC/
# 2. run GET_HOMOLOGUES to compute the set of homologous clusters using the BDBH,
# COGtriangles and OMCL clustering algorithms
get_homologues.pl -d pIncAC -t 12 -e -n 1 # BDBH clusters containing sequences for the
# 12 genomes,
# excluding those with inparalogues (-e):
# 33 clusters found.
# Takes 330 wallclock secs on a comodity desktop with
# Intel Core2 Quad CPU Q9400 @ 2.66GHz x 1 cores
# if you want, you can suspend the job and put it to the background
CTRL-Z
bg
# and run the top command to see the processes running
top
# hit q to quit monitoring the processes running on your system (docker or host) with
# the top command
get_homologues.pl -d pIncAC -G -t 0 # COGtriangles, computing clusters of all sizes (-t 0)
# finds 408 clusters in 14 wallclock secs, as it recycles
# the blast results of the previous run.
get_homologues.pl -d pIncAC/ -M -t 0 # OMCL, finds 393 clusters in 5 wallclock secs.
GET_HOMOLOGUES is unique among similar software in many respects, like its capacity to compute consensus clusters from the solutions generated by the three clustering algorithms it implements, as demonstrated below.
To learn all details around running GET_HOMOLOGUES, please read the online GET_HOMOLOGUES manual
# Compute consensus core-genome clusters using compare_clusters.pl of the GET_HOMOLOGUES suite.
# Note that we run the script twice, once with the -n flag
# (to compute the consesus clusters at the DNA level, producing *.fna files)
# and a second instance without the flag, to get the protein clusters (*.faa files)
cd pIncAC_homologues/
find . -type d
compare_clusters.pl -d ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_,\
./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_,\
./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_ -o core_BCM -t 12 -n
compare_clusters.pl -d ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_,\
./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_,\
./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_ -o core_BCM -t 12
Figure 2. Venn diagram depicting the consensus core-genome of 31 genes for the 12 input pIncA/C sequences. This is the consensus set provided in the test_sequences/core_genome/ directory
To compute a consensus pan-genome, only the COGtriangles and OMCL algorithms should be used, as they don’t require a reference sequence. Run GET_HOMOLOGUES with the -t 0 option, to consider clusters of all occupancy sizes (from singletons to multigene families), as demonstrated below.
# Compute consensus pan-genome clusters and matrix using compare_clusters.pl of the
# GET_HOMOLOGUES suite.
# Note that we run the script twice, once with the -n and -m flags
# (to compute the consesus clusters at the DNA level, *.fna files
# and the pan-genome matrix) and a second instance without these flags,
# to get the protein clusters (*.faa files).
# Note also that we exclude the directory holding the BDBH clusters,
# as these are not suitable to compute a proper pan-genome matrix, since the
# BDBH clusters always contain the reference sequence, missing those that do not contain it.
compare_clusters.pl -d ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_,\
./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_ -o pan_CM -n -m
compare_clusters.pl -d ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_,\
./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_ -o pan_CM
Figure 3. Venn diagram depicting the consensus pan-genome clusters for the 12 input pIncA/C sequences contains 373 cluster. The corresponding pan-genome matrix is provided in the test_sequences/pan_genome/ directory
# assumes that you are within test_sequences/pIncAC_homologues/ or in test_sequences/core_BCM
cd core_BCM
# 1. make sure we have the same nuber of faa and fna cluster files;
# Note that GET_PHYLOMARKERS will check that for you automatically ;)
find . -name '*.faa' | wc -l # 31
find . -name '*.fna' | wc -l # 31
# 2.1 view the help menu of the main script run_get_phylomarkers_pipeline.sh
run_get_phylomarkers_pipeline.sh -h
# 2.2 run the pipeline under default values. See the manual for the many additional options
run_get_phylomarkers_pipeline.sh -R 1 -t DNA # takes 55 seconds on the above-mentioned machine
GET_PHYLOMARKERS prints informative progress messages to screen and to a logfile. Compare them with the pipeline’s flowchart shown in Fig. 1 to make shure you understand what computations are performed at each step.
# >>> explore the directories:
ls
cd get_phylomarkers_run_AIR1tDNA_k1.5_m0.7_Tmedium_30Jan18 # holds the compressed input
# FASTAS and alignments
# >>> cd into PhiPac/ dir, which holds the results of the recombination tests.
cd PhiPack
# have a peak a the file
head Phi_results_30Jan18.tsv
## ./1961_hypothetical_protein_cdnAln_Phi.log 1.00e+00 1.00e+00
## ./1962_DNA_topoisomerase_II_cdnAln_Phi.log 1.00e+00 1.00e+00
## ./1964_hypothetical_protein_cdnAln_Phi.log 1.00e+00 1.00e+00
## ./1966_site-specific_DNA_me_cdnAln_Phi.log 1.00e+00 1.00e+00
## ./1967_hypothetical_protein_cdnAln_Phi.log 1 1 TOO_FEW_INFORMATIVE_SITES
## ./1968_hypothetical_protein_cdnAln_Phi.log 1.00e+00 1.00e+00
## ./1969_putative_type_I_rest_cdnAln_Phi.log 1 1 TOO_FEW_INFORMATIVE_SITES
## ./1970_KorB_cdnAln_Phi.log 1.00e+00 1.00e+00
## ./1971_ParA_cdnAln_Phi.log 1.00e+00 1.00e+00
## ./1984_hypothetical_protein_cdnAln_Phi.log 1.00e+00 1.00e+00
Depending on the type of the input sequences (-t DNA|PROT), the results of the different phylogenetic filters performed on gene trees will live either in the non_recomb_cdn_alns (-t DNA) or in the non_recomb_FAA_alns (-t PROT) directories.
# >>> cd into the directory holding the non-recombinant sequences,
# which contains the results of several phylogenetic attributes of the loci
cd non_recomb_cdn_alns/
# 1.1 to identify the kde-test outlier loci see the contents of kde_outliers, in this case only 1 "outlier" locus was detected
ls kde_outliers
# 1.2 the details of the kdetrees test are provided in tsv-formatted file kde_dfr_file_all_gene_trees.tre.tab
head kde_dfr_file_all_gene_trees.tre.tab
## file kde_topo_dens kde_topo_test kde_bl_dens kde_bl_test kde_bl_topo_test
## 1961_hypothetical_protein_cdnAln.fasta.treefile 0.288751935193142 ok 63.2675110841703 ok ok
## 1962_DNA_topoisomerase_II_cdnAln.fasta.treefile 0.252919761356219 ok 28.7241410392773 ok ok
## 1964_hypothetical_protein_cdnAln.fasta.treefile 0.287751455109637 ok 17.2677657886815 ok ok
## 1966_site-specific_DNA_me_cdnAln.fasta.treefile 0.240400359789596 outlier 27.0113341481227 ok outlier
## 1967_hypothetical_protein_cdnAln.fasta.treefile 0.252639796140819 ok 95.8112601171774 ok ok
## 1968_hypothetical_protein_cdnAln.fasta.treefile 0.297536609843425 ok 96.2903342166101 ok ok
## 1969_putative_type_I_rest_cdnAln.fasta.treefile 0.314498361683638 ok 95.2603170008101 ok ok
## 1970_KorB_cdnAln.fasta.treefile 0.331685645962565 ok 79.0447022830493 ok ok
## 1971_ParA_cdnAln.fasta.treefile 0.334991727858742 ok 55.4211725899592 ok ok
# 2.1 have a look at the counts of best-fitting models selected
cat IQT_best_model_counts_for_gene_trees.tsv
## model count
## HKY+F 2
## HKY+F+G4 1
## K2P 26
## K2P+G4 1
## TIMe 1
# 2.2 and locus-specific stats are provided in IQT_DNA_gene_tree_Tmedium_stats.tsv
head IQT_DNA_gene_tree_Tmedium_stats.tsv
## alignment wc_secs CPU_secs lnL model s_type
## ./1961_hypothetical_protein_cdnAln.fasta.log 0.164 0.112 -1107.994 K2P IQTdnaTmedium
## ./1962_DNA_topoisomerase_II_cdnAln.fasta.log 0.251 0.184 -4124.482 HKY+F+G4 IQTdnaTmedium
## ./1964_hypothetical_protein_cdnAln.fasta.log 0.284 0.172 -1225.145 K2P+G4 IQTdnaTmedium
## ./1966_site-specific_DNA_me_cdnAln.fasta.log 0.204 0.108 -2183.089 K2P IQTdnaTmedium
## ./1967_hypothetical_protein_cdnAln.fasta.log 0.089 0.072 -870.935 K2P IQTdnaTmedium
## ./1968_hypothetical_protein_cdnAln.fasta.log 0.099 0.076 -1179.667 K2P IQTdnaTmedium
## ./1969_putative_type_I_rest_cdnAln.fasta.log 0.064 0.060 -1111.931 K2P IQTdnaTmedium
## ./1970_KorB_cdnAln.fasta.log 0.130 0.116 -2112.818 TIMe IQTdnaTmedium
## ./1971_ParA_cdnAln.fasta.log 0.129 0.072 -1157.621 K2P IQTdnaTmedium
# 3. graphical summaries of the results of the kdetrees and the distribuions of support-values are provided as svg files.
ls -1 *svg
## dotplot_and_bxplot_kdeDensity_dist_dissim_topo_TRUE.svg
## parallel_bxplots_kdeDensity_dist_dissim_topo_TRUE-FALSE.svg
## scatterplot_for_gene_tree_support_values.svg
Figure 4 below depicts the results of the non-parametric kdetrees test, run at the default stringency level of k = 1.5. As depicted on the graph, only one outlier is detected based on the topology (lower panel). Note that this and following graphs plot the results of a small (toy) dataset. They become more useful with larger ones.
IMPORTANT NOTE: to visualize the figures, you need to access the corresponding files from your local host, as the Docker container does not provide a graphical environment with visualization tools! Just open a second terminal (or use your system’s file browser) to access the working directory and display the figures.
Figure 5 depicts a scatterplot and a histogram summarizing the distribution of mean SH-support values for the 25 gene trees that reached this point in the pipeline.
Finally we will inspect the contents of the top_15_markers_ge70perc/ directory, which holds the top-scoring markers that passed the above-mentioned filters (Fig. 1). In this directory you will find the supermatrix resulting from their concatenation and the “species-tree” estimated from it under the best-fitting model identified by ModelFinder/IQ-TREE. It contains also the resutls of RF-distances of gene trees to species-tree and the global molecular clock test results if the main script was invoked with -R 1 -t DNA -K
# >>> inspecting the contents of the top_15_markers_ge70perc/ directory
cd top_15_markers_ge70perc
# The concatenation coordinates for the supermatrix are saved in concatenation_coordinates.txt
cat concatenation_coordinates.txt
## concatenation coordinates:
## 1961_hypothetical_protein_cdnAln.fasta = 1-615
## 1967_hypothetical_protein_cdnAln.fasta = 616-1113
## 1968_hypothetical_protein_cdnAln.fasta = 1114-1785
## 1970_KorB_cdnAln.fasta = 1786-2961
## 1971_ParA_cdnAln.fasta = 2962-3642
## 1984_hypothetical_protein_cdnAln.fasta = 3643-3975
## 1989_hypothetical_protein_cdnAln.fasta = 3976-4830
## 1990_hypothetical_protein_cdnAln.fasta = 4831-5346
## 1994_DSBA_oxidoreductase_cdnAln.fasta = 5347-6204
## 1995_putative_signal_pept_cdnAln.fasta = 6205-7110
## 1996_hypothetical_protein_cdnAln.fasta = 7111-7962
## 1997_hypothetical_protein_cdnAln.fasta = 7963-8403
## 1998_hypothetical_protein_cdnAln.fasta = 8404-8922
## 1999_hypothetical_protein_cdnAln.fasta = 8923-9381
## 2012_TraF_cdnAln.fasta = 9382-10395
# some phylogenetic properties of the markers are summarized in sorted_aggregated_support_values4loci.tab
# graphical analysis of RF-distances of gene-trees to the species-tree is found in
# scatterplot_RF-dist_of_gene_trees2concat_phylo.svg
# The final tree concat_nonRecomb_KdeFilt_iqtree_GTR+F+ASC_ed.sptree can be conveniently
# visualized and edited with figtree
figtree concat_nonRecomb_KdeFilt_iqtree_GTR+F+ASC_ed.sptree &
Figure 6 depicts the RF-distances of the 15 top-scoring markers to the ML tree inferred under the GTR+F+ASC substitution model from the concatenation of these loci (shown in Fig. 7)
Figure 7 depicts the best ML “species tree” inferred under the GTR+F+ASC substitution model from the supermatrix of 15 top-scoring markers. The nodes are colored according to the legend. The values on the left correspond to approximate Bayes branch support values and those to the right to the UFBoot values described in the manual.
# assumes that you are within test_sequences/pIncAC_homologues/, or in test_sequences/pan_genome
cd pan_CM
# 1. find the pangenome_matrix
ls pangenome_matrix*
# 2. run estimate_pangenome_phylogenies.sh launching 10 independent iqtree searches,
# fitting binary (BIN) models
estimate_pangenome_phylogenies.sh -f pangenome_matrix_t0.fasta -r 10
The estimate_pangenome_phylogenies.sh reports its progress to the screen, informing that: - it created and moved into subdir iqtree_PGM_10_runs, which holds the results of the analysis - the best-fitting binary model was GTR2+FO - after determining the best model, it performs 10 independent IQ-TREE searches - Best IQ-TREE run was: abayes_UFBboot_run9.log:BEST SCORE FOUND : -1897.260 - wrote file best_PGM_IQT_abayes_UFBboot_run9_GTR2+FO.treefile in pan_CM/iqtree_PGM_10_runs/iqtree_10_runs
Lets have a look at the tree search profile
cd iqtree_PGM_10_runs/iqtree_10_runs
cat sorted_lnL_scores_IQ-TREE_searches.out
## abayes_UFBboot_run9.log:BEST SCORE FOUND : -1897.260
## abayes_UFBboot_run8.log:BEST SCORE FOUND : -1897.260
## abayes_UFBboot_run7.log:BEST SCORE FOUND : -1897.260
## abayes_UFBboot_run6.log:BEST SCORE FOUND : -1897.260
## abayes_UFBboot_run5.log:BEST SCORE FOUND : -1897.260
## abayes_UFBboot_run4.log:BEST SCORE FOUND : -1897.260
## abayes_UFBboot_run3.log:BEST SCORE FOUND : -1897.260
## abayes_UFBboot_run2.log:BEST SCORE FOUND : -1897.260
## abayes_UFBboot_run1.log:BEST SCORE FOUND : -1897.260
## abayes_UFBboot_run10.log:BEST SCORE FOUND : -1897.260
No wonder, in such a small dataset, all searches find exactly the same best tree. However, this will be less likely the case as the number of sequences in the dataset increases, as the size of the tree-spece increases factorialy with each new sequence.
figtree best_PGM_IQT_abayes_UFBboot_run9_GTR2+FO.treefile &
Figure 8 displays the best ML pan-genome tree, again using figtree. The nodes are colored according to the legend. The first value corresponds to approximate Bayes branch support values and second ones to the UFBoot values described in the manual.
# assumes that you are within test_sequences/pIncAC_homologues/ or test_sequences/pan_genome
cd pan_CM
# 1. find the pangenome_matrix
ls pangenome_matrix*
# 2. run estimate_pangenome_phylogenies.sh launching 10 independent iqtree searches on
# all available cores (-n 4 in this case), fitting binary (BIN) models
estimate_pangenome_phylogenies.sh -c PARS -R 3 -i pangenome_matrix_t0.phylip -n 4 -b 25 -j 10 -t 1
After changing into the boot_pars/ dir we can edit and visualize the most parsimonious tree found among 100 pars (from the PHYLIP package) searches split on 4 cores (-n 4) with 10 taxon jumples searches using again figtree. See the manual for the details.
cd boot_pars
figtree full_pars_tree_rooted_withBoot_ed.ph &
Figure 9 displays the most parsimonious pan-genome tree, again using figtree. The nodes are colored according to the legend. The nodes are colored according to the legend, which represents standard bootstrap support values computed by seqboot from the PHYLIP package.
If you are running the tutorials from a Docker image instance, do you remember how to exit the container? It’s with exit
That’s it, enjoy!
The code is developed and maintained by Pablo Vinuesa at CCG-UNAM, Mexico and Bruno Contreras-Moreira at EEAD-CSIC, Spain. It is released to the public domain under the GNU GPLv3 license.
Pablo Vinuesa, Luz-Edith Ochoa-Sanchez and Bruno Contreras-Moreira (2018). GET_PHYLOMARKERS, a software package to select optimal orthologous clusters for phylogenomics and inferring pan-genome phylogenies, used for a critical geno-taxonomic revision of the genus Stenotrophomonas. Front. Microbiol. 9:771 | doi: 10.3389/fmicb.2018.00771
Published in the Research Topic on “Microbial Taxonomy, Phylogeny and Biodiversity” http://journal.frontiersin.org/researchtopic/5493/microbial-taxonomy-phylogeny-and-biodiversity
A preprint version is available on bioRxiv
Code Available at https://github.com/vinuesa/get_phylomarkers and released under the GNU GPLv3 license.
We thank Alfredo J. Hernández and Víctor del Moral at CCG-UNAM for technical support.
We gratefully acknowledge the funding provided by DGAPA-PAPIIT/UNAM (grants IN201806-2, IN211814 and IN206318) and CONACyT-Mexico (grants P1-60071, 179133 and A1-S-11242) to Pablo Vinuesa, as well as the Fundación ARAID,Consejo Superior de Investigaciones Científicas (grant 200720I038 and Spanish MINECO (AGL2013-48756-R) to Bruno Contreras-Moreira.