1 Aim of the tutorial
- 1.1 GET_HOMOLOGUES and GET_PHYLOMARKERS source code and documentation
2 Running GET_HOMOLOGUES and GET_PHYLOMARKERS on buluc|tepeu using the Linux $Shell$
- 2.1 Connecting to chaac
- 2.2 Preliminaries: setup your working directory and fetch the test data from the TIB-filo GitHub repository
3 GET_HOMOLOGUES tutorial: computing core- and pan-genomes
4 Estimating core- and pan-genome maximum-likelihood phylogenies with the GET_PHYLOMARKERS suite
5 Developers
6 Citations
7 Source
8 Docker images for GET_HOMOLOGUES and GET_PHYLOMARKERS
9 Documentation
10 Acknowledgements
- 10.1 Personal
- 10.2 Funding

1 Aim of the tutorial

This tutorial will show how to use the get_homologues (Contreras-Moreira & Vinuesa, 2013) and get_phylomarkers (Vinuesa et al. 2018) software packages to perform a thorough pan-genomic and phylogenomic analyses of a small dataset of conjugative multiresistance plasmids of the pIncC (formerly IncA/C) incompatibility group of great concern in the clinical setting Ambrose et al. 2018. This is an updated version of a similar tutorial published by Vinuesa & Contreras-Moreira 2015.

1.1 GET_HOMOLOGUES and GET_PHYLOMARKERS source code and documentation

Both software packages are freely distributed through their respective GitHub repositories:
- get_homologues - GitHub repo
- get_phylomarkers - GitHub repo
get_homologues - GitHub repo can be installed as a BioConda package and run in a conda environment.
Both software packages can be conveniently installed and run as containerized Docker images available from:
- GET_HOMOLOGUES + GET_PHYLOMARKERS Docker image
- GET_PHYLOMARKERS Docker image

The GET_HOMOLOGUES and GET_PHYLOMARKERS packages have extensive documentation and tutorials online
- get_homologues manual
- get_phylomarkers manual

2 Running GET_HOMOLOGUES and GET_PHYLOMARKERS on buluc|tepeu using the Linux $Shell$

Both software packages can be run on any platform using the Docker container, or directly on UNIX and GNU/Linux servers.

In any case, you will have to work using Linux $Shell$ commands. Therefore it is important that you have some proficiency working on the command line.

Here is a complete tutorial on biocomputing in the GNU/Linux environment, with lots of examples.

2.1 Connecting to chaac

The whole exercises will be run on chaac.

ssh username@...

type your password

ssh -X vinuesa@*.*.*.*
vinuesa@*.*.*.*'s password: 
...

[vinuesa@chaac ~]$

2.2 Preliminaries: setup your working directory and fetch the test data from the TIB-filo GitHub repository

The GET_PHYLOMARKERS distribution provides a test_sequences/ directory which holds the subdirectories core_genome/, pan_genome/ and pIncAC/.

The pIncAC/ directory holds the source *.gbk GenBank files. This directory has a README.txt file that briefly describes the GenBank files.

#0.  Welcome to chaac

# processors available
awk '/^proc/ {n++} END{print n}' /proc/cpuinfo

# disk space on file system
df -h

# check system load
htop

#>>>>> 1) mkdir sesion7_pangenomics, cd into it and create symbolic links to the source GenBank files to be processed (12 pIncA/C plasdmis)
#         The paths provided below will only work for users running in their home account on buluc.

# cd to your working directory (e.g. cd ~/TIB-filo)
if [ ! -d sesion7_pangenomics ]; then
    mkdir sesion7_pangenomics && cd sesion7_pangenomics
else
   cd sesion7_pangenomics
fi

# save path in variable top_dir
top_dir=$(pwd)
echo $top_dir

# fetch the data from TIB-filo's GitHub repo
wget -c https://github.com/vinuesa/TIB-filoinfo/raw/master/docs/sesion7_pangenomica_GET_HOMOLOGUES/data/pIncAC.tgz

if [ -s pIncAC.tgz ]; then
    tar -xzf pIncAC.tgz && rm pIncAC.tgz && cd pIncAC
else
    echo "ERROR! could not find  pIncAC.tgz in $top_dir. Will exit now. 
    exit 1
fi

# list files
ls -l

# how many gbk files do we have?
ls *gbk | wc -l

# peak at one of them using head and tail
head -50 _Aeromonas_hydrophila_plasmid_pRA1_NC_012885.gbk
tail _Aeromonas_hydrophila_plasmid_pRA1_NC_012885.gbk

# How many CDSs do these gbk files have? - identify the genome with the lowest number of CDSs
grep -c ' CDS ' *gbk | sed 's/:/\t/' | sort -nk2

# save the absolute path of this directory for easy reference later on
gbk_dir=$(pwd)

# return to the parent directory and save its full path to the variable top_dir for easy reference
# cd ..
cd $top_dir

# where am I?
pwd

# list; you should see the pIncAC/ directory holding the *gbk files
ls

3 GET_HOMOLOGUES tutorial: computing core- and pan-genomes

In this section we will compute consensus core- and pan-genomes of the pIncA/C dataset as the intersection of clusters computed by the tree clustering algorithms implemented in get_homologues, as shown in the flowchart below:

Figure 1. The GET_HOMOLOGUES pipeline depicted as a flowchart

3.1 GET_HOMOLOGUES options

GET_HOMOLOGUES is a sophisticated piece of software, with several external dependencies and auxiliary scripts, as depicted in Fig. 1.

To run GET_HOMOLOGUES on chaac we need to start the corresponding conda environment

# Activate conda's get_homologues environment
conda activate get_homologues

3.1.1 Checking that all dependencies are in place

To make sure that get_homlogues.pl is in PATH and that the kit is complete type the following command

get_homologues.pl -v

# 0. make sure get_homlogues.pl in PATH and that the kit is complete

/export/apps/bioconda/envs/get_homologues/bin/get_homologues.pl version 14112024


Program written by Bruno Contreras-Moreira (1) and Pablo Vinuesa (2).

 1: http://www.eead.csic.es/compbio (Estacion Experimental Aula Dei/CSIC/Fundacion ARAID, Spain)
 2: http://www.ccg.unam.mx/~vinuesa (Center for Genomic Sciences, UNAM, Mexico)

Primary citations:

 Contreras-Moreira B, Vinuesa P (2013) GET_HOMOLOGUES, a versatile software package for scalable and
 robust microbial pangenome analysis. Appl Environ Microbiol 79(24):7696-701. (PubMed:24096415)

 Vinuesa P and Contreras-Moreira B (2015) Robust Identification of Orthologues and Paralogues for 
 Microbial Pan-Genomics Using GET_HOMOLOGUES: A Case Study of pIncA/C Plasmids. In Bacterial Pangenomics,
 Methods in Molecular Biology Volume 1231, 203-232, edited by A Mengoni, M Galardini and M Fondi.

This software employs code, binaries and data from different authors, please cite them accordingly:
 OrthoMCL v1.4 (www.orthomcl.org , PubMed:12952885)
 COGtriangles v2.1 (sourceforge.net/projects/cogtriangles , PubMed=20439257)
 NCBI Blast-2.2 (blast.ncbi.nlm.nih.gov , PubMed=9254694,20003500)
 Bioperl v1.5.2 (www.bioperl.org , PubMed=12368254)
 HMMER 3.1b2 (hmmer.org)
 Pfam (http://pfam.xfam.org , PubMed=26673716)
 Diamond v0.8.25 (https://github.com/bbuchfink/diamond, PubMed=25402007)

Checking required binaries and data sources, all set in phyTools.pm :
        EXE_BLASTP : OK (path:blastp)
        EXE_BLASTN : OK (path:blastn)
      EXE_FORMATDB : OK (path:makeblastdb)
           EXE_MCL : OK (path:mcl)
      EXE_MAKEHASH : OK (path:COGmakehash )
     EXE_READBLAST : OK (path:COGreadblast )
        EXE_COGLSE : OK (path:COGlse )
        EXE_COGTRI : OK (path:COGtriangles )
       EXE_HMMPFAM : OK (hmmscan --noali --acc --cut_ga  /export/apps/bioconda/envs/get_homologues/bin/db/Pfam-A.hmm)
         EXE_DMNDP : OK (path:diamond blastp)
         EXE_DMNFT : OK (path:diamond makedb)
        EXE_INPARA : OK (path:/export/apps/bioconda/envs/get_homologues/bin/_cluster_makeInparalog.pl)
         EXE_ORTHO : OK (path:/export/apps/bioconda/envs/get_homologues/bin/_cluster_makeOrtholog.pl)
         EXE_HOMOL : OK (path:/export/apps/bioconda/envs/get_homologues/bin/_cluster_makeHomolog.pl)
    EXE_SPLITBLAST : OK (path:/export/apps/bioconda/envs/get_homologues/bin/_split_blast.pl)
  EXE_SPLITHMMPFAM : OK (path:/export/apps/bioconda/envs/get_homologues/bin/_split_hmmscan.pl)

3.1.2 Running the main $script$ get_homologues.pl

The main pipeline depicted above (Fig. 1) is run with the core $script$ get_homologues.pl.

The options of the $script$ are displayed by simply calling it with any of the following commands

get_homologues.pl
get_homologues.pl -h

usage: /export/apps/get_homologues/get_homologues.pl [options]

-h this message
-v print version, credits and checks installation
-d directory with input FASTA files ( .faa / .fna ),           (overrides -i,
   GenBank files ( .gbk ), 1 per genome, or a subdirectory      use of pre-clustered sequences
   ( subdir.clusters / subdir_ ) with pre-clustered sequences   ignores -c, -g)
   ( .faa / .fna ); allows for new files to be added later;    
   creates output folder named 'directory_homologues'
-i input amino acid FASTA file with [taxon names] in headers,  (required unless -d is set)
   creates output folder named 'file_homologues'

Optional parameters:
-o only run BLAST/Pfam searches and exit                       (useful to pre-compute searches)
-c report genome composition analysis                          (follows order in -I file if enforced,
                                                                ignores -r,-t,-e)
-R set random seed for genome composition analysis             (optional, requires -c, example -R 1234,
                                                                required for mixing -c with -c -a runs)
-m runmode [local|cluster]                                     (default local)
-n nb of threads for BLAST/HMMER/MCL in 'local' runmode        (default=2)
-I file with .faa/.gbk files in -d to be included              (takes all by default, requires -d)

Algorithms instead of default bidirectional best-hits (BDBH):
-G use COGtriangle algorithm (COGS, PubMed=20439257)           (requires 3+ genomes|taxa)
-M use orthoMCL algorithm (OMCL, PubMed=12952885)

Options that control sequence similarity searches:
-X use diamond instead of blastp                               (optional, set threads with -n)
-C min %coverage in BLAST pairwise alignments                  (range [1-100],default=75)
-E max E-value                                                 (default=1e-05,max=0.01)
-D require equal Pfam domain composition                       (best with -m cluster or -n threads)
   when defining similarity-based orthology
-S min %sequence identity in BLAST query/subj pairs            (range [1-100],default=1 [BDBH|OMCL])
-N min BLAST neighborhood correlation PubMed=18475320          (range [0,1],default=0 [BDBH|OMCL])
-b compile core-genome with minimum BLAST searches             (ignores -c [BDBH])

Options that control clustering:
-t report sequence clusters including at least t taxa          (default t=numberOfTaxa,
                                                                t=0 reports all clusters [OMCL|COGS])
-a report clusters of sequence features in GenBank files       (requires -d and .gbk files,
   instead of default 'CDS' GenBank features                    example -a 'tRNA,rRNA',
                                                                NOTE: uses blastn instead of blastp,
                                                                ignores -g,-D)
-g report clusters of intergenic sequences flanked by ORFs     (requires -d and .gbk files)
   in addition to default 'CDS' clusters
-f filter by %length difference within clusters                (range [1-100], by default sequence
                                                                length is not checked)
-r reference proteome .faa/.gbk file                           (by default takes file with
                                                                least sequences; with BDBH sets
                                                                first taxa to start adding genes)
-e exclude clusters with inparalogues                          (by default inparalogues are
                                                                included)
-x allow sequences in multiple COG clusters                    (by default sequences are allocated
                                                                to single clusters [COGS])
-F orthoMCL inflation value                                    (range [1-5], default=1.5 [OMCL])
-A calculate average identity of clustered sequences,          (optional, creates tab-separated matrix,
 by default uses blastp results but can use blastn with -a      recommended with -t 0 [OMCL|COGS])
-P calculate percentage of conserved proteins (POCP),          (optional, creates tab-separated matrix,
 by default uses blastp results but can use blastn with -a      recommended with -t 0 [OMCL|COGS])
-z add soft-core to genome composition analysis                (optional, requires -c [OMCL|COGS])

 This program uses BLAST (and optionally HMMER/Pfam) to define clusters of 'orthologous'
 genomic sequences and pan/core-genome gene sets. Several algorithms are available
 and search parameters are customizable. It is designed to process (in a HPC computer
 cluster) files contained in a directory (-d), so that new .faa/.gbk files can be added
 while conserving previous BLAST results. In general the program will try to re-use
 previous results when run with the same input directory.

For the details of each option, please have a look a the extensive get_homologues manual. Here we can review only some of the most important ones.

3.2 Running GET_HOMOLOGUES with different clustering algorithms to compute consensus core- and pan-genomes

GET_HOMOLOGUES allows users to run 3 different clustering algorithms ($BDBH$, $COGtriangles$ and $OMCL$) to define homologous gene families, as defined in Table 1.

Table 1. Overview of the three clustering algorithms implemented in get_homologues

name	option	comment
BDBH	default	Starting from a reference genome, keep adding genomes stepwise, while storing the sequence clusters that result of merging the best bidirectional best hits
COGS	-G	Merges triangles of inter-genomic symmetrical best matches, as described in PubMed=20439257
OMCL	-M	OrthoMCL v1.4, uses the Markov Cluster Algorithm to group sequences, with inflation (-F) controlling cluster granularity, as described in PubMed=12952885

To learn all details around these algorithms, please read the online GET_HOMOLOGUES manual

3.2.1 The sequence data for get_homologues

The only required option is either -i, used to choose an input file, or preferably -d instead, which indicates an input folder.

In previous versions only files with extensions .fa[a] and .gb[k] were considered when parsing the -d directory. Currently, GZIP- or BZIP2-compressed input files are also accepted!

By using .faa input files in theory you might only calculate clusters of protein sequences. In contrast, the advantage of using .gbk files is that you obtain both nucleotide and protein clusters. If both types of input files are combined, only protein clusters will be produced. However, if each input .faa file has a twin .fna file in place, containing the corresponding nucleotide sequences in the same order, the program will attempt to produce the corresponding clusters of nucleotide sequences. The possible input file combinations are summarized in Table 1:

Table 2: Valid input file combinations.

input file extensions	output clusters
.gbk	amino acid + DNA sequence
.faa	amino acid sequence
.gbk & .faa	amino acid sequence
.faa & .fna	amino acid + DNA sequence
.gbk & .faa & .fna	amino acid + DNA sequence

In summary, the use of an input folder or directory (-d) containing GenBank files is recommended, as it allows for new files to be added there in the future, reducing the computing required for updated analyses. For instance, if a user does a first analysis with 5 input genomes today, it is possible to check how the resulting clusters would change when adding an extra 10 genomes tomorrow, by copying these new 10 .faa / .gbk input files to the pre-existing -d folder, so that all previous BLAST searches are re-used.

To run $get\_homologues.pl\ -d$ you need to be in the directory on top of the one holding the GenBank files. In our case this is the one saved in $\$top\_dir$, holding the $pIncAC$ directory with the *gbk files.

3.2.2 Computing BDBH clusters with GET_HOMOLOGUES (default clustering mode)

The BDBH algorithm implemented in GET_HOMOLOGUES is a heuristic version that does NOT cluster the results of the all-against-all blast data, just those against the smallest genome (default) or a ($-R\ refGenome$), as shown in Fig. 2. Therefore it is not suitable for computing pan-genomes. Use it only for core-genome calculations!

Figure 2. GET_HOMOLOGUE’s implementation of the BDBH algorithm using a reference genome

Let’s instruct get_homologues.pl to compute the BDBH clusters for the 12 genomes. Make sure you are in $\$top\_dir$ and issue the following command, which will compute a core-genome based on the BDBH (default) clustering algorithm, excluding inparaloges (-e)

get_homologues.pl -d pIncAC -t 12 -e -n 2 #

Note: the command above is equivalent to get_homologues.pl -d pIncAC -e -n 2, as get_homologues.pl will by default use -t = number_of_genomes available in the $\$gbk\_dir$

# 1. run GET_HOMOLOGUES to compute the set of core-genome clusters using the BDBH, 
# get_homologues.pl -d pIncAC &> log.BDBH_def &
get_homologues.pl -d pIncAC -t 12 -e -n 2  # BDBH clusters containing sequences for the 12 genomes (-t 12) 
                                           # excluding those with inparalogues (-e): 
                                           # use 2 processors for the job
                                           # This will take ~ 3 minutes

# version 14112024
# results_directory=/home/vinuesa/pangenomics/pIncAC_homologues
# parameters: MAXEVALUEBLASTSEARCH=0.01 MAXPFAMSEQS=250 BATCHSIZE=100 KEEPSCNDHSPS=1
# diamond job:0

# checking input files...
# Salmonella_enterica_33676_pIncAC.gbk 185
# Salmonella_enterica_YU39_pIncAC.gbk 178
# Salmonella_enterica_pAM04528.gbk 180
# Salmonella_enterica_sv_Kentucky.gbk 164
# _Aeromonas_hydrophila_plasmid_pRA1_NC_012885.gbk 158
# _Escherichia_coli_UMNK88_plasmid_pUMNK88_NC_017645.gbk 173
# _Escherichia_coli_plasmid_pAPEC1990_61_NC_019066.gbk 172
# _Escherichia_coli_plasmid_pAR060302_NC_012692.gbk 189
# _Escherichia_coli_plasmid_pNDM-1_Dok01_NC_018994.gbk 224
# _Escherichia_coli_plasmid_pPG010208_NC_019065.gbk 148
# _Escherichia_coli_plasmid_peH4H_NC_012690.gbk 173
# _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk 137

# 12 genomes, 2081 sequences

# taxa considered = 12 sequences = 2081 residues = 545474 MIN_BITSCORE_SIM = 16.9

# mask=KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_ (_algBDBH)

# running makeblastdb with /home/vinuesa/pangenomics/pIncAC_homologues/Salmonella_enterica_33676_pIncAC.gbk.fasta

# running makeblastdb with /home/vinuesa/pangenomics/pIncAC_homologues/Salmonella_enterica_YU39_pIncAC.gbk.fasta

# ...

# running BLAST searches ...
# done
# ...

# concatenating and sorting BLAST/DIAMOND results...
# sorting _Salmonella_enterica_33676_pIncAC.gbk results (0.097MB)
# sorting _Salmonella_enterica_YU39_pIncAC.gbk results (0.091MB)
# sorting _Salmonella_enterica_pAM04528.gbk results (0.1MB)
# clustering inparalogues in _Escherichia_coli_plasmid_peH4H_NC_012690.gbk
# ...
# done

# parsing blast result! (/home/vinuesa/cursos/pangenomics_PDCBq_Nov20/test_sequences/pIncAC_homologues/tmp/all.blast , 1.1MB)
# parsing file finished

# creating indexes, this might take some time (lines=2.25e+04) ...

# construct_taxa_indexes: number of taxa found = 12
# number of file addresses/BLAST queries = 2.1e+03

# clustering orthologous sequences

# clustering inparalogues in _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk (reference)
# 0 sequences

# clustering inparalogues in Salmonella_enterica_33676_pIncAC.gbk
# 4 sequences

# finding BDBHs between _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk and Salmonella_enterica_33676_pIncAC.gbk
# 59 sequences

# clustering inparalogues in Salmonella_enterica_YU39_pIncAC.gbk
# 3 sequences

# finding BDBHs between _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk and Salmonella_enterica_YU39_pIncAC.gbk
# 84 sequences

# ...

# looking for valid ORF clusters (n_of_taxa=12)...


# number_of_clusters = 33
# cluster_list = pIncAC_homologues/KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_.cluster_list
# cluster_directory = pIncAC_homologues/KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_

# runtime: 218 wallclock secs ( 4.40 usr  0.75 sys + 174.83 cusr 17.86 csys = 197.84 CPU)
# RAM use: 33.7 MB

The results of the previous and following get_homologue.pl runs can be found in the pIncAC_homologues/ directory, which includes the all-against-all blast results.

ls -1d *

pIncAC
pIncAC_homologues

We will explore its contents later on.

Let’s now compute the pan-genome clusters with the COGtriangles and OMCL algorithms, which will use the blastp results already available in pIncAC_homologues/ from the previous BDBH run.

3.2.3 Clustering sequences with the COGtriangles algorithm ($-G$)

Run with $-t\ 0$ to compute clusters of all sizes, which is required for the pan-genome analysis.

get_homologues.pl -d pIncAC -G -t 0

get_homologues.pl -d pIncAC -G -t 0  # COGtriangles, computing clusters of all sizes (-t 0)
                                     # finds 408 clusters in 14 wallclock secs, as it recycles
                                     # the blast results of the previous run.
                                     
                                     
# version 14112024
# results_directory=/home/vinuesa/pangenomics/pIncAC_homologues
# parameters: MAXEVALUEBLASTSEARCH=0.01 MAXPFAMSEQS=250 BATCHSIZE=100 KEEPSCNDHSPS=1
# diamond job:0

# checking input files...
# Salmonella_enterica_33676_pIncAC.gbk 185
# Salmonella_enterica_YU39_pIncAC.gbk 178
# Salmonella_enterica_pAM04528.gbk 180
# Salmonella_enterica_sv_Kentucky.gbk 164
# _Aeromonas_hydrophila_plasmid_pRA1_NC_012885.gbk 158
# _Escherichia_coli_UMNK88_plasmid_pUMNK88_NC_017645.gbk 173
# _Escherichia_coli_plasmid_pAPEC1990_61_NC_019066.gbk 172
# _Escherichia_coli_plasmid_pAR060302_NC_012692.gbk 189
# _Escherichia_coli_plasmid_pNDM-1_Dok01_NC_018994.gbk 224
# _Escherichia_coli_plasmid_pPG010208_NC_019065.gbk 148
# _Escherichia_coli_plasmid_peH4H_NC_012690.gbk 173
# _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk 137

# 12 genomes, 2081 sequences

# taxa considered = 12 sequences = 2081 residues = 545474 MIN_BITSCORE_SIM = 16.9

# mask=KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_ (_algCOG)

# skipped genome parsing (pIncAC_homologues/tmp/selected.genomes)


# running BLAST searches ...
# done


# parsing file (COG)
# parsing file (COG) finished

# creating indexes, this might take some time (lines=2.25e+04) ...

# construct_taxa_indexes: number of taxa found = 12
# number of file addresses/BLAST queries = 2.1e+03

# clustering orthologous sequences
# checking lineage-specific expansions
# making COGs
# prunning COGs
# done

# add_unmatched_singletons : 0 sequences, 0 taxa


# looking for valid ORF clusters (n_of_taxa=0)...


# number_of_clusters = 408
# cluster_list = pIncAC_homologues/KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_.cluster_list
# cluster_directory = pIncAC_homologues/KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_

# runtime: 15 wallclock secs ( 0.81 usr  0.14 sys +  2.46 cusr  0.58 csys =  3.99 CPU)
# RAM use: 28.3 MB

3.2.4 Clustering sequences with the OMCL algorithm ($-M$) and compute pan-genome composition ($-c$)

Run with $-t\ 0$ to compute clusters of all sizes, which is required for the pan-genome analysis

Option $-c$ for generating pan-genome composition tables to be used for plotting the core decay and pan-genome growth curveas as a function of increasing number of genomes.

get_homologues.pl -d pIncAC/ -M -t 0 -c

get_homologues.pl -d pIncAC/ -M -t 0 -c     # OMCL, finds 393 clusters in 5 wallclock secs.

# ...
# identifying inparalogs in _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk
# 0 sequences

# running MCL (inflation=1.5) ...
# running MCL finished

# find_OMCL_clusters: parsing clusters (/home/vinuesa/pangenomics/pIncAC_homologues/tmp/all_ortho.mcl)


# adding _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk: core=32 pan=336

# pan-genome (number of genes, can be plotted with plot_pancore_matrix.pl)
# file=pIncAC_homologues/pan_genome_algOMCL.tab
genomes mean    stddev  |   samples
0   167 22  |   172 170 183 137 164 217 170 169 137 154 
1   215 22  |   215 206 184 203 239 262 229 208 200 201 
2   242 15  |   246 245 236 247 258 269 242 243 216 220 
3   273 22  |   274 281 292 282 301 288 270 262 217 263 
4   291 21  |   295 322 294 288 301 312 307 268 251 268 
5   307 15  |   300 329 302 307 303 317 315 315 270 309 
6   316 17  |   302 335 307 307 303 336 334 323 282 330 
7   317 18  |   303 336 307 312 303 340 335 324 282 331 
8   324 17  |   340 336 317 320 309 341 335 324 282 331 
9   330 10  |   343 337 336 321 312 341 335 331 314 331 
10  334 9   |   343 345 336 321 331 341 343 332 317 331 
11  341 4   |   348 346 339 340 338 341 344 340 335 336 

# core-genome (number of genes, can be plotted with plot_pancore_matrix.pl)
# file=pIncAC_homologues/core_genome_algOMCL.tab
genomes mean    stddev  |   samples
0   167 22  |   172 170 183 137 164 217 170 169 137 154 
1   112 22  |   123 106 165 97  134 116 89  90  94  109 
2   93  11  |   90  83  105 94  117 81  83  88  92  95  
3   75  8   |   74  62  83  81  74  74  62  72  91  75  
4   65  13  |   55  58  78  57  72  53  43  69  89  74  
5   57  13  |   53  57  75  49  69  53  35  42  69  72  
6   49  10  |   53  42  54  49  61  41  35  34  66  51  
7   44  8   |   53  35  46  43  61  40  35  34  41  51  
8   42  6   |   51  34  44  41  45  40  35  34  40  51  
9   38  5   |   50  34  33  41  40  39  34  34  35  43  
10  36  4   |   42  32  33  40  32  39  32  34  34  42  
11  32  0   |   32  32  32  32  32  32  32  32  32  32  


# add_unmatched_singletons : 117 sequences, 12 taxa


# looking for valid ORF clusters (n_of_taxa=0)...


# number_of_clusters = 393
# cluster_list = pIncAC_homologues/KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_.cluster_list
# cluster_directory = pIncAC_homologues/KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_

# runtime:  3 wallclock secs ( 1.82 usr  0.13 sys +  1.79 cusr  0.26 csys =  4.00 CPU)
# RAM use: 30.3 MB

3.2.4.1 Plotting core- and pan-genome decay/growth curves with plot_pancore_matrix.pl

Remember that for the OMCL clustering we ran the following code:

get_homologues.pl -d pIncAC/ -M -t 0 -c

The $-c$ flag tells get_homologues.pl to compute the core- and pan-genome composition data by randomly sampling 10 times pairs of genomes to compute the observed numbers of common and new genes. The results are written to the core_genome_algOMCL.tab and pan_genome_algOMCL.tab files, found in the pIncAC_homologues results directory. These tables are used as input for plot_pancore_matrix.pl, which renders the graphics and fits different functions to the data.

The plot_pancore_matrix.pl options
- plot_pancore_matrix.pl -h

[options]: 
-h   this message
-i   input .tab data file                   (required, generated by get_homologues.pl)
-f   type of fit [pan|core_Tettelin|core_Willenbrock|core_both]
                                            (default: core_Tettelin, PubMed:16172379,18088402)
-F   font scale for fitted formulas         (optional, default=0.8)
-a   save snapshots for animations in dir   (optional, example: -a animation)

Let us cd into pIncAC_homologues/, locate the core_genome_algOMCL.tab, and pan_genome_algOMCL.tab files, and compute the graphs with plot_pancore_matrix.pl

cd pIncAC_homologues/

# save the absolute path to this directory in the blast_dir variable
blast_dir=$(pwd)

# these are the relevant input files: core_genome_algOMCL.tab and pan_genome_algOMCL.tab
ls *.tab

# plot core-decay curves (Tettelin & Willenbrock fits)
plot_pancore_matrix.pl -i core_genome_algOMCL.tab -f core_both 

# plot pan-genome growth curves (Tettelin's exponential fit)
plot_pancore_matrix.pl -i pan_genome_algOMCL.tab -f pan

# if you logged in with ssh -X, you may use eog (Eye of GNOME) or firefox to display the png files
#  Note: the & at the end of the command runs it in the background
eog core_genome_algOMCL.tab_core_both.png &
eog pan_genome_algOMCL.tab_pan.png &

Figure 3. The core-genome decay curves for 12 pIncA/C plasmids

Figure 4. The pan-genome growth curve for 12 pIncA/C plasmids

Based on the output shown above, what type of pan-genome (closed or open) do you think best represents this collection of pIncA/C plasmids?

3.3 Configuring a batch script to run multiple get_homologues.pl analyses in non-interactive mode for large datasets

Before exploring the contents of the get_homologues.pl runs stored in the [pIncAC]_homologues directory, a brief note on how to setup multiple sequential get_homologues.pl runs in non-interactive mode. This is the best option to run large datasets which may take hours or days to finish.

This is easy. just type the get_homologues.pl invocations that you would like to run into a text file ($script$), which i have named run_get_hom.batch holding the code shown below:

Contents of the run_get_hom.batch $script$

get_homologues.pl -d pIncAC -t 12 -e -n 2 &> log.BDBH
get_homologues.pl -d pIncAC -G -t 0 &> log.COG
get_homologues.pl -d pIncAC/ -M -t 0 -c &> log.OMCL

The run_get_hom.batch $script$ would run exactly the same analyses that we have run interactively, but in a non-interactive mode.

Note that the output that each get_homologues.pl invocation would normally print to STDOUT (the screen) is redirected to a file called log.[ALGORITHM], making use of the “&>” syntax which instructs to redirect STDOUT and STDERR to the file indicated after the symbols.

3.3.1 Running a batch command $script$ with $nohup$ in the background

Now we can call the batch script in the following manner:

nohup bash run_get_hom.batch &> /dev/null &

The $nohup$ (no hang-up) command combined with the strange-looking “&> /dev/null &” tells the server to run the script in the background and to make the run independent of the login session. This means that after issuing the command, the script runs detached from our terminal. We can shutdown our computer and the process will continue executing on the server.

After some time, we can check the results of the runs by listing the files and directories as shown:

ls -1
log.BDBH
log.COG
log.OMCL
pIncAC
pIncAC_homologues
run_get_hom.batch

Another useful command to know is tail -f log.BDBH, which can be called on any of the logfiles while they are being written to disk, showing their last 10 lines as they’re appended to the tail of the file in real time. Nice!

3.4 Explore the results of get_homologues.pl runs stored in the [pIncAC]_homologues directory

As previously mentioned, GET_HOMOLOGUES writes the results in the **[pIncAC]_homologues/ directory**, stored in the $\$blast\_dir$ variable.

Therein you will find the following files and directories

Regular files
- genome-specific CDSs: .gbk.fna
- genome-specific proteomes: .gbk.faa
- genome-specific proteomes wit simplified header for blastp analyses: .gbk.fasta
- genome-specific blastp databases: .gbk.fasta.phr .gbk.fasta.pin .gbk.fasta.psq
- pair-wise (all-against-all) blastp results as gzip compressed files: genome1.gbk.fasta_genome2.gbk.fasta.blast.gz
- pan-genome composition tables (when using option $-c$: core_genome_algOMCL.tab pan_genome_algOMCL.tab
Directories
- tmp: which holds diverse log-files, combined and sorted blast-tables, genome indices …
- run specific directories like KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_, holding the core and|or pan-genome clusters at the DNA (.fna) and protein (.faa) levels

The run-specific output directories are named using the following components to make them unique for each run configuration:

BuchaphCc_f0_alltaxa_algBDBH_e0_
|         |  |       |       |   
|         |  |       |       -e option was not used (inparalogues are in)
|         |  |       the clustering algorithm is BDBD (default)
|         |  all clusters contain at least 1 sequence from each taxa (default -t behavior)
|         -f option not used (no length filtering)
reference proteome

Let’s explore the results directory pIncAC_homologues/ using basic $Shell$ commands:

# assumes that we are in $blast_dir (the results directory pIncAC_homologues/)
cd $blast_dir

# list
ls | less

ls *faa *fna *fasta
ls *faa *fna *fasta | wc

head -5 Salmonella_enterica_YU39_pIncAC.gbk.f*a

ls Salmonella_enterica_YU39_pIncAC.gbk.f*a.*

# find the subdirectories holding the run-specific clusters of homologous sequences (gene families)
find . -type d

# exlore results (clusters) in the KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_
#   which hods the BDBH results for the 12 genomes (-t 12), without inparalogues (-e)
ls

ls *fna | wc -l
ls *faa | wc -l

grep -c '>' *fna
grep -c '>' *faa

3.5 Computing a consensus core-genome from the BDBH, COGtriangles and OMCL clustering solutions using compare_clusters.pl

GET_HOMOLOGUES is unique among available pangenomics software in many respects. One of them is its capacity to compute consensus core-genome clusters from the solutions generated by the three clustering algorithms it implements, and consensus pan-genome clusters from COGtriangles and OMCL clustering results, as demonstrated below.

The compare_clusters.pl options
- compare_clusters.pl -h

[options]: 
-h   this message
-d   comma-separated names of cluster directories                  (min 1 required, example -d dir1,dir2)
-o   output directory                                              (required, intersection cluster files are copied there)
-n   use nucleotide sequence .fna clusters                         (optional, uses .faa protein sequences by default)
-r   take first cluster dir as reference set, which might contain  (optional, by default cluster dirs are expected
     a single representative sequence per cluster                   to be derived from the same taxa; overrides -t,-I)
-s   use only clusters with syntenic genes                         (optional, parses neighbours in FASTA headers)
-t   use only clusters with single-copy orthologues from taxa >= t (optional, default takes all intersection clusters; example -t 10)
-I   produce clusters with single-copy seqs from ALL taxa in file  (optional, example -I include_list; overrides -t)
-m   produce intersection pangenome matrices                       (optional, ideally expects cluster directories generated
                                                                    with get_homologues.pl -t 0)
-x   produce cluster report in OrthoXML format                     (optional)
-T   produce parsimony-based pangenomic tree

Compute consensus core-genome clusters using compare_clusters.pl

# Compute consensus core-genome clusters using compare_clusters.pl of the GET_HOMOLOGUES suite.
# Note that we will run the compare_clusters.pl script twice, once with the -n flag 
# (to compute the consesus clusters at the DNA level, producing *.fna files)
# and a second instance without the flag, to get the protein clusters (*.faa files)

# assumes we are in $blast_dir
cd $blast_dir

# find the subdirectories holding the run-specific results
find . -type d

# Compute the core genome; use -t 12 to keep only clusters containing a copy of the gene in each of the 12 genomes analyzed
compare_clusters.pl -d ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_,\
./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_,\
./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_ -o core_BCM -t 12 -n

# repeat command, without -n, to compute the clusters at the protein level (*faa files)
compare_clusters.pl -d ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_,\
./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_,\
./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_ -o core_BCM -t 12

# Lets explore the core_BCM directory
cd core_BCM

ls *fna | wc -l
ls *faa | wc -l

grep -c '>' *faa


# confirm that all 31 clusters contain only one sequence from each plasmid/proteome
grep '>' *faa | cut -d\| -f2,3 | sort | uniq -c

# display the Venn diagram of the consensus core-genome clusters
ls *svg

eog venn_t12.svg &

Figure 5. Venn diagram depicting the consensus core-genome of 31 genes for the 12 input pIncA/C sequences.

3.6 Computing a consensus pan-genome with GET_HOMOLOGUES from the COGtriangles and OMCL clusters and visualization of the pangenome’s structure with phylogenies and graphs

To compute a consensus pan-genome, only the $COGtriangles$ and $OMCL$ algorithms should be used! as they don’t require a reference sequence.

Run compare_clusters.pl with the $-t\ 0$ option, to consider clusters of all occupancy sizes (from singletons to multigene families), as demonstrated below.

# Compute consensus pan-genome clusters and matrix using compare_clusters.pl of the 
# GET_HOMOLOGUES suite.
# Note that we run the compare_clusters.pl script twice, once with the -n and -m flags 
# (to compute the consesus clusters at the DNA level, *.fna files
# and the pan-genome matrix) and a second instance without these flags, 
# to get the protein clusters (*.faa files). 
# Note also that we exclude the directory holding the BDBH clusters, 
# as these are not suitable to compute a proper pan-genome matrix, since the 
# BDBH clusters always contain the reference sequence, missing those that do not contain it.

cd $blast_dir

# get the consensus COG-OMCL pan-genome clusters with -t 0 (all sizes) and -m to write pan-genome matrix
compare_clusters.pl -d ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_,\
./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_ -o pan_CM -m -n -T

# repeat without -m and -n, to generate the protein clustes
compare_clusters.pl -d ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_,\
./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_ -o pan_CM

# Explore the pan-genome clusters
cd pan_CM

# how many clusters
ls *faa | wc -l

# print a sorted list of cluster sizes (gene contents of each cluster)
grep -c '>' *faa | sed 's/:/\t/' | sort -nk2
grep -c '>' *faa | sed 's/:/\t/' | sort -nrk2 > pan-genome_cluster_compositon_stats.tsv

# lets explore the structure of the pan-genome
grep -c '>' *faa | sed 's/:/\t/' | cut -f2 | sort -n | uniq -c | awk 'BEGIN{OFS="\t"; print "size\tcount"}{print $2,$1}'


# Lets explore the structure of the pan-genome parsimony tree pangenome_matrix_t0.phylip.ph

figtree pangenome_matrix_t0.phylip.ph &



                    /---+  Escherichia coli plasmid pAPEC1990 61 NC 019066 gbk...                             
                 /--+                                                                                         
                 |  \+  Escherichia coli UMNK88 plasmid pUMNK88 NC 017645 gbk...                              
                 |                                                                                            
                /+  /---+  Escherichia coli plasmid pPG010208 NC 019065 gbk...                                
                ||  |                                                                                         
                |\--+      /------------+  Aeromonas hydrophila plasmid pRA1 NC 012885 gbk...                 
                |   |      |                                                                                  
                |   \------+        /------------+  Klebsiella pneumoniae plasmid pNDM KN NC 019153 gbk...    
                |          |  /-----+                                                                         
 /--------------+          \--+     \--------------+  Escherichia coli plasmid pNDM 1 Dok01 NC 018994 gbk...  
 |              |             |                                                                               
 |              |             \-------+ Salmonella enterica sv Kentucky gbk...                                
 |              |                                                                                             
 |              |     /---------+  Escherichia coli plasmid peH4H NC 012690 gbk...                            
 |              |  /--+                                                                                       
=+              \--+  \  Escherichia coli plasmid pAR060302 NC 012692 gbk...                                  
 |                 |                                                                                          
 |                 \+ Salmonella enterica pAM04528 gbk...                                                     
 |                                                                                                            
 +-------+ Salmonella enterica YU39 pIncAC gbk...                                                             
 |                                                                                                            
 \--------------+ Salmonella enterica 33676 pIncAC gbk...                                                     
                                                                                                              
 |------|------|-----|------|------|------|-----|---                                                          
 0     25     50    75    100    125    150   175                                                             
 substitutions/site



# display the Venn diagram of the consensus pan-genome clusters
ls *svg
eog venn_t0.svg &

Figure 6. Venn diagram depicting the consensus pan-genome clusters for the 12 input pIncA/C sequences contains 373 cluster. The corresponding pan-genome matrix is provided in the test_sequences/pan_genome/ directory

3.7 Graphical displays of the core and pan-genome structure with the auxiliary $script$ parse_pangenome_matrix.pl

Graphical analysis of pangenome’s structure and parsing pan-genome tables to identify lineage-specific genes or family expansions can be conveniently done with the auxiliary $script$ parse_pangenome_matrix.pl, as shown below.

parse_pangenome_matrix.pl options
- parse_pangenome_matrix.pl -h

parse_pangenome_matrix.pl 

[options]: 
-h   this message
-m   input pangenome matrix .tab                               (required, made by compare_clusters.pl)
-s   report cloud,shell,soft core and core clusters            (optional, creates plot if R is installed)
-l   list taxa names present in clusters reported in -m matrix (optional, recommended before using -p option)
-x   produce matrix of intersection pangenome clusters         (optional, requires -s)
-I   use only taxon names included in file                     (optional, ignores -A,-g,-e)
-A   file with taxon names (.faa,.gbk,.nucl files) of group A  (optional, example -A clade_list_pathogenic.txt)
-B   file with taxon names (.faa,.gbk,.nucl files) of group B  (optional, example -B clade_list_symbiotic.txt)
-a   find genes/clusters which are absent in B                 (optional, requires -B)
-g   find genes/clusters present in A which are absent in B    (optional, requires -A & -B)
-e   find gene family expansions in A with respect to B        (optional, requires -A & -B)
-S   skip clusters with occupancy <S                           (optional, requires -x/-a/-g, example -S 2)
-P   percentage of genomes that must comply presence/absence   (optional, default=100, requires -g)
-p   plot pangenes on the genome map of this group A taxon     (optional, example -p 'Escherichia coli K12',
                                                                requires -g, -A & -B, only works with clusters
                                                                derived from input files in GenBank format)

3.7.1 Plotting cloud, shell, soft core and core clusters with parse_pangenome_matrix.pl (-m pangenome_matrix_t0.tab -s)

Use option -s to report and graphically depict cloud, shell, soft core and core clusters

parse_pangenome_matrix.pl -m pangenome_matrix_t0.tab -s

parse_pangenome_matrix.pl -m pangenome_matrix_t0.tab -s

# /export/apps/get_homologues/parse_pangenome_matrix.pl -m pangenome_matrix_t0.tab -I  -A  -B  -a 0 -g 0 -e 0 -p  -s 1 -l 0 -x 0 -P 100 -S 0

# matrix contains 373 clusters and 12 taxa

# cloud size: 183 list: pangenome_matrix_t0__cloud_list.txt
# shell size: 130 list: pangenome_matrix_t0__shell_list.txt
# soft core size: 60 list: pangenome_matrix_t0__softcore_list.txt
# core size: 31 (included in soft core) list: pangenome_matrix_t0__core_list.txt

# using default colors, defined in %COLORS

# globals controlling R plots: $YLIMRATIO=1.2

# shell bar plots: pangenome_matrix_t0__shell.png , pangenome_matrix_t0__shell.pdf
# shell circle plots: pangenome_matrix_t0__shell_circle.png , pangenome_matrix_t0__shell_circle.pdf

# pan-genome size estimates (Snipen mixture model PMID:19691844): pangenome_matrix_t0__shell_estimates.tab
Core.size Pan.size BIC LogLikelihood
2 components 31 375 2897.63753894041 -1439.93640184074
3 components 24 436 1648.86318314942 -809.627645525601
4 components 21 507 1626.68849010487 -792.618720583683  <== best BIC VALUE!
5 components 18 508 1638.26140215658 -792.483598189895
6 components 0 508 1649.91033432187 -792.386485852895
7 components 6 508 1661.76417317612 -792.391826860376
8 components 6 508 1673.59937473543 -792.387849220388
9 components 10 507 1685.47784762243 -792.405507244245
10 components 20 505 1697.57598545494 -792.532997740852
Sample 31 373 NA NA


# occupancy stats:
    cloud   shell   soft_core   core
Salmonella_enterica_33676_pIncAC.gbk    52  46  59  31
Salmonella_enterica_YU39_pIncAC.gbk 43  56  60  31
Salmonella_enterica_pAM04528.gbk    5   92  59  31
Salmonella_enterica_sv_Kentucky.gbk 15  78  58  31
_Aeromonas_hydrophila_plasmid_pRA1_NC_012885.gbk    27  68  53  31
_Escherichia_coli_UMNK88_plasmid_pUMNK88_NC_017645.gbk  5   90  60  31
_Escherichia_coli_plasmid_pAPEC1990_61_NC_019066.gbk    4   88  60  31
_Escherichia_coli_plasmid_pAR060302_NC_012692.gbk   3   105 60  31
_Escherichia_coli_plasmid_pNDM-1_Dok01_NC_018994.gbk    51  98  60  31
_Escherichia_coli_plasmid_pPG010208_NC_019065.gbk   9   70  59  31
_Escherichia_coli_plasmid_peH4H_NC_012690.gbk   4   86  53  31
_Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk    28  51  50  31


# The command produces also lists of genes in each pan-genome "compartment"
ls *list.txt


# >>>>>> Textual exploration of the pangenome_matrix_t0__*_list.txt
#  inspect the pangenome_matrix_t0__*_list.txt for the presence plasmid replication and mobilization genes
grep -Ei 'mob|tra|rep' pangenome_matrix_t0*txt | egrep -v 'transpo|trans' 

# 9.5 inspect the pangenome_matrix_t0__*_list.txt for the presence of 
# beta-lactamase or tetracycline resistance genes
grep -Ei 'bla|NDM|tet|efflux|resist|aminogly|sul|aad|RND' *txt


# 10. Display the barplot depicting the pan-genome structure of 12 pIncA/C plasmids
eog pangenome_matrix_t0__shell.png &

Figure 7. Barplot of the pan-genome structure of 12 pIncA/C plasmids

3.8 Identifying lineage-specific genes using parse_pangenome_matrix.pl

The final GET_HOMOLOGUES feature I’m gonna show is how to parse the pan-genome matrix to identify lineage-specific genes and family expansions.

As shown below on the pan-genome parsimony tree we computed previously with compare_clusters.pl -T, there are two plasmids labeled as pNDM. These contain the “New Dehli metallo-betaclactamase” NDM1, which confers resistance to essentially all beta-lactams, including carbepenems, except aztreonam.


                    /---+  Escherichia coli plasmid pAPEC1990 61 NC 019066 gbk...                             
                 /--+                                                                                         
                 |  \+  Escherichia coli UMNK88 plasmid pUMNK88 NC 017645 gbk...                              
                 |                                                                                            
                /+  /---+  Escherichia coli plasmid pPG010208 NC 019065 gbk...                                
                ||  |                                                                                         
                |\--+      /------------+  Aeromonas hydrophila plasmid pRA1 NC 012885 gbk...                 
                |   |      |                                                                                  
                |   \------+        /------------+  Klebsiella pneumoniae plasmid pNDM KN NC 019153 gbk...   <== 
                |          |  /-----+                                                                         
 /--------------+          \--+     \--------------+  Escherichia coli plasmid pNDM 1 Dok01 NC 018994 gbk... <==
 |              |             |                                                                               
 |              |             \-------+ Salmonella enterica sv Kentucky gbk...                                
 |              |                                                                                             
 |              |     /---------+  Escherichia coli plasmid peH4H NC 012690 gbk...                            
 |              |  /--+                                                                                       
=+              \--+  \  Escherichia coli plasmid pAR060302 NC 012692 gbk...                                  
 |                 |                                                                                          
 |                 \+ Salmonella enterica pAM04528 gbk...                                                     
 |                                                                                                            
 +-------+ Salmonella enterica YU39 pIncAC gbk...                                                             
 |                                                                                                            
 \--------------+ Salmonella enterica 33676 pIncAC gbk...                                                     
                                                                                                              
 |------|------|-----|------|------|------|-----|---                                                          
 0     25     50    75    100    125    150   175                                                             
 substitutions/site

As seen on the tree, the two pNDMs form a monophyletic group.

Let’s now find out what unique genes are carried specifically by these two plasmids.

# Detection of lineage-specific genes using parse_pangenome_matrix.pl
#    assumes that you are still in /home/vinuesa/pangenomics/pIncAC_homologues/pan_CM

# find the clusters containing the NDM-1 genes
grep -l blaNDM-1 *faa

# find the plasmids that hold these genes containing the NDM-1 genes
for f in $(grep -l blaNDM-1 *faa); do echo "### $f"; grep '>' $f | cut -d\| -f4; done


# generate the lists of genomes to be compared for lineage specific genes (in list A vs. B)
panGmat_dir=$(pwd)
cd ${gbk_dir}
ls *gbk | grep pNDM > listA_pNDB
ls *gbk | grep -v pNDM > listB_nonNDB
cd $panGmat_dir

# >>> parse the pangenome matrix file to find the listA-specific genes
parse_pangenome_matrix.pl -A $gbk_dir/listA_pNDB -B $gbk_dir/listB_nonNDB -g -m pangenome_matrix_t0.tab


# /export/apps/get_homologues/parse_pangenome_matrix.pl -m pangenome_matrix_t0.tab -I  -A /home/vinuesa/pangenomics/pIncAC/listA_pNDB -B $/pangenomics/pIncAC/listB_nonNDB -a 0 -g 1 -e 0 -p  -s 0 -l 0 -x 0 -P 100 -S 0

# matrix contains 373 clusters and 12 taxa

# taxa included in group A = 2

# taxa included in group B = 10


# finding genes present in A which are absent in B ...
# file with genes present in set A and absent in B (19): pangenome_matrix_t0__pangenes_list.txt


# explore the file with genes present in set A and absent in B (19): pangenome_matrix_t0__pangenes_list.txt
#  i.e. present only in the NDM-1 expressing plasmids
cat pangenome_matrix_t0__pangenes_list.txt

# /export/apps/get_homologues/parse_pangenome_matrix.pl -m pangenome_matrix_t0.tab -A /home/vinuesa/pangenomics/pIncAC/listA_pNDB -B /home/vinuesa/pangenomics/pIncAC/listB_nonNDB -g 1 -e 0 -p  -P 100 -S 0
# genes present in set A and absent in B (19):
2020_hypothetical_protein.fna
2021_type_I_restriction-m...fna
2022_anticodon_nuclease.fna
2023_restriction_modifica...fna
2024_ATP-dependent_DNA_he...fna
2025_DNA_replication_and_...fna
2026_type_I_site-specific...fna
2027_Tn7-like_transpositi...fna
2028_Tn7-like_transpositi...fna
2029_Tn7-like_transpositi...fna
2030_Tn7-like_transpositi...fna
2033_truncated_Rhs_family...fna
2036_GroES.fna
2037_CutA1.fna
2038_oxidoreductase_domai...fna
2039_phosphoribosylanthra...fna
2040_bleMBL.fna
2041_NDM-1.fna
2043_RmtB.fna

3.8.1 Mapping the pNDM-specific genes on a plasmid map

As seen from the output above, the cluster numbers are consecutive, indicating that the pNDM genes are part of a larger gene cluster. This can be visually confirmed by mapping the specific genes on the map of one of the “listA” genomes

The parse_pangenome_matrix.pl options, again
- parse_pangenome_matrix.pl

[options]: 
-h   this message
-m   input pangenome matrix .tab                               (required, made by compare_clusters.pl)
-s   report cloud,shell,soft core and core clusters            (optional, creates plot if R is installed)
-l   list taxa names present in clusters reported in -m matrix (optional, recommended before using -p option)
-x   produce matrix of intersection pangenome clusters         (optional, requires -s)
-I   use only taxon names included in file                     (optional, ignores -A,-g,-e)
-A   file with taxon names (.faa,.gbk,.nucl files) of group A  (optional, example -A clade_list_pathogenic.txt)
-B   file with taxon names (.faa,.gbk,.nucl files) of group B  (optional, example -B clade_list_symbiotic.txt)
-a   find genes/clusters which are absent in B                 (optional, requires -B)
-g   find genes/clusters present in A which are absent in B    (optional, requires -A & -B)
-e   find gene family expansions in A with respect to B        (optional, requires -A & -B)
-S   skip clusters with occupancy <S                           (optional, requires -x/-a/-g, example -S 2)
-P   percentage of genomes that must comply presence/absence   (optional, default=100, requires -g)
-p   plot pangenes on the genome map of this group A taxon     (optional, example -p 'Escherichia coli K12',
                                                                requires -g, -A & -B, only works with clusters
                                                                derived from input files in GenBank format)

Let’s use option $-p$ to plot pangenes on the genome map of _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk, a representative of “group A” genomes saved in listA_pNDB.

parse_pangenome_matrix.pl -A $gbk_dir/listA_pNDB -B $gbk_dir/listB_nonNDB -g -m pangenome_matrix_t0.tab -p _Klebsiella_pneumoniae_plasmid

#  plot the positions of listA-specific genes on one of the genomes of this group
#      use option -l to list taxa names present in clusters reported in -m matrix
parse_pangenome_matrix.pl -l -m pangenome_matrix_t0.tab
parse_pangenome_matrix.pl -A $gbk_dir/listA_pNDB -B $gbk_dir/listB_nonNDB -g -m pangenome_matrix_t0.tab -p "Klebsiella pneumoniae"

# visualize the mapping of the pNDM-specific genes on _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk, as a representative of "group A"
eog pangenome_matrix_t0_Klebsiellapneumoniae_NC_019153.png &

Figure 8. Mapping the pNDM-specific genes on _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk, as a representative of “group A”

After localizing the genomic region of interest, we can plot it with add hoc tools like snapgene-viewer

the full pNDM-KN_NC_019153 map

The unique region, spanning region between nucleotides 79,000 .. 111,500 of pNDM-KN_NC_019153

and a zoom-in into the blaNDM-1 region

3.9 Deactivate the get_homologues conda environment

Before leaving, run:

# Deactivate the get_homologues conda environment
conda deactivate

to deactivate the get_homologues conda environment.

That’s it, cheers.

4 Estimating core- and pan-genome maximum-likelihood phylogenies with the GET_PHYLOMARKERS suite

GET_PHYLOMARKERS can be used to estimate maximum-likelihood phylogenies from core- and pan-genome datasets computed by GET_HOMOLOGUES.

The main usage of the GET_PHYLOMARKERS pipeline is to estimate core-genome phylogenies with selected optimal markers, or a maximum-likelihood pangenome phylogeny estimated from the pangenome matrix.

To estimate core-genome phylogenies, GET_PHYLOMARKERS (rel. 2.2.1 2024-04-18) first runs a series of sequential filters (Fig. 8 and explained below) to identify markers with optimal attributes for phylogenomic inference, that satisfy the following criteria:

Runs the maximal matched-pairs tests to assess violations of the data to the Stationarity, Reversibility, and Homogeneity (SRH) assumptions made by maximum-likelihood phylogenetic models, as implemented in IQ-TREE.
Codon (gene) alignments without significant signals of homologous recombination (intragene mosaicism).
Gene trees without anomalous branch-lengths or topologies, as expected under the multispecies-coalescent null model (KDE test).
Compute bipartition support values under best-fitting substitution models to discard poorly-resolved trees.

Figure 8 The GET_PHYLOMARKERS pipeline

GET_PHYLOMARKERS functionality overview
- Estimates gene-trees and species-trees under the maximum likelihood (ML) optimality criterion using state-of-the-art ML tree searching algorithms.
- The species tree is estimated using a concatenation-independent approach from top-scoring ML gene tress using wASTRAL from the ASTER* package, and from the supermatrix of concatenated, top-scoring alignments that passed the quality filters (Fig. 8), using either IQTree by default, or FastTree.
- The stringency of these filters and the thoroughness of the ML tree searches can be controlled by the user, although sensible defaults are provided, making it an easy-to-use, user-friendly software.
- The pan-genome phylogeny can be estimated under the parsimony or maximum-likelihood optimality criteria, using the pan-genome matrix computed by GET_HOMOLOGUE’s compare_clusters.pl $script$
- In run mode 2 (run_get_phylomarkers_pipeline.sh -R 2) for population genetics (multiple sequences from the same species), GET_PHYLOMARKERS computes a population tree from a SNP matrix of top-scoring, neutral alignments.

Third-party binaries distributed with GET_PHYLOMARKERS as of rel. 2
- clustal omega. Multiple sequence alignment software. Sievers et al. 2011.
- parallel. Executes processes in parallel on multicore machines
- Phi test. Recombination test software. Bruen et al. 2006
- FastTree: Fast maximum-likelihood tree searching program. Price et al. 2010
- ModelFinder: Fast model selection for accurate phylogenetic estimates. (Kalyaanamoorthy et al. 2017)
- IQ-TREE. Highly accurate maximum-likelihood tree searching program. Nguyen et. al (2015)
- ASTER*. A Package for Large-scale Phylogenomic Reconstructions. Zhang et al. 2025.
- paup*. Multipurpose phylogenetics software package developed by David Swofford and colleagues.
- pars, seqboot and consense from Joe Felsenstein’s PHYLIP package.
- snp-sites. Finds SNP sites from a multi-FASTA alignment file.
- nw_reroot and nw_support from the Newick utilities package.
- you may also need to install bc,an arbitrary-precision language for performing math calculations with Bash and other shells

4.1 GET_PHYLOMARKERS usage

The GET_PHYLOMARKERS package implements several analysis options. Here we will review only some of them. Visit the GET_PHYLOMARKERS manual

4.1.1 Usage synopsis

The pipeline (Fig. 8) is run by executing the main script run_get_phylomarkers_pipeline.sh inside a folder containing twin *.fna and *.faa FASTA files for orthologous single-copy CDSs and translation products (Fig. 8).
There are two runmodes: -R 1 (for phylogenetics) and -R 2 (for population genetics).
The pipeline can be run on two molecular types: DNA or protein sequences (-t DNA|PROT). The latter is intended for the analysis of more divergent genome sequences, typically above the genus level.
From version 2.0 onward, GET_PHYLOMARKERS uses either the FastTree (FT) or IQ-TREE (IQT) fast ML tree search algorithms, controlled with the -A [F|I] option [default: I], respectively.
As of version 2.0 (January 22, 2018), GET_PHYLOMARKERS uses IQ-TREE version 1.6.1 (released Dec. 28th, 2017) which implements a fast search option, which almost matches the speed of FastTree, but retaining the accuracy of IQ-TREE 1.5.* Model-selection is performed with the -fast flag to for maximal speed both for gene- and species tree searches.
The global molecular-clock hypothesis can be evaluated for DNA (codon) alignments (-R 1 -t DNA -K). It is not yet implemented for protein sequences.
GET_PHYLOMARKERS can compute basic descriptive statistics of population DNA polymorphisms and neutrality tests when run in population genetics mode (-R 2).

4.1.1.1 GET_PHYLOMARKERS 2

Starting with release 2.0.0 (2022-11-20), GET_PHYLOMARKERS also computes a concatenation-independent species tree from the ML gene trees estimated from top-scoring alignments using ASTRAL-III.
Release 2.1.0 (2024-03-31) introduced maximal matched-pairs tests to assess violations of the data to the Stationarity, Reversibility, and Homogeneity (SRH) assumptions made by maximum-likelihood phylogenetic models, as implemented in IQ-TREE. Additionally, ASTRAL-IV is used to estimate the species tree directly from the filtered ML source gene trees, which computes terminal and internal branch lengths in substitution-per-site units.
Release 2.2.0 (v2.2.0, 2024-04-14) introduced new features, most notably:
- significant extension of run mode 2 (run_get_phylomarkers_pipeline.sh -R 2) for population genetics (multiple sequences from the same species), computing a population tree from a SNP matrix of top-scoring, neutral alignments.
- Reporting of a detailed numeric overview of the different filtering steps
- run_get_phylomarkers_pipeline.sh now also calls the C binary WEIGHTED-ASTRAL to estimate a species tree using as input the filtered gene trees estimated by iqtree2 or FastTree
- Complex protein mixture models are used for concatenated protein alignments
Release 2.2.1 (2024-04-16) includes a static binary of snp-sites, which is called by run_get_phylomarkers_pipeline.sh >= v2.8.1.0_2024-04-15 under run mode 2 (-R 2, population genetics) to generate SNP matrices in FASTA and VCF formats from the concatenated alignments of filtered, highly informative, and neutral loci. The FASTA SNP matrix is used for estimating a ML population tree. Thanks to Alfredo Hernández @ccg_unam for compiling snp-sites-static.

4.1.2 GET_PHYLOMARKERS $scripts$

Shell scripts
- estimate_pangenome_phylogenies.sh
- run_get_phylomarkers_pipeline.sh
- run_parallel_molecClock_test_with_paup.sh
- run_pexec_cmmds.sh
R scripts
- compute_suppValStas_and_RF-dist.R
- install_R_deps.R
- run_kdetrees.R
Perl scripts
- add_labels2tree.pl
- add_nos2fasta_header.pl
- concat_alignments.pl
- convert_aln_format_batch_bp.pl
- pal2nal.pl
- popGen_summStats.pl
- remove_uninformative_sites_from_aln.pl
- rename.pl
- run_parallel_cmmds.pl

4.1.3 Options and parameters to control de behaviour of run_get_phylomarkers_pipeline.sh

run_get_phylomarkers_pipeline.sh is the core $script$ of the GET_PHYLOMARKERS distribution. It implements many sophisticated phylogenetic analysis features using state-of-the-art methods and software. It is highly configurable, as reflected in the many options that an advanced user can set to fine-tune its behavior.

run_get_phylomarkers_pipeline.sh options can be displayed with any of the following equivalent commands
- run_get_phylomarkers_pipeline.sh
- run_get_phylomarkers_pipeline.sh -h

 
  run_get_phylomarkers_pipeline.sh v.2.8.1.4_2024-04-17 OPTIONS:

   REQUIRED:
    -R  RUNMODE
       1 select optimal markers for phylogenetics/phylogenomics (genomes from different species)
       2 select optimal markers for population genetics (genomes from the same species)
    -t  type of input sequences: DNA|PROT

   OPTIONAL:
     -h|--help flag, print this short help notes
     -H flag, print extensive Help and details about search modes and models
     -A  Algorithm for tree searching:  [FastTree|IQ-TREE]                            [default:I]
     -c  NCBI codontable number (1-23) for pal2nal.pl to generate codon alignment         [default:11]
     -C flag to print codontables
     -D  if you encounter problems                [default: 0]
     -e  select gene trees with at least (min. = 4) external branches  [default: 4]
     -f  GET_HOMOLOGUES cluster format                                            [default: STD]
     -I|--IQT_threads  threads/cores for IQTree2 searching                                [default: 2]
     -k  kde stringency (0.7-1.6 are reasonable values; less is more stringent)              [default: 1.5]
     -K flag to run molecular clock test on codon alignments                                       [default: 0]
     -l  max. spr length (7-12 are reasonable values)                                     [default: 8]
     -m  min. average support value (0.7-0.8 are reasonable values) for trees to be selected [default: 0.7]
     -M  base Model for clock test (use one of: GTR|TrN|HKY|K2P|F81); uses +G in all cases [default: GTR]
     -n  number of cores/threads to use for parallel computations except IQT searches     [default: all cores]
     -N  number of IQ-TREE searches to run [only active with -T high]                     [default: 5]
     -q  quantile (0.95|0.99) of Chi-square distribution for computing molec. clock p-value  [default: 0.99]
     -r  root method (midpoint|outgroup)                                                   [default: midpoint]
     -s  number of spr rounds (4-20 are reasonable values) for FastTree tree searching    [default: 4]
     -S  quoted 'comma-separated list' of base models to be evaluated by IQ-TREE 
           when estimating the species tree from the concatenated supermatrix (see -H for details). 
       If no -S is passed, then sinlge default models are used, as shown below
           <'JC,F81,K2P,HKY,TrN,TNe,K3P,K81u,TPM2,TPM2u,TPM3,TPM3u,
         TIM,TIMe,TIM2,TIM2e,TIM3,TIM3e,TVM,TVMe,SYM,GTR'> for DNA alignments              [default: GTR]
           <'BLOSUM62,cpREV,Dayhoff,DCMut,FLU,HIVb,HIVw,JTT,JTTDCMut,LG,
             mtART,mtMAM,mtREV,mtZOA,Poisson,PMB,rtREV,VT,WAG'> for PROT alignments            [default: LG]
     -T  tree search Thoroughness: high|medium|low|lowest (see -H for details)             [default: high]
     -v|--version flag, print version and exit
     -V|--Versions flag, print software versions


   INVOCATION EXAMPLES:
     1. default on DNA sequences (uses IQ-TREE evaluating a subset of models specified in the detailed help)
          run_get_phylomarkers_pipeline.sh -R 1 -t DNA -I 4
     2. thorough FastTree searching and molecular clock analysis on DNA sequences using 10 cores:
          run_get_phylomarkers_pipeline.sh -R 1 -t DNA -A F -k 1.2 -m 0.7 -s 20 -l 12 -T high -K -M HKY -q 0.95 -n 10
     3. FastTree searching on a huge protein dataset for fast inspection
          run_get_phylomarkers_pipeline.sh -R 1 -A F -t PROT -m 0.6 -k 1.0 -T lowest
     4. To run the pipeline on a remote server, we recommend using the nohup command upfront, as shown below:
        nohup run_get_phylomarkers_pipeline.sh -R 1 -t DNA -S 'TNe,TVM,TVMe,GTR' -k 1.0 -m 0.75 -T high -N 5 -I 8 &> /dev/null &
     5. Run in population-genetics mode (generates a table with descritive statistics for DNA-polymorphisms 
          and the results of diverse neutrality test)
          run_get_phylomarkers_pipeline.sh -R 2 -t DNA

   NOTES
     1: run from within the directory holding core gene clusters generated by get_homologues.pl -e or
          compare_clusters.pl with -t no_genome_gbk files (core genome: all clusters with a single gene copy per genome)
     2: If you encounter any problems, please run the script with the -D -V flags added at the end of the command line,
          redirect STOUT to a file and send us the output, so that we can better diagnose the problem.
          e.g. run_get_phylomarkers_pipeline.sh -R 1 -t DNA -k 1.0 -m 0.7 -s 8 -l 10 -T high -K -D &> ERROR.log

For the details, please visit the manual

4.2 Searching for the best core-genome phylogeny of the pIncA/C plasmids with GET_PHYLOMARKERS

# assumes that you are within test_sequences/pIncAC_homologues/ or in test_sequences/core_BCM
cd core_BCM

# 1. make sure we have the same nuber of faa and fna cluster files; 
#    Note that GET_PHYLOMARKERS will check that for you automatically ;)
find . -name '*.faa' | wc -l  # 31
find . -name '*.fna' | wc -l  # 31

# 2.1 view the help menu of the main script run_get_phylomarkers_pipeline.sh
run_get_phylomarkers_pipeline.sh -h

# 2.2 run the pipeline under default values. See the manual for the many additional options
run_get_phylomarkers_pipeline.sh -R 1 -t DNA -k 1.0 # takes ~18 seconds on chaac.

4.2.1 Overview of the GET_PHYLOMARKERS pipeline filtering steps

GET_PHYLOMARKERS runs six major analysis blocks, printing informative progress messages to screen (STDOUT) and to a logfile.

The main code blocks do the following:

Print run parameters and run sanity checks in input data
Generate and filter alignments
Estimate and filter gene trees
Compute a concatenation-independent species tree from top-scoring gene trees with wASTRAL
Concatenate top-scoring alignments and estimate a ML species tree from the resulting supermatrix
Tidy directories and summarize results

Compare these steps with the pipeline’s flowchart shown in Fig. 8 to make sure you understand what computations are performed at each step.

4.2.1.1 Parameterization report and user input sanity checks

The pipeline first reports the parameters used to call it, followed by a sanity check of the input .*fna and .faa files, making sure they contain the same number of sequences and a single instance of each genome, as expected for orthologous clusters.

Fig. 9. Pipeline parameters and user input sanity checks

4.2.1.2 Compute and filter multiple sequence alignments

The pipeline runs multiple sequence alignments in parallel using $clustal-omega$ for the 31 consensus loci to generate codon-alignments
Runs the maximal matched-pairs tests to assess violations of the data to the Stationarity, Reversibility, and Homogeneity (SRH) assumptions
Alignments passing the previous filter are subjected to the $phi(w)$ test to scan for recombinant sequences. Note that in our data set some of the loci lack of sufficient polymorphisms for the test to work, as signaled by the warning messages. There is no evidence for recombinant loci in the test set

Fig. 10. Compute and filter multiple sequence alignments with the SHR and Phi tests

4.2.1.3 Estimate and filter gene trees

Parallel IQ-TREE gene tree searches follow. Trees lacking at least one internal branch (i.e. with < 4 terminals) are detected and discarded. There are five of them in our example
The KDEtrees test identifies “outlier phylogenies” with anomalous branch lenghts or topologies. One outlier identified
Average support values of gene trees are computed for the remaining 25 loci to discard poorly resovled ones: 10 in this dataset

Fig. 11. Estimate and filter gene trees with the KDEtrees and global support value tests

4.2.1.4 Compute a concatenation-independent species tree from top-scoring gene trees with ASTRAL-IV and wASTRAL

A new directory is created containing the top markers that passed the previous filters, 14 in our case. - Astral IV and wASTRAL is called on the top-scoring gene trees to compute a concatenation-independent species tree - Gene concordance factors (gcFs) are computed for each source gene tree against the ASTRAL-IV and wASTRAL trees - A supermatrix is generated from the 14 top-scoring alignments passing the previous filters

Fig. 12. Compute a concatenation-independent species tree from top-scoring gene trees with wASTRAL

4.2.1.5 Concatenate top-scoring alignments and estimate a ML species tree from the resulting supermatrix

IQ-TREE (default) is used by default to find the best substitution model (DNA or protein) with ModelFinder, and estimating the species tree from the supermatrix, reporting the best model and lnL score found.
Robinson-Fould distances are computed between each gene tree and the species tree

Fig. 13. Concatenate top-scoring alignments and estimate a ML species tree from the resulting supermatrix

4.2.1.6 Tidy directories and summarize results

The script finally does some cleanup in the different working directories, summarizing the pipeline’s filtering results, and listing the files produced

Fig. 14. Tidy directories and summarize results

4.2.2 Exploring the pipeline’s results directories

After the run finishes, we can start exploring the directories holding the results.


# >>> explore the directories:
$ ls
get_phylomarkers_run_AIR1tDNA_k1.0_m0.7_Thigh_10Aug25
get_phylomarkers_run_AIR1tDNA_k1.0_m0.7_Thigh_10Aug25.log

cd get_phylomarkers_run_AIR1tDNA_k1.0_m0.7_Thigh_10Aug25   # holds the compressed input FASTAS and alignments,
                                                           # the SRHtest summariy tables, and the
                                                           # overview tables summarizing the results

$ ls

codon_alignments.tgz    parsed_SRHtests.tsv                 pipeline_filtering_overview.tsv     SRHtests_failed.tgz  SRHtests_passed.tsv
numbered_faa_files.tgz  PhiPack                             pipeline_output_files_overview.tsv  SRHtests_failed.tsv  tree_labels.list
numbered_fna_files.tgz  pipeline_figure_files_overview.tsv  protein_alignments.tgz              SRHtests_OK.tgz

4.2.2.1 The PhiPac directory holds the results of the Phi recombination tests


# >>> cd into PhiPac/ dir, which holds the results of the recombination tests.
$ cd PhiPack 

# have a peak a the file
head Phi_results_07Aug25.tsv
## ./1961_hypothetical_protein_cdnAln_Phi.log   1.00e+00    1.00e+00
## ./1962_DNA_topoisomerase_II_cdnAln_Phi.log   1.00e+00    1.00e+00
## ./1964_hypothetical_protein_cdnAln_Phi.log   1.00e+00    1.00e+00
## ./1966_site-specific_DNA_me_cdnAln_Phi.log   1.00e+00    1.00e+00
## ./1967_hypothetical_protein_cdnAln_Phi.log   1   1   TOO_FEW_INFORMATIVE_SITES
## ./1968_hypothetical_protein_cdnAln_Phi.log   1.00e+00    1.00e+00
## ./1969_putative_type_I_rest_cdnAln_Phi.log   1   1   TOO_FEW_INFORMATIVE_SITES
## ./1970_KorB_cdnAln_Phi.log   1.00e+00    1.00e+00
## ./1971_ParA_cdnAln_Phi.log   1.00e+00    1.00e+00
## ./1984_hypothetical_protein_cdnAln_Phi.log   1.00e+00    1.00e+00

4.2.2.2 The non_recomb_cdn_alns/non_recomb_FAA_alns directories hold the results of phylogenetic tests on gene trees: kdetrees, model-selection and signal

Depending on the type of the input sequences (-t DNA|PROT), the results of the different phylogenetic filters will be stored either in the non_recomb_cdn_alns (-t DNA) or in the non_recomb_FAA_alns (-t PROT) directories.

# >>> cd into the directory holding the non-recombinant sequences, 
# which contains the results of several phylogenetic attributes of the loci
cd non_recomb_cdn_alns/

# 1.1 to identify the kde-test outlier loci see the contents of kde_outliers, in this case only 1 "outlier" locus was detected
ls kde_outliers

# 1.2 the details of the kdetrees test are provided in tsv-formatted file kde_dfr_file_all_gene_trees.tre.tab
head kde_dfr_file_all_gene_trees.tre.tab
## file kde_topo_dens   kde_topo_test   kde_bl_dens kde_bl_test kde_bl_topo_test
## 1961_hypothetical_protein_cdnAln.fasta.treefile  0.288751935193142   ok  63.2675110841703    ok  ok
## 1962_DNA_topoisomerase_II_cdnAln.fasta.treefile  0.252919761356219   ok  28.7241410392773    ok  ok
## 1964_hypothetical_protein_cdnAln.fasta.treefile  0.287751455109637   ok  17.2677657886815    ok  ok
## 1966_site-specific_DNA_me_cdnAln.fasta.treefile  0.240400359789596   outlier 27.0113341481227    ok  outlier
## 1967_hypothetical_protein_cdnAln.fasta.treefile  0.252639796140819   ok  95.8112601171774    ok  ok
## 1968_hypothetical_protein_cdnAln.fasta.treefile  0.297536609843425   ok  96.2903342166101    ok  ok
## 1969_putative_type_I_rest_cdnAln.fasta.treefile  0.314498361683638   ok  95.2603170008101    ok  ok
## 1970_KorB_cdnAln.fasta.treefile  0.331685645962565   ok  79.0447022830493    ok  ok
## 1971_ParA_cdnAln.fasta.treefile  0.334991727858742   ok  55.4211725899592    ok  ok

# 2.1 have a look at the counts of best-fitting models selected
cat IQT_best_model_counts_for_gene_trees.tsv

## model    count
## HKY+F    2
## HKY+F+I  1
## K2P  22
## K2P+G4   1
## K3P  1
## TIMe 1
## TPM2 1
## TPM3 1

# 2.2 and locus-specific stats are provied in IQT_DNA_gene_tree_Tmedium_stats.tsv
$ column -t IQT_DNA_gene_tree_Thigh_stats.tsv
## alignment                                     wc_secs  CPU_secs  lnL        model    s_type
## ./1961_hypothetical_protein_cdnAln.fasta.log  0.062    0.060     -1104.427  K3P      IQTdnaThigh
## ./1962_DNA_topoisomerase_II_cdnAln.fasta.log  0.143    0.140     -4124.454  HKY+F+I  IQTdnaThigh
## ./1964_hypothetical_protein_cdnAln.fasta.log  0.136    0.132     -1225.128  K2P+G4   IQTdnaThigh
## ./1966_site-specific_DNA_me_cdnAln.fasta.log  0.065    0.064     -2183.089  K2P      IQTdnaThigh
## ./1967_hypothetical_protein_cdnAln.fasta.log  0.046    0.045     -870.935   K2P      IQTdnaThigh
## ./1968_hypothetical_protein_cdnAln.fasta.log  0.048    0.051     -1179.667  K2P      IQTdnaThigh
## ./1969_putative_type_I_rest_cdnAln.fasta.log  0.043    0.039     -1111.929  K2P      IQTdnaThigh
## ./1970_KorB_cdnAln.fasta.log                  0.053    0.054     -2112.817  TIMe     IQTdnaThigh
... Truncated

# 3. graphical summaries of the results of the kdetrees and the distribuions of support-values are provided as svg files.
ls -1 *svg
## dotplot_and_bxplot_kdeDensity_dist_dissim_topo_TRUE.svg
## parallel_bxplots_kdeDensity_dist_dissim_topo_TRUE-FALSE.svg
## scatterplot_for_gene_tree_support_values.svg

Figure 15 below depicts the results of the non-parametric KDEtrees test, run at the default stringency level of k = 1.0. As depicted on the graph, only one outlier is detected based on the topology (lower panel). Note that this and following graphs plot the results of a small (toy) dataset. They become more useful with larger ones.

IMPORTANT NOTE: to visualize the figures, you need to access the corresponding files from your local host, as the Docker container or Conda environments do not provide a graphical environment with visualization tools! Just open a second terminal (or use your system’s file browser) to access the working directory and display the figures.

Figure 16 depicts a scatterplot and a histogram summarizing the distribution of mean SH-support values for the 25 gene trees that reached this point in the pipeline.

4.2.2.3 The top_N_markers_geNperc directory holds the supermatrix and species tree estimated from the concatenated top-ranking markers and other phylogenetic analyses performed on them

Finally we will inspect the contents of the top_14_markers_ge70perc/ directory, which holds the top-scoring markers that passed all of the above-mentioned filters (Fig. 1). In this directory you will find:

Tthe ASTRAL analysis of gene trees and corresponding concatenation-independent species tree
The supermatrix resulting from their concatenation and the “species-tree” estimated from it under the best-fitting model identified by ModelFinder/IQ-TREE.
It contains also the resutls of RF-distances of gene trees to species-tree and the global molecular clock test results if the main script was invoked with -R 1 -t DNA -K

# >>> inspecting the contents of the top_14_markers_ge70perc/ directory
cd top_14_markers_ge70perc

# The concatenation-independent species tree computed from top-scoring ML gene trees with wASTRAL.
figtree astral_top14geneTrees_ed.sptree &

# The concatenation coordinates for the supermatrix are saved in concatenation_coordinates.txt
$ cat concatenation_coordinates.txt

## # concatenation coordinates:
## 1961_hypothetical_protein_cdnAln.fasta = 1-615
## 1966_site-specific_DNA_me_cdnAln.fasta = 616-2040
## 1968_hypothetical_protein_cdnAln.fasta = 2041-2712
## 1970_KorB_cdnAln.fasta = 2713-3888
## 1984_hypothetical_protein_cdnAln.fasta = 3889-4221
## 1989_hypothetical_protein_cdnAln.fasta = 4222-5076
## 1990_hypothetical_protein_cdnAln.fasta = 5077-5592
## 1994_DSBA_oxidoreductase_cdnAln.fasta = 5593-6450
## 1995_putative_signal_pept_cdnAln.fasta = 6451-7356
## 1996_hypothetical_protein_cdnAln.fasta = 7357-8208
## 1997_hypothetical_protein_cdnAln.fasta = 8209-8649
## 1998_hypothetical_protein_cdnAln.fasta = 8650-9168
## 1999_hypothetical_protein_cdnAln.fasta = 9169-9627
## 2012_TraF_cdnAln.fasta = 9628-10641

# Some phylogenetic properties of the markers are summarized in sorted_aggregated_support_values4loci.tab
# graphical analysis of RF-distances of gene-trees to the species-tree is found in 
#  scatterplot_RF-dist_of_gene_trees2concat_phylo.svg

# The final tree concat_nonRecomb_KdeFilt_iqtree_GTR+F+ASC_ed.sptree can be conveniently 
#  visualized and edited with figtree
figtree concat_nonRecomb_KdeFilt_iqtree_GTR+F+ASC_ed.sptree &

Figure 17 depicts the concatenation-independent species tree computed from top-scoring ML gene trees with wASTRAL.

Figure 18 depicts the RF-distances of the 14 top-scoring markers to the ML tree inferred under the GTR+F+ASC substitution model from the concatenation of these loci (shown in Fig. 19)

Figure 19 depicts the best ML “species tree” inferred under the GTR+F+ASC substitution model from the supermatrix of 15 top-scoring markers. The nodes are colored according to the legend. The values on the left correspond to approximate Bayes branch support values and those to the right to the UFBoot values described in the manual.

4.3 Estimating the ML pan-genome phylogeny of the pIncA/C plasmids with GET_PHYLOMARKERS

# assumes that you are within test_sequences/pIncAC_homologues/, or in test_sequences/pan_genome
cd pan_CM

# 1. find the pangenome_matrix
ls pangenome_matrix*


# 2. run estimate_pangenome_phylogenies.sh launching 5 independent iqtree searches, 
#    fitting binary (BIN) models
estimate_pangenome_phylogenies.sh -f pangenome_matrix_t0.fasta -r 5

4.3.1 Inspecting and visualizing the output of the ML pan-genome tree

The estimate_pangenome_phylogenies.sh reports its progress to the screen, informing that: - it created and moved into subdir iqtree_PGM_5_runs, which holds the results of the analysis - the best-fitting binary model was GTR2+FO - after determining the best model, it performs 10 independent IQ-TREE searches - Best IQ-TREE run was: abayes_UFBboot_run5.log:BEST SCORE FOUND : -1897.260 - wrote file best_PGM_IQT_abayes_UFBboot_run9_GTR2+FO.treefile in pan_CM/iqtree_PGM_5_runs/iqtree_10_runs

Lets have a look at the tree search profile

cd iqtree_PGM_10_runs/iqtree_10_runs

cat sorted_lnL_scores_IQ-TREE_searches.out
## abayes_UFBboot_run3.log:BEST SCORE FOUND : -1897.260
## abayes_UFBboot_run2.log:BEST SCORE FOUND : -1897.260
## abayes_UFBboot_run4.log:BEST SCORE FOUND : -1897.260
## abayes_UFBboot_run1.log:BEST SCORE FOUND : -1897.260
## abayes_UFBboot_run5.log:BEST SCORE FOUND : -1897.260

No wonder, in such a small dataset, all searches find exactly the same best tree. However, this will be less likely the case as the number of sequences in the dataset increases, as the size of the tree-spece increases factorialy with each new sequence.

figtree best_PGM_IQT_abayes_UFBboot_run5_GTR2+FO.treefile &

Figure 20 displays the best ML pan-genome tree, again using figtree. The nodes are colored according to the legend. The first value corresponds to approximate Bayes branch support values and second ones to the UFBoot values described in the manual.

–>

If you are running the tutorials from a Docker exit the container typing $exit$
If you are running the tutorials from the get_phylomarkers Conda environment, deactivate the environment typing $deactivate$, returning to the base environment.

That’s it, enjoy!

5 Developers

The code is developed and maintained by Pablo Vinuesa at CCG-UNAM, Mexico and Bruno Contreras-Moreira at EEAD-CSIC, Spain. It is released to the public domain under the GNU GPLv3 license.

6 Citations

7 Source

Both software packages are freely distributed through their respective GitHub repositories:
- get_homologues - GitHub repo
- get_phylomarkers - GitHub repo

8 Docker images for GET_HOMOLOGUES and GET_PHYLOMARKERS

Both software packages can be run as docker images freely distributed through DockerHub:
- get_homologues+get_phylomarkers docker image
- get_phylomarkers docker image

9 Documentation

The GET_HOMOLOGUES and GET_PHYLOMARKERS packages have extensive documentation and tutorials online
- get_homologues manual
- get_phylomarkers manual

10 Acknowledgements

10.1 Personal

We thank Alfredo J. Hernández and Víctor del Moral at CCG-UNAM for technical support.

10.2 Funding

We gratefully acknowledge the funding provided by DGAPA-PAPIIT/UNAM (grants IN201806-2, IN211814, IN206318, IN209321, and IN216424) and CONACyT-Mexico (grants P1-60071, 179133, FC-2015-2-879, A1-S-11242, and CBF-2025-I-1106) to Pablo Vinuesa, as well as the Fundación ARAID,Consejo Superior de Investigaciones Científicas (grant 200720I038 and Spanish MINECO (AGL2013-48756-R) to Bruno Contreras-Moreira.

Pangenomic and phylogenomic analyses of pIncC plasmids using GET_HOMOLOGUES and GET_PHYLOMARKERS

Pablo Vinuesa, CCG-UNAM (https://www.ccg.unam.mx/~vinuesa/); @pvinmex

2025-08-07