This tutorial will show how to use the get_homologues (Contreras-Moreira & Vinuesa, 2013) and get_phylomarkers (Vinuesa et al. 2018) software packages to perform a thorough pan-genomic and phylogenomic analyses of a small dataset of conjugative multiresistance plasmids of the pIncC (formerly IncA/C) incompatibility group of great concern in the clinical setting Ambrose et al. 2018. This is an updated version of a similar tutorial published by Vinuesa & Contreras-Moreira 2015.
Both software packages are freely distributed through their respective GitHub repositories:
get_homologues - GitHub repo can be installed as a BioConda package and run in a conda environment.
Both software packages can be conveniently installed and run as containerized Docker images available from:
Both software packages can be run on any platform using the Docker container, or directly on UNIX and GNU/Linux servers.
In any case, you will have to work using Linux \(Shell\) commands. Therefore it is important that you have some proficiency working on the command line.
Here is a complete tutorial on biocomputing in the GNU/Linux environment, with lots of examples.
The whole exercises will be run on buluc or tepeu You need to log into the server with the \(ssh\) command, available in any Linux machine or from the \(mobaXterm\) package that you should have installed on your Windows machine.
type your password
ssh -X vinuesa@*.*.*.* vinuesa@*.*.*.*'s password: Last login: Wed Nov 11 17:46:45 2020 from akbal.*.*.* ooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo Bienvenidos a BULUC ... [vinuesa@buluc ~]$
The GET_PHYLOMARKERS distribution provides a test_sequences/ directory which holds the subdirectories core_genome/, pan_genome/ and pIncAC/.
The pIncAC/ directory holds the source *.gbk GenBank files. This directory has a README.txt file that briefly describes the GenBank files.
#0. Welcome to buluc|chaac|tepeu
# processors available
awk '/^proc/ {n++} END{print n}' /proc/cpuinfo
# disk space on file system
df -h
# check system load
htop
#>>>>> 1) mkdir sesion7_pangenomics, cd into it and create symbolic links to the source GenBank files to be processed (12 pIncA/C plasdmis)
# The paths provided below will only work for users running in their home account on buluc.
# cd to your working directory and
mkdir sesion7_pangenomics && cd sesion7_pangenomics
# save path in variable top_dir
top_dir=$(pwd)
echo $top_dir
# fetch the data from TIB-filo's GitHub repo
wget -c https://github.com/vinuesa/TIB-filoinfo/raw/master/docs/sesion7_pangenomica_GET_HOMOLOGUES/data/pIncAC.tgz
tar -xzf pIncAC.tgz && rm pIncAC.tgz && cd pIncAC
# list files
ls -l
# how many gbk files do we have?
ls *gbk | wc -l
# peak at one of them using head and tail
head -50 _Aeromonas_hydrophila_plasmid_pRA1_NC_012885.gbk
tail _Aeromonas_hydrophila_plasmid_pRA1_NC_012885.gbk
# How many CDSs do these gbk files have? - identify the genome with the lowest number of CDSs
grep -c ' CDS ' *gbk | sed 's/:/\t/' | sort -nk2
# save the absolute path of this directory for easy reference later on
gbk_dir=$(pwd)
# return to the parent directory and save its full path to the variable top_dir for easy reference
# cd ..
cd $top_dir
# where am I?
pwd
# list; you should see the pIncAC/ directory holding the *gbk files
ls
In this section we will compute consensus core- and pan-genomes of the pIncA/C dataset as the intersection of clusters computed by the tree clustering algorithms implemented in get_homologues, as shown in the flowchart below:
GET_HOMOLOGUES is a sophisticated piece of software, with several external dependencies and auxiliary scripts, as depicted in Fig. 1.
To run GET_HOMOLOGUES on chaac we need to start the corresponding conda environment
# Activate conda's get_homologues environment
conda activate get_homologues
To make sure that get_homlogues.pl is in PATH and that the kit is complete type the following command
get_homologues.pl -v
# 0. make sure get_homlogues.pl in PATH and that the kit is complete
/export/apps/get_homologues/get_homologues.pl version 26022020
Program written by Bruno Contreras-Moreira (1) and Pablo Vinuesa (2).
1: http://www.eead.csic.es/compbio (Estacion Experimental Aula Dei/CSIC/Fundacion ARAID, Spain)
2: http://www.ccg.unam.mx/~vinuesa (Center for Genomic Sciences, UNAM, Mexico)
Primary citations:
Contreras-Moreira B, Vinuesa P (2013) GET_HOMOLOGUES, a versatile software package for scalable and
robust microbial pangenome analysis. Appl Environ Microbiol 79(24):7696-701. (PubMed:24096415)
Vinuesa P and Contreras-Moreira B (2015) Robust Identification of Orthologues and Paralogues for
Microbial Pan-Genomics Using GET_HOMOLOGUES: A Case Study of pIncA/C Plasmids. In Bacterial Pangenomics,
Methods in Molecular Biology Volume 1231, 203-232, edited by A Mengoni, M Galardini and M Fondi.
This software employs code, binaries and data from different authors, please cite them accordingly:
OrthoMCL v1.4 (www.orthomcl.org , PubMed:12952885)
COGtriangles v2.1 (sourceforge.net/projects/cogtriangles , PubMed=20439257)
NCBI Blast-2.2 (blast.ncbi.nlm.nih.gov , PubMed=9254694,20003500)
Bioperl v1.5.2 (www.bioperl.org , PubMed=12368254)
HMMER 3.1b2 (hmmer.org)
Pfam (http://pfam.xfam.org , PubMed=26673716)
Diamond v0.8.25 (https://github.com/bbuchfink/diamond, PubMed=25402007)
Checking required binaries and data sources, all set in phyTools.pm :
EXE_BLASTP : OK (path:/export/apps/get_homologues-x86_64-20200226/bin/ncbi-blast-2.8.1+/bin/blastp)
EXE_BLASTN : OK (path:/export/apps/get_homologues-x86_64-20200226/bin/ncbi-blast-2.8.1+/bin/blastn)
EXE_FORMATDB : OK (path:/export/apps/get_homologues-x86_64-20200226/bin/ncbi-blast-2.8.1+/bin/makeblastdb)
EXE_MCL : OK (path:/export/apps/get_homologues-x86_64-20200226//bin/mcl-14-137/src/shmcl/mcl)
EXE_MAKEHASH : OK (path:/export/apps/get_homologues-x86_64-20200226//bin/COGsoft/COGmakehash/COGmakehash )
EXE_READBLAST : OK (path:/export/apps/get_homologues-x86_64-20200226//bin/COGsoft/COGreadblast/COGreadblast )
EXE_COGLSE : OK (path:/export/apps/get_homologues-x86_64-20200226//bin/COGsoft/COGlse/COGlse )
EXE_COGTRI : OK (path:/export/apps/get_homologues-x86_64-20200226//bin/COGsoft/COGtriangles/COGtriangles )
EXE_HMMPFAM : OK (/export/apps/get_homologues-x86_64-20200226//bin/hmmer-3.1b2/binaries/hmmscan --noali --acc --cut_ga /export/apps/get_homologues-x86_64-20200226/db/Pfam-A.hmm)
EXE_DMNDP : OK (path:/export/apps/get_homologues-x86_64-20200226/bin/diamond-0.8.25/diamond blastp)
EXE_DMNFT : OK (path:/export/apps/get_homologues-x86_64-20200226/bin/diamond-0.8.25/diamond makedb)
EXE_INPARA : OK (path:/export/apps/get_homologues-x86_64-20200226/_cluster_makeInparalog.pl)
EXE_ORTHO : OK (path:/export/apps/get_homologues-x86_64-20200226/_cluster_makeOrtholog.pl)
EXE_HOMOL : OK (path:/export/apps/get_homologues-x86_64-20200226/_cluster_makeHomolog.pl)
EXE_SPLITBLAST : OK (path:/export/apps/get_homologues-x86_64-20200226/_split_blast.pl)
EXE_SPLITHMMPFAM : OK (path:/export/apps/get_homologues-x86_64-20200226/_split_hmmscan.pl)
The main pipeline depicted above (Fig. 1) is run with the core \(script\) get_homologues.pl.
The options of the \(script\) are displayed by simply calling it with any of the following commands
get_homologues.pl
get_homologues.pl -h
usage: /export/apps/get_homologues/get_homologues.pl [options]
-h this message
-v print version, credits and checks installation
-d directory with input FASTA files ( .faa / .fna ), (overrides -i,
GenBank files ( .gbk ), 1 per genome, or a subdirectory use of pre-clustered sequences
( subdir.clusters / subdir_ ) with pre-clustered sequences ignores -c, -g)
( .faa / .fna ); allows for new files to be added later;
creates output folder named 'directory_homologues'
-i input amino acid FASTA file with [taxon names] in headers, (required unless -d is set)
creates output folder named 'file_homologues'
Optional parameters:
-o only run BLAST/Pfam searches and exit (useful to pre-compute searches)
-c report genome composition analysis (follows order in -I file if enforced,
ignores -r,-t,-e)
-R set random seed for genome composition analysis (optional, requires -c, example -R 1234,
required for mixing -c with -c -a runs)
-m runmode [local|cluster] (default local)
-n nb of threads for BLAST/HMMER/MCL in 'local' runmode (default=2)
-I file with .faa/.gbk files in -d to be included (takes all by default, requires -d)
Algorithms instead of default bidirectional best-hits (BDBH):
-G use COGtriangle algorithm (COGS, PubMed=20439257) (requires 3+ genomes|taxa)
-M use orthoMCL algorithm (OMCL, PubMed=12952885)
Options that control sequence similarity searches:
-X use diamond instead of blastp (optional, set threads with -n)
-C min %coverage in BLAST pairwise alignments (range [1-100],default=75)
-E max E-value (default=1e-05,max=0.01)
-D require equal Pfam domain composition (best with -m cluster or -n threads)
when defining similarity-based orthology
-S min %sequence identity in BLAST query/subj pairs (range [1-100],default=1 [BDBH|OMCL])
-N min BLAST neighborhood correlation PubMed=18475320 (range [0,1],default=0 [BDBH|OMCL])
-b compile core-genome with minimum BLAST searches (ignores -c [BDBH])
Options that control clustering:
-t report sequence clusters including at least t taxa (default t=numberOfTaxa,
t=0 reports all clusters [OMCL|COGS])
-a report clusters of sequence features in GenBank files (requires -d and .gbk files,
instead of default 'CDS' GenBank features example -a 'tRNA,rRNA',
NOTE: uses blastn instead of blastp,
ignores -g,-D)
-g report clusters of intergenic sequences flanked by ORFs (requires -d and .gbk files)
in addition to default 'CDS' clusters
-f filter by %length difference within clusters (range [1-100], by default sequence
length is not checked)
-r reference proteome .faa/.gbk file (by default takes file with
least sequences; with BDBH sets
first taxa to start adding genes)
-e exclude clusters with inparalogues (by default inparalogues are
included)
-x allow sequences in multiple COG clusters (by default sequences are allocated
to single clusters [COGS])
-F orthoMCL inflation value (range [1-5], default=1.5 [OMCL])
-A calculate average identity of clustered sequences, (optional, creates tab-separated matrix,
by default uses blastp results but can use blastn with -a recommended with -t 0 [OMCL|COGS])
-P calculate percentage of conserved proteins (POCP), (optional, creates tab-separated matrix,
by default uses blastp results but can use blastn with -a recommended with -t 0 [OMCL|COGS])
-z add soft-core to genome composition analysis (optional, requires -c [OMCL|COGS])
This program uses BLAST (and optionally HMMER/Pfam) to define clusters of 'orthologous'
genomic sequences and pan/core-genome gene sets. Several algorithms are available
and search parameters are customizable. It is designed to process (in a HPC computer
cluster) files contained in a directory (-d), so that new .faa/.gbk files can be added
while conserving previous BLAST results. In general the program will try to re-use
previous results when run with the same input directory.
For the details of each option, please have a look a the extensive get_homologues manual. Here we can review only some of the most important ones.
GET_HOMOLOGUES allows users to run 3 different clustering algorithms (\(BDBH\), \(COGtriangles\) and \(OMCL\)) to define homologous gene families, as defined in Table 1.
name | option | comment |
---|---|---|
BDBH | default | Starting from a reference genome, keep adding genomes stepwise, while storing the sequence clusters that result of merging the best bidirectional best hits |
COGS | -G | Merges triangles of inter-genomic symmetrical best matches, as described in PubMed=20439257 |
OMCL | -M | OrthoMCL v1.4, uses the Markov Cluster Algorithm to group sequences, with inflation (-F) controlling cluster granularity, as described in PubMed=12952885 |
To learn all details around these algorithms, please read the online GET_HOMOLOGUES manual
The only required option is either -i, used to choose an input file, or preferably -d instead, which indicates an input folder.
In previous versions only files with extensions .fa[a] and .gb[k] were considered when parsing the -d directory. Currently, GZIP- or BZIP2-compressed input files are also accepted!
By using .faa input files in theory you might only calculate clusters of protein sequences. In contrast, the advantage of using .gbk files is that you obtain both nucleotide and protein clusters. If both types of input files are combined, only protein clusters will be produced. However, if each input .faa file has a twin .fna file in place, containing the corresponding nucleotide sequences in the same order, the program will attempt to produce the corresponding clusters of nucleotide sequences. The possible input file combinations are summarized in Table 1:
input file extensions | output clusters |
---|---|
.gbk | amino acid + DNA sequence |
.faa | amino acid sequence |
.gbk & .faa | amino acid sequence |
.faa & .fna | amino acid + DNA sequence |
.gbk & .faa & .fna | amino acid + DNA sequence |
In summary, the use of an input folder or directory (-d) containing GenBank files is recommended, as it allows for new files to be added there in the future, reducing the computing required for updated analyses. For instance, if a user does a first analysis with 5 input genomes today, it is possible to check how the resulting clusters would change when adding an extra 10 genomes tomorrow, by copying these new 10 .faa / .gbk input files to the pre-existing -d folder, so that all previous BLAST searches are re-used.
To run \(get\_homologues.pl\ -d\) you need to be in the directory on top of the one holding the GenBank files. In our case this is the one saved in \(\$top\_dir\), holding the \(pIncAC\) directory with the *gbk files.
The BDBH algorithm implemented in GET_HOMOLOGUES is a heuristic version that does NOT cluster the results of the all-against-all blast data, just those against the smallest genome (default) or a (\(-R\ refGenome\)), as shown in Fig. 2. Therefore it is not suitable for computing pan-genomes. Use it only for core-genome calculations!
Let’s instruct get_homologues.pl to compute the BDBH clusters for the 12 genomes. Make sure you are in \(\$top\_dir\) and issue the following command, which will compute a core-genome based on the BDBH (default) clustering algorithm, excluding inparaloges (-e)
get_homologues.pl -d pIncAC -t 12 -e -n 2
#Note: the command above is equivalent to get_homologues.pl -d
pIncAC -e -n 2
, as get_homologues.pl will by default use
-t = number_of_genomes available in the \(\$gbk\_dir\)
# 1. run GET_HOMOLOGUES to compute the set of core-genome clusters using the BDBH,
# get_homologues.pl -d pIncAC &> log.BDBH_def &
get_homologues.pl -d pIncAC -t 12 -e -n 2 # BDBH clusters containing sequences for the 12 genomes (-t 12)
# excluding those with inparalogues (-e):
# use 2 processors for the job
# This will take ~ 3 minutes
# version 26022020
# results_directory=/home/vinuesa/pangenomics/pIncAC_homologues
# parameters: MAXEVALUEBLASTSEARCH=0.01 MAXPFAMSEQS=250 BATCHSIZE=100 KEEPSCNDHSPS=1
# diamond job:0
# checking input files...
# Salmonella_enterica_33676_pIncAC.gbk 185
# Salmonella_enterica_YU39_pIncAC.gbk 178
# Salmonella_enterica_pAM04528.gbk 180
# Salmonella_enterica_sv_Kentucky.gbk 164
# _Aeromonas_hydrophila_plasmid_pRA1_NC_012885.gbk 158
# _Escherichia_coli_UMNK88_plasmid_pUMNK88_NC_017645.gbk 173
# _Escherichia_coli_plasmid_pAPEC1990_61_NC_019066.gbk 172
# _Escherichia_coli_plasmid_pAR060302_NC_012692.gbk 189
# _Escherichia_coli_plasmid_pNDM-1_Dok01_NC_018994.gbk 224
# _Escherichia_coli_plasmid_pPG010208_NC_019065.gbk 148
# _Escherichia_coli_plasmid_peH4H_NC_012690.gbk 173
# _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk 137
# 12 genomes, 2081 sequences
# taxa considered = 12 sequences = 2081 residues = 545474 MIN_BITSCORE_SIM = 16.9
# mask=KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_ (_algBDBH)
# running makeblastdb with /home/vinuesa/pangenomics/pIncAC_homologues/Salmonella_enterica_33676_pIncAC.gbk.fasta
# running makeblastdb with /home/vinuesa/pangenomics/pIncAC_homologues/Salmonella_enterica_YU39_pIncAC.gbk.fasta
# ...
# running BLAST searches ...
# done
# ...
# concatenating and sorting BLAST/DIAMOND results...
# sorting _Salmonella_enterica_33676_pIncAC.gbk results (0.097MB)
# sorting _Salmonella_enterica_YU39_pIncAC.gbk results (0.091MB)
# sorting _Salmonella_enterica_pAM04528.gbk results (0.1MB)
# clustering inparalogues in _Escherichia_coli_plasmid_peH4H_NC_012690.gbk
# ...
# done
# parsing blast result! (/home/vinuesa/cursos/pangenomics_PDCBq_Nov20/test_sequences/pIncAC_homologues/tmp/all.blast , 1.1MB)
# parsing file finished
# creating indexes, this might take some time (lines=2.25e+04) ...
# construct_taxa_indexes: number of taxa found = 12
# number of file addresses/BLAST queries = 2.1e+03
# clustering orthologous sequences
# clustering inparalogues in _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk (reference)
# 0 sequences
# clustering inparalogues in Salmonella_enterica_33676_pIncAC.gbk
# 4 sequences
# finding BDBHs between _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk and Salmonella_enterica_33676_pIncAC.gbk
# 59 sequences
# clustering inparalogues in Salmonella_enterica_YU39_pIncAC.gbk
# 3 sequences
# finding BDBHs between _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk and Salmonella_enterica_YU39_pIncAC.gbk
# 84 sequences
# ...
# looking for valid ORF clusters (n_of_taxa=12)...
# number_of_clusters = 33
# cluster_list = pIncAC_homologues/KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_.cluster_list
# cluster_directory = pIncAC_homologues/KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_
# runtime: 218 wallclock secs ( 4.40 usr 0.75 sys + 174.83 cusr 17.86 csys = 197.84 CPU)
# RAM use: 33.7 MB
The results of the previous and following get_homologue.pl runs can be found in the pIncAC_homologues/ directory, which includes the all-against-all blast results.
ls -1d *
pIncAC
pIncAC_homologues
We will explore its contents later on.
Let’s now compute the pan-genome clusters with the COGtriangles and OMCL algorithms, which will use the blastp results already available in pIncAC_homologues/ from the previous BDBH run.
Run with \(-t\ 0\) to compute clusters of all sizes, which is required for the pan-genome analysis.
get_homologues.pl -d pIncAC -G -t 0
get_homologues.pl -d pIncAC -G -t 0 # COGtriangles, computing clusters of all sizes (-t 0)
# finds 408 clusters in 14 wallclock secs, as it recycles
# the blast results of the previous run.
# version 26022020
# results_directory=/home/vinuesa/pangenomics/pIncAC_homologues
# parameters: MAXEVALUEBLASTSEARCH=0.01 MAXPFAMSEQS=250 BATCHSIZE=100 KEEPSCNDHSPS=1
# diamond job:0
# checking input files...
# Salmonella_enterica_33676_pIncAC.gbk 185
# Salmonella_enterica_YU39_pIncAC.gbk 178
# Salmonella_enterica_pAM04528.gbk 180
# Salmonella_enterica_sv_Kentucky.gbk 164
# _Aeromonas_hydrophila_plasmid_pRA1_NC_012885.gbk 158
# _Escherichia_coli_UMNK88_plasmid_pUMNK88_NC_017645.gbk 173
# _Escherichia_coli_plasmid_pAPEC1990_61_NC_019066.gbk 172
# _Escherichia_coli_plasmid_pAR060302_NC_012692.gbk 189
# _Escherichia_coli_plasmid_pNDM-1_Dok01_NC_018994.gbk 224
# _Escherichia_coli_plasmid_pPG010208_NC_019065.gbk 148
# _Escherichia_coli_plasmid_peH4H_NC_012690.gbk 173
# _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk 137
# 12 genomes, 2081 sequences
# taxa considered = 12 sequences = 2081 residues = 545474 MIN_BITSCORE_SIM = 16.9
# mask=KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_ (_algCOG)
# skipped genome parsing (pIncAC_homologues/tmp/selected.genomes)
# running BLAST searches ...
# done
# parsing file (COG)
# parsing file (COG) finished
# creating indexes, this might take some time (lines=2.25e+04) ...
# construct_taxa_indexes: number of taxa found = 12
# number of file addresses/BLAST queries = 2.1e+03
# clustering orthologous sequences
# checking lineage-specific expansions
# making COGs
# prunning COGs
# done
# add_unmatched_singletons : 0 sequences, 0 taxa
# looking for valid ORF clusters (n_of_taxa=0)...
# number_of_clusters = 408
# cluster_list = pIncAC_homologues/KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_.cluster_list
# cluster_directory = pIncAC_homologues/KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_
# runtime: 15 wallclock secs ( 0.81 usr 0.14 sys + 2.46 cusr 0.58 csys = 3.99 CPU)
# RAM use: 28.3 MB
Run with \(-t\ 0\) to compute clusters of all sizes, which is required for the pan-genome analysis
Option \(-c\) for generating pan-genome composition tables to be used for plotting the core decay and pan-genome growth curveas as a function of increasing number of genomes.
get_homologues.pl -d pIncAC/ -M -t 0 -c
get_homologues.pl -d pIncAC/ -M -t 0 -c # OMCL, finds 393 clusters in 5 wallclock secs.
# ...
# identifying inparalogs in _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk
# 0 sequences
# running MCL (inflation=1.5) ...
# running MCL finished
# find_OMCL_clusters: parsing clusters (/home/vinuesa/pangenomics/pIncAC_homologues/tmp/all_ortho.mcl)
# adding _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk: core=32 pan=336
# pan-genome (number of genes, can be plotted with plot_pancore_matrix.pl)
# file=pIncAC_homologues/pan_genome_algOMCL.tab
genomes mean stddev | samples
0 167 22 | 172 170 183 137 164 217 170 169 137 154
1 215 22 | 215 206 184 203 239 262 229 208 200 201
2 242 15 | 246 245 236 247 258 269 242 243 216 220
3 273 22 | 274 281 292 282 301 288 270 262 217 263
4 291 21 | 295 322 294 288 301 312 307 268 251 268
5 307 15 | 300 329 302 307 303 317 315 315 270 309
6 316 17 | 302 335 307 307 303 336 334 323 282 330
7 317 18 | 303 336 307 312 303 340 335 324 282 331
8 324 17 | 340 336 317 320 309 341 335 324 282 331
9 330 10 | 343 337 336 321 312 341 335 331 314 331
10 334 9 | 343 345 336 321 331 341 343 332 317 331
11 341 4 | 348 346 339 340 338 341 344 340 335 336
# core-genome (number of genes, can be plotted with plot_pancore_matrix.pl)
# file=pIncAC_homologues/core_genome_algOMCL.tab
genomes mean stddev | samples
0 167 22 | 172 170 183 137 164 217 170 169 137 154
1 112 22 | 123 106 165 97 134 116 89 90 94 109
2 93 11 | 90 83 105 94 117 81 83 88 92 95
3 75 8 | 74 62 83 81 74 74 62 72 91 75
4 65 13 | 55 58 78 57 72 53 43 69 89 74
5 57 13 | 53 57 75 49 69 53 35 42 69 72
6 49 10 | 53 42 54 49 61 41 35 34 66 51
7 44 8 | 53 35 46 43 61 40 35 34 41 51
8 42 6 | 51 34 44 41 45 40 35 34 40 51
9 38 5 | 50 34 33 41 40 39 34 34 35 43
10 36 4 | 42 32 33 40 32 39 32 34 34 42
11 32 0 | 32 32 32 32 32 32 32 32 32 32
# add_unmatched_singletons : 117 sequences, 12 taxa
# looking for valid ORF clusters (n_of_taxa=0)...
# number_of_clusters = 393
# cluster_list = pIncAC_homologues/KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_.cluster_list
# cluster_directory = pIncAC_homologues/KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_
# runtime: 3 wallclock secs ( 1.82 usr 0.13 sys + 1.79 cusr 0.26 csys = 4.00 CPU)
# RAM use: 30.3 MB
Remember that for the OMCL clustering we ran the following code:
get_homologues.pl -d pIncAC/ -M -t 0 -c
The \(-c\) flag tells get_homologues.pl to compute the core- and pan-genome composition data by randomly sampling 10 times pairs of genomes to compute the observed numbers of common and new genes. The results are written to the core_genome_algOMCL.tab and pan_genome_algOMCL.tab files, found in the pIncAC_homologues results directory. These tables are used as input for plot_pancore_matrix.pl, which renders the graphics and fits different functions to the data.
plot_pancore_matrix.pl -h
[options]:
-h this message
-i input .tab data file (required, generated by get_homologues.pl)
-f type of fit [pan|core_Tettelin|core_Willenbrock|core_both]
(default: core_Tettelin, PubMed:16172379,18088402)
-F font scale for fitted formulas (optional, default=0.8)
-a save snapshots for animations in dir (optional, example: -a animation)
cd pIncAC_homologues/
# save the absolute path to this directory in the blast_dir variable
blast_dir=$(pwd)
# these are the relevant input files: core_genome_algOMCL.tab and pan_genome_algOMCL.tab
ls *.tab
# plot core-decay curves (Tettelin & Willenbrock fits)
plot_pancore_matrix.pl -i core_genome_algOMCL.tab -f core_both
# plot pan-genome growth curves (Tettelin's exponential fit)
plot_pancore_matrix.pl -i pan_genome_algOMCL.tab -f pan
# if you logged in with ssh -X, you may use eog (Eye of GNOME) or firefox to display the png files
# Note: the & at the end of the command runs it in the background
eog core_genome_algOMCL.tab_core_both.png &
eog pan_genome_algOMCL.tab_pan.png &
Figure 3. The core-genome decay curves for 12 pIncA/C plasmids
Figure 4. The pan-genome growth curve for 12 pIncA/C plasmids
Based on the output shown above, what type of pan-genome (closed or open) do you think best represents this collection of pIncA/C plasmids?
Before exploring the contents of the get_homologues.pl runs stored in the [pIncAC]_homologues directory, a brief note on how to setup multiple sequential get_homologues.pl runs in non-interactive mode. This is the best option to run large datasets which may take hours or days to finish.
This is easy. just type the get_homologues.pl invocations that you would like to run into a text file (\(script\)), which i have named run_get_hom.batch holding the code shown below:
get_homologues.pl -d pIncAC -t 12 -e -n 2 &> log.BDBH get_homologues.pl -d pIncAC -G -t 0 &> log.COG get_homologues.pl -d pIncAC/ -M -t 0 -c &> log.OMCL
The run_get_hom.batch \(script\) would run exactly the same analyses that we have run interactively, but in a non-interactive mode.
Note that the output that each get_homologues.pl invocation would normally print to STDOUT (the screen) is redirected to a file called log.[ALGORITHM], making use of the “&>” syntax which instructs to redirect STDOUT and STDERR to the file indicated after the symbols.
Now we can call the batch script in the following manner:
nohup bash run_get_hom.batch &> /dev/null
&
The \(nohup\) (no hang-up) command combined with the strange-looking “&> /dev/null &” tells the server to run the script in the background and to make the run independent of the login session. This means that after issuing the command, the script runs detached from our terminal. We can shutdown our computer and the process will continue executing on the server.
After some time, we can check the results of the runs by listing the files and directories as shown:ls -1 log.BDBH log.COG log.OMCL pIncAC pIncAC_homologues run_get_hom.batch
Another useful command to know is tail -f log.BDBH
,
which can be called on any of the logfiles while they are being
written to disk, showing their last 10 lines as they’re appended to the
tail of the file in real time. Nice!
As previously mentioned, GET_HOMOLOGUES writes the results in the **[pIncAC]_homologues/ directory**, stored in the \(\$blast\_dir\) variable.
Therein you will find the following files and directories
The run-specific output directories are named using the following components to make them unique for each run configuration:
BuchaphCc_f0_alltaxa_algBDBH_e0_ | | | | | | | | | -e option was not used (inparalogues are in) | | | the clustering algorithm is BDBD (default) | | all clusters contain at least 1 sequence from each taxa (default -t behavior) | -f option not used (no length filtering) reference proteome
Let’s explore the results directory pIncAC_homologues/ using basic \(Shell\) commands:
# assumes that we are in $blast_dir (the results directory pIncAC_homologues/)
cd $blast_dir
# list
ls | less
ls *faa *fna *fasta
ls *faa *fna *fasta | wc
head -5 Salmonella_enterica_YU39_pIncAC.gbk.f*a
ls Salmonella_enterica_YU39_pIncAC.gbk.f*a.*
# find the subdirectories holding the run-specific clusters of homologous sequences (gene families)
find . -type d
# exlore results (clusters) in the KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_
# which hods the BDBH results for the 12 genomes (-t 12), without inparalogues (-e)
ls
ls *fna | wc -l
ls *faa | wc -l
grep -c '>' *fna
grep -c '>' *faa
GET_HOMOLOGUES is unique among available pangenomics software in many respects. One of them is its capacity to compute consensus core-genome clusters from the solutions generated by the three clustering algorithms it implements, and consensus pan-genome clusters from COGtriangles and OMCL clustering results, as demonstrated below.
compare_clusters.pl -h
[options]:
-h this message
-d comma-separated names of cluster directories (min 1 required, example -d dir1,dir2)
-o output directory (required, intersection cluster files are copied there)
-n use nucleotide sequence .fna clusters (optional, uses .faa protein sequences by default)
-r take first cluster dir as reference set, which might contain (optional, by default cluster dirs are expected
a single representative sequence per cluster to be derived from the same taxa; overrides -t,-I)
-s use only clusters with syntenic genes (optional, parses neighbours in FASTA headers)
-t use only clusters with single-copy orthologues from taxa >= t (optional, default takes all intersection clusters; example -t 10)
-I produce clusters with single-copy seqs from ALL taxa in file (optional, example -I include_list; overrides -t)
-m produce intersection pangenome matrices (optional, ideally expects cluster directories generated
with get_homologues.pl -t 0)
-x produce cluster report in OrthoXML format (optional)
-T produce parsimony-based pangenomic tree
# Compute consensus core-genome clusters using compare_clusters.pl of the GET_HOMOLOGUES suite.
# Note that we will run the compare_clusters.pl script twice, once with the -n flag
# (to compute the consesus clusters at the DNA level, producing *.fna files)
# and a second instance without the flag, to get the protein clusters (*.faa files)
# assumes we are in $blast_dir
cd $blast_dir
# find the subdirectories holding the run-specific results
find . -type d
# Compute the core genome; use -t 12 to keep only clusters containing a copy of the gene in each of the 12 genomes analyzed
compare_clusters.pl -d ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_,\
./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_,\
./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_ -o core_BCM -t 12 -n
# repeat command, without -n, to compute the clusters at the protein level (*faa files)
compare_clusters.pl -d ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_,\
./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_,\
./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_ -o core_BCM -t 12
# Lets explore the core_BCM directory
cd core_BCM
ls *fna | wc -l
ls *faa | wc -l
grep -c '>' *faa
# confirm that all 31 clusters contain only one sequence from each plasmid/proteome
grep '>' *faa | cut -d\| -f2,3 | sort | uniq -c
# display the Venn diagram of the consensus core-genome clusters
ls *svg
eog venn_t12.svg &
Figure 5. Venn diagram depicting the consensus core-genome of 31 genes for the 12 input pIncA/C sequences.
To compute a consensus pan-genome, only the \(COGtriangles\) and \(OMCL\) algorithms should be used! as they don’t require a reference sequence.
Run compare_clusters.pl with the \(-t\ 0\) option, to consider clusters of all occupancy sizes (from singletons to multigene families), as demonstrated below.
# Compute consensus pan-genome clusters and matrix using compare_clusters.pl of the
# GET_HOMOLOGUES suite.
# Note that we run the compare_clusters.pl script twice, once with the -n and -m flags
# (to compute the consesus clusters at the DNA level, *.fna files
# and the pan-genome matrix) and a second instance without these flags,
# to get the protein clusters (*.faa files).
# Note also that we exclude the directory holding the BDBH clusters,
# as these are not suitable to compute a proper pan-genome matrix, since the
# BDBH clusters always contain the reference sequence, missing those that do not contain it.
cd $blast_dir
# get the consensus COG-OMCL pan-genome clusters with -t 0 (all sizes) and -m to write pan-genome matrix
compare_clusters.pl -d ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_,\
./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_ -o pan_CM -m -n -T
# repeat without -m and -n, to generate the protein clustes
compare_clusters.pl -d ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_,\
./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_ -o pan_CM
# Explore the pan-genome clusters
cd pan_CM
# how many clusters
ls *faa | wc -l
# print a sorted list of cluster sizes (gene contents of each cluster)
grep -c '>' *faa | sed 's/:/\t/' | sort -nk2
grep -c '>' *faa | sed 's/:/\t/' | sort -nrk2 > pan-genome_cluster_compositon_stats.tsv
# lets explore the structure of the pan-genome
grep -c '>' *faa | sed 's/:/\t/' | cut -f2 | sort -n | uniq -c | awk 'BEGIN{OFS="\t"; print "size\tcount"}{print $2,$1}'
# Lets explore the structure of the pan-genome parsimony tree pangenome_matrix_t0.phylip.ph
figtree pangenome_matrix_t0.phylip.ph &
/---+ Escherichia coli plasmid pAPEC1990 61 NC 019066 gbk...
/--+
| \+ Escherichia coli UMNK88 plasmid pUMNK88 NC 017645 gbk...
|
/+ /---+ Escherichia coli plasmid pPG010208 NC 019065 gbk...
|| |
|\--+ /------------+ Aeromonas hydrophila plasmid pRA1 NC 012885 gbk...
| | |
| \------+ /------------+ Klebsiella pneumoniae plasmid pNDM KN NC 019153 gbk...
| | /-----+
/--------------+ \--+ \--------------+ Escherichia coli plasmid pNDM 1 Dok01 NC 018994 gbk...
| | |
| | \-------+ Salmonella enterica sv Kentucky gbk...
| |
| | /---------+ Escherichia coli plasmid peH4H NC 012690 gbk...
| | /--+
=+ \--+ \ Escherichia coli plasmid pAR060302 NC 012692 gbk...
| |
| \+ Salmonella enterica pAM04528 gbk...
|
+-------+ Salmonella enterica YU39 pIncAC gbk...
|
\--------------+ Salmonella enterica 33676 pIncAC gbk...
|------|------|-----|------|------|------|-----|---
0 25 50 75 100 125 150 175
substitutions/site
# display the Venn diagram of the consensus pan-genome clusters
ls *svg
eog venn_t0.svg &
Figure 6. Venn diagram depicting the consensus pan-genome clusters for the 12 input pIncA/C sequences contains 373 cluster. The corresponding pan-genome matrix is provided in the test_sequences/pan_genome/ directory
Graphical analysis of pangenome’s structure and parsing pan-genome tables to identify lineage-specific genes or family expansions can be conveniently done with the auxiliary \(script\) parse_pangenome_matrix.pl, as shown below.
parse_pangenome_matrix.pl -h
parse_pangenome_matrix.pl
[options]:
-h this message
-m input pangenome matrix .tab (required, made by compare_clusters.pl)
-s report cloud,shell,soft core and core clusters (optional, creates plot if R is installed)
-l list taxa names present in clusters reported in -m matrix (optional, recommended before using -p option)
-x produce matrix of intersection pangenome clusters (optional, requires -s)
-I use only taxon names included in file (optional, ignores -A,-g,-e)
-A file with taxon names (.faa,.gbk,.nucl files) of group A (optional, example -A clade_list_pathogenic.txt)
-B file with taxon names (.faa,.gbk,.nucl files) of group B (optional, example -B clade_list_symbiotic.txt)
-a find genes/clusters which are absent in B (optional, requires -B)
-g find genes/clusters present in A which are absent in B (optional, requires -A & -B)
-e find gene family expansions in A with respect to B (optional, requires -A & -B)
-S skip clusters with occupancy <S (optional, requires -x/-a/-g, example -S 2)
-P percentage of genomes that must comply presence/absence (optional, default=100, requires -g)
-p plot pangenes on the genome map of this group A taxon (optional, example -p 'Escherichia coli K12',
requires -g, -A & -B, only works with clusters
derived from input files in GenBank format)
Use option -s to report and graphically depict cloud, shell, soft core and core clusters
parse_pangenome_matrix.pl -m pangenome_matrix_t0.tab
-s
parse_pangenome_matrix.pl -m pangenome_matrix_t0.tab -s
# /export/apps/get_homologues/parse_pangenome_matrix.pl -m pangenome_matrix_t0.tab -I -A -B -a 0 -g 0 -e 0 -p -s 1 -l 0 -x 0 -P 100 -S 0
# matrix contains 373 clusters and 12 taxa
# cloud size: 183 list: pangenome_matrix_t0__cloud_list.txt
# shell size: 130 list: pangenome_matrix_t0__shell_list.txt
# soft core size: 60 list: pangenome_matrix_t0__softcore_list.txt
# core size: 31 (included in soft core) list: pangenome_matrix_t0__core_list.txt
# using default colors, defined in %COLORS
# globals controlling R plots: $YLIMRATIO=1.2
# shell bar plots: pangenome_matrix_t0__shell.png , pangenome_matrix_t0__shell.pdf
# shell circle plots: pangenome_matrix_t0__shell_circle.png , pangenome_matrix_t0__shell_circle.pdf
# pan-genome size estimates (Snipen mixture model PMID:19691844): pangenome_matrix_t0__shell_estimates.tab
Core.size Pan.size BIC LogLikelihood
2 components 31 375 2897.63753894041 -1439.93640184074
3 components 24 436 1648.86318314942 -809.627645525601
4 components 21 507 1626.68849010487 -792.618720583683 <== best BIC VALUE!
5 components 18 508 1638.26140215658 -792.483598189895
6 components 0 508 1649.91033432187 -792.386485852895
7 components 6 508 1661.76417317612 -792.391826860376
8 components 6 508 1673.59937473543 -792.387849220388
9 components 10 507 1685.47784762243 -792.405507244245
10 components 20 505 1697.57598545494 -792.532997740852
Sample 31 373 NA NA
# occupancy stats:
cloud shell soft_core core
Salmonella_enterica_33676_pIncAC.gbk 52 46 59 31
Salmonella_enterica_YU39_pIncAC.gbk 43 56 60 31
Salmonella_enterica_pAM04528.gbk 5 92 59 31
Salmonella_enterica_sv_Kentucky.gbk 15 78 58 31
_Aeromonas_hydrophila_plasmid_pRA1_NC_012885.gbk 27 68 53 31
_Escherichia_coli_UMNK88_plasmid_pUMNK88_NC_017645.gbk 5 90 60 31
_Escherichia_coli_plasmid_pAPEC1990_61_NC_019066.gbk 4 88 60 31
_Escherichia_coli_plasmid_pAR060302_NC_012692.gbk 3 105 60 31
_Escherichia_coli_plasmid_pNDM-1_Dok01_NC_018994.gbk 51 98 60 31
_Escherichia_coli_plasmid_pPG010208_NC_019065.gbk 9 70 59 31
_Escherichia_coli_plasmid_peH4H_NC_012690.gbk 4 86 53 31
_Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk 28 51 50 31
# The command produces also lists of genes in each pan-genome "compartment"
ls *list.txt
# >>>>>> Textual exploration of the pangenome_matrix_t0__*_list.txt
# inspect the pangenome_matrix_t0__*_list.txt for the presence plasmid replication and mobilization genes
grep -Ei 'mob|tra|rep' pangenome_matrix_t0*txt | egrep -v 'transpo|trans'
# 9.5 inspect the pangenome_matrix_t0__*_list.txt for the presence of
# beta-lactamase or tetracycline resistance genes
grep -Ei 'bla|NDM|tet|efflux|resist|aminogly|sul|aad|RND' *txt
# 10. Display the barplot depicting the pan-genome structure of 12 pIncA/C plasmids
eog pangenome_matrix_t0__shell.png &
Figure 7. Barplot of the pan-genome structure of 12 pIncA/C plasmids
The final GET_HOMOLOGUES feature I’m gonna show is how to parse the pan-genome matrix to identify lineage-specific genes and family expansions.
As shown below on the pan-genome parsimony tree we computed previously with compare_clusters.pl -T, there are two plasmids labeled as pNDM. These contain the “New Dehli metallo-betaclactamase” NDM1, which confers resistance to essentially all beta-lactams, including carbepenems, except aztreonam.
/---+ Escherichia coli plasmid pAPEC1990 61 NC 019066 gbk... /--+ | \+ Escherichia coli UMNK88 plasmid pUMNK88 NC 017645 gbk... | /+ /---+ Escherichia coli plasmid pPG010208 NC 019065 gbk... || | |\--+ /------------+ Aeromonas hydrophila plasmid pRA1 NC 012885 gbk... | | | | \------+ /------------+ Klebsiella pneumoniae plasmid pNDM KN NC 019153 gbk... <== | | /-----+ /--------------+ \--+ \--------------+ Escherichia coli plasmid pNDM 1 Dok01 NC 018994 gbk... <== | | | | | \-------+ Salmonella enterica sv Kentucky gbk... | | | | /---------+ Escherichia coli plasmid peH4H NC 012690 gbk... | | /--+ =+ \--+ \ Escherichia coli plasmid pAR060302 NC 012692 gbk... | | | \+ Salmonella enterica pAM04528 gbk... | +-------+ Salmonella enterica YU39 pIncAC gbk... | \--------------+ Salmonella enterica 33676 pIncAC gbk... |------|------|-----|------|------|------|-----|--- 0 25 50 75 100 125 150 175 substitutions/site
As seen on the tree, the two pNDMs form a monophyletic group.
Let’s now find out what unique genes are carried specifically by these two plasmids.
# Detection of lineage-specific genes using parse_pangenome_matrix.pl
# assumes that you are still in /home/vinuesa/pangenomics/pIncAC_homologues/pan_CM
# find the clusters containing the NDM-1 genes
grep -l blaNDM-1 *faa
# find the plasmids that hold these genes containing the NDM-1 genes
for f in $(grep -l blaNDM-1 *faa); do echo "### $f"; grep '>' $f | cut -d\| -f4; done
# generate the lists of genomes to be compared for lineage specific genes (in list A vs. B)
panGmat_dir=$(pwd)
cd ${gbk_dir}
ls *gbk | grep pNDM > listA_pNDB
ls *gbk | grep -v pNDM > listB_nonNDB
cd $panGmat_dir
# >>> parse the pangenome matrix file to find the listA-specific genes
parse_pangenome_matrix.pl -A $gbk_dir/listA_pNDB -B $gbk_dir/listB_nonNDB -g -m pangenome_matrix_t0.tab
# /export/apps/get_homologues/parse_pangenome_matrix.pl -m pangenome_matrix_t0.tab -I -A /home/vinuesa/pangenomics/pIncAC/listA_pNDB -B $/pangenomics/pIncAC/listB_nonNDB -a 0 -g 1 -e 0 -p -s 0 -l 0 -x 0 -P 100 -S 0
# matrix contains 373 clusters and 12 taxa
# taxa included in group A = 2
# taxa included in group B = 10
# finding genes present in A which are absent in B ...
# file with genes present in set A and absent in B (19): pangenome_matrix_t0__pangenes_list.txt
# explore the file with genes present in set A and absent in B (19): pangenome_matrix_t0__pangenes_list.txt
# i.e. present only in the NDM-1 expressing plasmids
cat pangenome_matrix_t0__pangenes_list.txt
# /export/apps/get_homologues/parse_pangenome_matrix.pl -m pangenome_matrix_t0.tab -A /home/vinuesa/pangenomics/pIncAC/listA_pNDB -B /home/vinuesa/pangenomics/pIncAC/listB_nonNDB -g 1 -e 0 -p -P 100 -S 0
# genes present in set A and absent in B (19):
2020_hypothetical_protein.fna
2021_type_I_restriction-m...fna
2022_anticodon_nuclease.fna
2023_restriction_modifica...fna
2024_ATP-dependent_DNA_he...fna
2025_DNA_replication_and_...fna
2026_type_I_site-specific...fna
2027_Tn7-like_transpositi...fna
2028_Tn7-like_transpositi...fna
2029_Tn7-like_transpositi...fna
2030_Tn7-like_transpositi...fna
2033_truncated_Rhs_family...fna
2036_GroES.fna
2037_CutA1.fna
2038_oxidoreductase_domai...fna
2039_phosphoribosylanthra...fna
2040_bleMBL.fna
2041_NDM-1.fna
2043_RmtB.fna
As seen from the output above, the cluster numbers are consecutive, indicating that the pNDM genes are part of a larger gene cluster. This can be visually confirmed by mapping the specific genes on the map of one of the “listA” genomes
parse_pangenome_matrix.pl
[options]:
-h this message
-m input pangenome matrix .tab (required, made by compare_clusters.pl)
-s report cloud,shell,soft core and core clusters (optional, creates plot if R is installed)
-l list taxa names present in clusters reported in -m matrix (optional, recommended before using -p option)
-x produce matrix of intersection pangenome clusters (optional, requires -s)
-I use only taxon names included in file (optional, ignores -A,-g,-e)
-A file with taxon names (.faa,.gbk,.nucl files) of group A (optional, example -A clade_list_pathogenic.txt)
-B file with taxon names (.faa,.gbk,.nucl files) of group B (optional, example -B clade_list_symbiotic.txt)
-a find genes/clusters which are absent in B (optional, requires -B)
-g find genes/clusters present in A which are absent in B (optional, requires -A & -B)
-e find gene family expansions in A with respect to B (optional, requires -A & -B)
-S skip clusters with occupancy <S (optional, requires -x/-a/-g, example -S 2)
-P percentage of genomes that must comply presence/absence (optional, default=100, requires -g)
-p plot pangenes on the genome map of this group A taxon (optional, example -p 'Escherichia coli K12',
requires -g, -A & -B, only works with clusters
derived from input files in GenBank format)
Let’s use option \(-p\) to plot pangenes on the genome map of _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk, a representative of “group A” genomes saved in listA_pNDB.
parse_pangenome_matrix.pl -A $gbk_dir/listA_pNDB -B
$gbk_dir/listB_nonNDB -g -m pangenome_matrix_t0.tab -p
_Klebsiella_pneumoniae_plasmid
# plot the positions of listA-specific genes on one of the genomes of this group
# use option -l to list taxa names present in clusters reported in -m matrix
parse_pangenome_matrix.pl -l -m pangenome_matrix_t0.tab
parse_pangenome_matrix.pl -A $gbk_dir/listA_pNDB -B $gbk_dir/listB_nonNDB -g -m pangenome_matrix_t0.tab -p "Klebsiella pneumoniae"
# visualize the mapping of the pNDM-specific genes on _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk, as a representative of "group A"
eog pangenome_matrix_t0_Klebsiellapneumoniae_NC_019153.png &
After localizing the genomic region of interest, we can plot it with add hoc tools like snapgene-viewer
Before leaving, run:
# Deactivate the get_homologues conda environment
conda deactivate
to deactivate the get_homologues conda environment.
That’s it, cheers.
GET_PHYLOMARKERS can be used to estimate maximum-likelihood phylogenies from core- and pan-genome datasets computed by GET_HOMOLOGUES.
The main usage of the GET_PHYLOMARKERS pipeline is to estimate core-genome phylogenies with selected optimal markers.
To do so, it implements a series of sequential filters (Fig. 8 and explained below) to identify markers with optimal attributes for phylogenomic inference, that satisfy the following criteria:
The GET_PHYLOMARKERS package implements many analysis possibilities. Here we will review only some of them. Visit the GET_PHYLOMARKERS manual
estimate_pangenome_phylogenies.sh is the core \(script\) of the GET_PHYLOMARKERS distribution. It implements many sophisticated phylogenetic analysis features using state-of-the-art methods and software. It is highly configurable, as reflected in the many options that an advanced user can set to fine-tune its behaviour.
run_get_phylomarkers_pipeline.sh v.2.2.8.1_16Jul2019 OPTIONS:
REQUIRED:
-R <integer> RUNMODE
1 select optimal markers for phylogenetics/phylogenomics (genomes from different species)
2 select optimal markers for population genetics (genomes from the same species)
-t <string> type of input sequences: DNA|PROT
OPTIONAL:
-h flag, print this short help notes
-H flag, print extensive Help and details about search modes and models
-A <string> Algorithm for tree searching: <F|I> [FastTree|IQ-TREE] [default:I]
-c <integer> NCBI codontable number (1-23) for pal2nal.pl to generate codon alignment [default:11]
-C flag to print codontables
-D flag to print debugging messages; please use if you encounter problems executing the code [default: 0]
-e <integer> select gene trees with at least (min. = 4) external branches [default: 4]
-f <string> GET_HOMOLOGUES cluster format <STD|EST> [default: STD]
-k <real> kde stringency (0.7-1.6 are reasonable values; less is more stringent) [default: 1.5]
-K flag to run molecular clock test on codon alignments [default: 0]
-l <integer> max. spr length (7-12 are reasonable values) [default: 10]
-m <real> min. average support value (0.7-0.8 are reasonable values) for trees to be selected [default: 0.7]
-M <string> base Model for clock test (use one of: GTR|TrN|HKY|K2P|F81); uses +G in all cases [default: GTR]
-n <integer> number of cores/threads to use [default: all cores]
-N <integer> number of IQ-TREE searches to run [only active with -T high] [default: 5]
-q <real> quantile (0.95|0.99) of Chi-square distribution for computing molec. clock p-value [default: 0.99]
-r <string> root method (midpoint|outgroup) [default: midpoint]
-s <integer> number of spr rounds (4-20 are reasonable values) for FastTree tree searching [default: 4]
-S <string> quoted 'comma-separated list' of base models to be evaluated by IQ-TREE
when estimating the species tree from the concatenated supermatrix (see -H for details).
If no -S is passed, then sinlge default models are used, as shown below
<'JC,F81,K2P,HKY,TrN,TNe,K3P,K81u,TPM2,TPM2u,TPM3,TPM3u,
TIM,TIMe,TIM2,TIM2e,TIM3,TIM3e,TVM,TVMe,SYM,GTR'> for DNA alignments [default: GTR]
<'BLOSUM62,cpREV,Dayhoff,DCMut,FLU,HIVb,HIVw,JTT,JTTDCMut,LG,
mtART,mtMAM,mtREV,mtZOA,Poisson,PMB,rtREV,VT,WAG'> for PROT alignments [default: LG]
-T <string> tree search Thoroughness: high|medium|low|lowest (see -H for details) [default: medium]
-v flag, print version and exit
-V flag, print software versions
INVOCATION EXAMPLES:
1. default on DNA sequences (uses IQ-TREE evaluating a subset of models specified in the detailed help)
run_get_phylomarkers_pipeline.sh -R 1 -t DNA
2. thorough FastTree searching and molecular clock analysis on DNA sequences using 10 cores:
run_get_phylomarkers_pipeline.sh -R 1 -t DNA -A F -k 1.2 -m 0.7 -s 20 -l 12 -T high -K -M HKY -q 0.95 -n 10
3. FastTree searching on a huge protein dataset for fast inspection
run_get_phylomarkers_pipeline.sh -R 1 -A F -t PROT -m 0.6 -k 1.0 -T lowest
4. To run the pipeline on a remote server, we recommend using the nohup command upfront, as shown below:
nohup run_get_phylomarkers_pipeline.sh -R 1 -t DNA -S 'TNe,TVM,TVMe,GTR' -k 1.0 -m 0.75 -T high -N 5 &> /dev/null &
5. Run in population-genetics mode (generates a table with descritive statistics for DNA-polymorphisms
and the results of diverse neutrality test)
run_get_phylomarkers_pipeline.sh -R 2 -t DNA
NOTES
1: run from within the directory holding core gene clusters generated by get_homologues.pl -e or
compare_clusters.pl with -t no_genome_gbk files (core genome: all clusters with a single gene copy per genome)
2: If you encounter any problems, please run the script with the -D -V flags added at the end of the command line,
redirect STOUT to a file and send us the output, so that we can better diagnose the problem.
e.g. run_get_phylomarkers_pipeline.sh -R 1 -t DNA -k 1.0 -m 0.7 -s 8 -l 10 -T high -K -D &> ERROR.log
For the details, please visit the manual
# assumes that you are within test_sequences/pIncAC_homologues/ or in test_sequences/core_BCM
cd core_BCM
# 1. make sure we have the same nuber of faa and fna cluster files;
# Note that GET_PHYLOMARKERS will check that for you automatically ;)
find . -name '*.faa' | wc -l # 31
find . -name '*.fna' | wc -l # 31
# 2.1 view the help menu of the main script run_get_phylomarkers_pipeline.sh
run_get_phylomarkers_pipeline.sh -h
# 2.2 run the pipeline under default values. See the manual for the many additional options
run_get_phylomarkers_pipeline.sh -R 1 -t DNA -k 1.0 # takes 55 seconds on the above-mentioned machine
GET_PHYLOMARKERS prints informative progress messages to screen and to a logfile. Compare them with the pipeline’s flowchart shown in Fig. 8 to make sure you understand what computations are performed at each step.
# >>> explore the directories:
ls
cd get_phylomarkers_run_AIR1tDNA_k1.5_m0.7_Tmedium_30Jan18 # holds the compressed input
# FASTAS and alignments
# >>> cd into PhiPac/ dir, which holds the results of the recombination tests.
cd PhiPack
# have a peak a the file
head Phi_results_30Jan18.tsv
## ./1961_hypothetical_protein_cdnAln_Phi.log 1.00e+00 1.00e+00
## ./1962_DNA_topoisomerase_II_cdnAln_Phi.log 1.00e+00 1.00e+00
## ./1964_hypothetical_protein_cdnAln_Phi.log 1.00e+00 1.00e+00
## ./1966_site-specific_DNA_me_cdnAln_Phi.log 1.00e+00 1.00e+00
## ./1967_hypothetical_protein_cdnAln_Phi.log 1 1 TOO_FEW_INFORMATIVE_SITES
## ./1968_hypothetical_protein_cdnAln_Phi.log 1.00e+00 1.00e+00
## ./1969_putative_type_I_rest_cdnAln_Phi.log 1 1 TOO_FEW_INFORMATIVE_SITES
## ./1970_KorB_cdnAln_Phi.log 1.00e+00 1.00e+00
## ./1971_ParA_cdnAln_Phi.log 1.00e+00 1.00e+00
## ./1984_hypothetical_protein_cdnAln_Phi.log 1.00e+00 1.00e+00
Depending on the type of the input sequences (-t DNA|PROT), the results of the different phylogenetic filters performed on gene trees will live either in the non_recomb_cdn_alns (-t DNA) or in the non_recomb_FAA_alns (-t PROT) directories.
# >>> cd into the directory holding the non-recombinant sequences,
# which contains the results of several phylogenetic attributes of the loci
cd non_recomb_cdn_alns/
# 1.1 to identify the kde-test outlier loci see the contents of kde_outliers, in this case only 1 "outlier" locus was detected
ls kde_outliers
# 1.2 the details of the kdetrees test are provided in tsv-formatted file kde_dfr_file_all_gene_trees.tre.tab
head kde_dfr_file_all_gene_trees.tre.tab
## file kde_topo_dens kde_topo_test kde_bl_dens kde_bl_test kde_bl_topo_test
## 1961_hypothetical_protein_cdnAln.fasta.treefile 0.288751935193142 ok 63.2675110841703 ok ok
## 1962_DNA_topoisomerase_II_cdnAln.fasta.treefile 0.252919761356219 ok 28.7241410392773 ok ok
## 1964_hypothetical_protein_cdnAln.fasta.treefile 0.287751455109637 ok 17.2677657886815 ok ok
## 1966_site-specific_DNA_me_cdnAln.fasta.treefile 0.240400359789596 outlier 27.0113341481227 ok outlier
## 1967_hypothetical_protein_cdnAln.fasta.treefile 0.252639796140819 ok 95.8112601171774 ok ok
## 1968_hypothetical_protein_cdnAln.fasta.treefile 0.297536609843425 ok 96.2903342166101 ok ok
## 1969_putative_type_I_rest_cdnAln.fasta.treefile 0.314498361683638 ok 95.2603170008101 ok ok
## 1970_KorB_cdnAln.fasta.treefile 0.331685645962565 ok 79.0447022830493 ok ok
## 1971_ParA_cdnAln.fasta.treefile 0.334991727858742 ok 55.4211725899592 ok ok
# 2.1 have a look at the counts of best-fitting models selected
cat IQT_best_model_counts_for_gene_trees.tsv
## model count
## HKY+F 2
## HKY+F+G4 1
## K2P 26
## K2P+G4 1
## TIMe 1
# 2.2 and locus-specific stats are provied in IQT_DNA_gene_tree_Tmedium_stats.tsv
head IQT_DNA_gene_tree_Tmedium_stats.tsv
## alignment wc_secs CPU_secs lnL model s_type
## ./1961_hypothetical_protein_cdnAln.fasta.log 0.164 0.112 -1107.994 K2P IQTdnaTmedium
## ./1962_DNA_topoisomerase_II_cdnAln.fasta.log 0.251 0.184 -4124.482 HKY+F+G4 IQTdnaTmedium
## ./1964_hypothetical_protein_cdnAln.fasta.log 0.284 0.172 -1225.145 K2P+G4 IQTdnaTmedium
## ./1966_site-specific_DNA_me_cdnAln.fasta.log 0.204 0.108 -2183.089 K2P IQTdnaTmedium
## ./1967_hypothetical_protein_cdnAln.fasta.log 0.089 0.072 -870.935 K2P IQTdnaTmedium
## ./1968_hypothetical_protein_cdnAln.fasta.log 0.099 0.076 -1179.667 K2P IQTdnaTmedium
## ./1969_putative_type_I_rest_cdnAln.fasta.log 0.064 0.060 -1111.931 K2P IQTdnaTmedium
## ./1970_KorB_cdnAln.fasta.log 0.130 0.116 -2112.818 TIMe IQTdnaTmedium
## ./1971_ParA_cdnAln.fasta.log 0.129 0.072 -1157.621 K2P IQTdnaTmedium
# 3. graphical summaries of the results of the kdetrees and the distribuions of support-values are provided as svg files.
ls -1 *svg
## dotplot_and_bxplot_kdeDensity_dist_dissim_topo_TRUE.svg
## parallel_bxplots_kdeDensity_dist_dissim_topo_TRUE-FALSE.svg
## scatterplot_for_gene_tree_support_values.svg
Figure 9 below depicts the results of the non-parametric kdetrees test, run at the default stringency level of k = 1.5. As depicted on the graph, only one outlier is detected based on the topology (lower panel). Note that this and following graphs plot the results of a small (toy) dataset. They become more useful with larger ones.
IMPORTANT NOTE: to visualize the figures, you need to access the corresponding files from your local host, as the Docker container does not provide a graphical environment with visualization tools! Just open a second terminal (or use your system’s file browser) to access the working directory and display the figures.
Figure 10 depicts a scatterplot and a histogram summarizing the distribution of mean SH-support values for the 25 gene trees that reached this point in the pipeline.
Finally we will inspect the contents of the top_15_markers_ge70perc/ directory, which holds the top-scoring markers that passed the above-mentioned filters (Fig. 1). In this directory you will find the supermatrix resulting from their concatenation and the “species-tree” estimated from it under the best-fitting model identified by ModelFinder/IQ-TREE. It contains also the resutls of RF-distances of gene trees to species-tree and the global molecular clock test results if the main script was invoked with -R 1 -t DNA -K
# >>> inspecting the contents of the top_15_markers_ge70perc/ directory
cd top_15_markers_ge70perc
# The concatenation coordinates for the supermatrix are saved in concatenation_coordinates.txt
cat concatenation_coordinates.txt
## concatenation coordinates:
## 1961_hypothetical_protein_cdnAln.fasta = 1-615
## 1967_hypothetical_protein_cdnAln.fasta = 616-1113
## 1968_hypothetical_protein_cdnAln.fasta = 1114-1785
## 1970_KorB_cdnAln.fasta = 1786-2961
## 1971_ParA_cdnAln.fasta = 2962-3642
## 1984_hypothetical_protein_cdnAln.fasta = 3643-3975
## 1989_hypothetical_protein_cdnAln.fasta = 3976-4830
## 1990_hypothetical_protein_cdnAln.fasta = 4831-5346
## 1994_DSBA_oxidoreductase_cdnAln.fasta = 5347-6204
## 1995_putative_signal_pept_cdnAln.fasta = 6205-7110
## 1996_hypothetical_protein_cdnAln.fasta = 7111-7962
## 1997_hypothetical_protein_cdnAln.fasta = 7963-8403
## 1998_hypothetical_protein_cdnAln.fasta = 8404-8922
## 1999_hypothetical_protein_cdnAln.fasta = 8923-9381
## 2012_TraF_cdnAln.fasta = 9382-10395
# some phylogenetic properties of the markers are summarized in sorted_aggregated_support_values4loci.tab
# graphical analysis of RF-distances of gene-trees to the species-tree is found in
# scatterplot_RF-dist_of_gene_trees2concat_phylo.svg
# The final tree concat_nonRecomb_KdeFilt_iqtree_GTR+F+ASC_ed.sptree can be conveniently
# visualized and edited with figtree
figtree concat_nonRecomb_KdeFilt_iqtree_GTR+F+ASC_ed.sptree &
Figure 11 depicts the RF-distances of the 15 top-scoring markers to the ML tree inferred under the GTR+F+ASC substitution model from the concatenation of these loci (shown in Fig. 12)
Figure 12 depicts the best ML “species tree” inferred under the GTR+F+ASC substitution model from the supermatrix of 15 top-scoring markers. The nodes are colored according to the legend. The values on the left correspond to approximate Bayes branch support values and those to the right to the UFBoot values described in the manual.
# assumes that you are within test_sequences/pIncAC_homologues/, or in test_sequences/pan_genome
cd pan_CM
# 1. find the pangenome_matrix
ls pangenome_matrix*
# 2. run estimate_pangenome_phylogenies.sh launching 5 independent iqtree searches,
# fitting binary (BIN) models
estimate_pangenome_phylogenies.sh -f pangenome_matrix_t0.fasta -r 5
The estimate_pangenome_phylogenies.sh reports its progress to the screen, informing that: - it created and moved into subdir iqtree_PGM_5_runs, which holds the results of the analysis - the best-fitting binary model was GTR2+FO - after determining the best model, it performs 10 independent IQ-TREE searches - Best IQ-TREE run was: abayes_UFBboot_run5.log:BEST SCORE FOUND : -1897.260 - wrote file best_PGM_IQT_abayes_UFBboot_run9_GTR2+FO.treefile in pan_CM/iqtree_PGM_5_runs/iqtree_10_runs
Lets have a look at the tree search profile
cd iqtree_PGM_10_runs/iqtree_10_runs
cat sorted_lnL_scores_IQ-TREE_searches.out
## abayes_UFBboot_run3.log:BEST SCORE FOUND : -1897.260
## abayes_UFBboot_run2.log:BEST SCORE FOUND : -1897.260
## abayes_UFBboot_run4.log:BEST SCORE FOUND : -1897.260
## abayes_UFBboot_run1.log:BEST SCORE FOUND : -1897.260
## abayes_UFBboot_run5.log:BEST SCORE FOUND : -1897.260
No wonder, in such a small dataset, all searches find exactly the same best tree. However, this will be less likely the case as the number of sequences in the dataset increases, as the size of the tree-spece increases factorialy with each new sequence.
figtree best_PGM_IQT_abayes_UFBboot_run5_GTR2+FO.treefile &
Figure 13 displays the best ML pan-genome tree, again using figtree. The nodes are colored according to the legend. The first value corresponds to approximate Bayes branch support values and second ones to the UFBoot values described in the manual.
# assumes that you are within test_sequences/pIncAC_homologues/ or test_sequences/pan_genome
cd pan_CM
# 1. find the pangenome_matrix
ls pangenome_matrix*
# 2. run estimate_pangenome_phylogenies.sh launching 10 independent iqtree searches on
# all available cores (-n 4 in this case), fitting binary (BIN) models
estimate_pangenome_phylogenies.sh -c PARS -R 3 -i pangenome_matrix_t0.phylip -n 4 -b 25 -j 10 -t 1
After changing into the boot_pars/ dir we can edit and visualize the most parsimonious tree found among 100 pars (from the PHYLIP package) searches split on 4 cores (-n 4) with 10 taxon jumples searches using again figtree. See the manual for the details.
cd boot_pars
figtree full_pars_tree_rooted_withBoot_ed.ph &
Figure 14 displays the most parsimonious pan-genome tree, again using figtree. The nodes are colored according to the legend. The nodes are colored according to the legend, which represents standard bootstrap support values computed by seqboot from the PHYLIP package.
If you are running the tutorials from a Docker image instance, do you remember how to exit the container? It’s with exit
That’s it, enjoy!
The code is developed and maintained by Pablo Vinuesa at CCG-UNAM, Mexico and Bruno Contreras-Moreira at EEAD-CSIC, Spain. It is released to the public domain under the GNU GPLv3 license.
We thank Alfredo J. Hernández and Víctor del Moral at CCG-UNAM for technical support.
We gratefully acknowledge the funding provided by DGAPA-PAPIIT/UNAM (grants IN201806-2, IN211814, IN206318 and IN209321) and CONACyT-Mexico (grants P1-60071, 179133, FC-2015-2-879 and A1-S-11242) to Pablo Vinuesa, as well as the Fundación ARAID,Consejo Superior de Investigaciones Científicas (grant 200720I038 and Spanish MINECO (AGL2013-48756-R) to Bruno Contreras-Moreira.