1 Aim of the tutorial

This tutorial will show how to use the get_homologues (Contreras-Moreira & Vinuesa, 2013) and get_phylomarkers (Vinuesa et al. 2018) software packages to perform a thorough pan-genomic and phylogenomic analyses of a small dataset of conjugative multiresistance plasmids of the pIncC (formerly IncA/C) incompatibility group of great concern in the clinical setting Ambrose et al. 2018. This is an updated version of a similar tutorial published by Vinuesa & Contreras-Moreira 2015.

1.1 GET_HOMOLOGUES and GET_PHYLOMARKERS source code and documentation


2 Running GET_HOMOLOGUES and GET_PHYLOMARKERS on buluc|tepeu using the Linux \(Shell\)

Both software packages can be run on any platform using the Docker container, or directly on UNIX and GNU/Linux servers.

In any case, you will have to work using Linux \(Shell\) commands. Therefore it is important that you have some proficiency working on the command line.

Here is a complete tutorial on biocomputing in the GNU/Linux environment, with lots of examples.

2.1 Connecting to buluc or tepeu

The whole exercises will be run on buluc or tepeu You need to log into the server with the \(ssh\) command, available in any Linux machine or from the \(mobaXterm\) package that you should have installed on your Windows machine.

ssh username@...

type your password

ssh -X vinuesa@*.*.*.*
vinuesa@*.*.*.*'s password: 
Last login: Wed Nov 11 17:46:45 2020 from akbal.*.*.*
ooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo

Bienvenidos a BULUC

...

[vinuesa@buluc ~]$

2.2 Preliminaries: setup your working directory and fetch the test data from the TIB-filo GitHub repository

The GET_PHYLOMARKERS distribution provides a test_sequences/ directory which holds the subdirectories core_genome/, pan_genome/ and pIncAC/.

The pIncAC/ directory holds the source *.gbk GenBank files. This directory has a README.txt file that briefly describes the GenBank files.

#0.  Welcome to buluc|chaac|tepeu

# processors available
awk '/^proc/ {n++} END{print n}' /proc/cpuinfo

# disk space on file system
df -h

# check system load
htop

#>>>>> 1) mkdir sesion7_pangenomics, cd into it and create symbolic links to the source GenBank files to be processed (12 pIncA/C plasdmis)
#         The paths provided below will only work for users running in their home account on buluc.

# cd to your working directory and
mkdir sesion7_pangenomics && cd sesion7_pangenomics

# save path in variable top_dir
top_dir=$(pwd)
echo $top_dir

# fetch the data from TIB-filo's GitHub repo
wget -c https://github.com/vinuesa/TIB-filoinfo/raw/master/docs/sesion7_pangenomica_GET_HOMOLOGUES/data/pIncAC.tgz
tar -xzf pIncAC.tgz && rm pIncAC.tgz && cd pIncAC

# list files
ls -l

# how many gbk files do we have?
ls *gbk | wc -l

# peak at one of them using head and tail
head -50  _Aeromonas_hydrophila_plasmid_pRA1_NC_012885.gbk
tail _Aeromonas_hydrophila_plasmid_pRA1_NC_012885.gbk

# How many CDSs do these gbk files have? - identify the genome with the lowest number of CDSs
grep -c ' CDS ' *gbk | sed 's/:/\t/' | sort -nk2

# save the absolute path of this directory for easy reference later on
gbk_dir=$(pwd)

# return to the parent directory and save its full path to the variable top_dir for easy reference
# cd ..
cd $top_dir

# where am I?
pwd

# list; you should see the pIncAC/ directory holding the *gbk files
ls

3 GET_HOMOLOGUES tutorial: computing core- and pan-genomes

In this section we will compute consensus core- and pan-genomes of the pIncA/C dataset as the intersection of clusters computed by the tree clustering algorithms implemented in get_homologues, as shown in the flowchart below:

3.1 GET_HOMOLOGUES options

GET_HOMOLOGUES is a sophisticated piece of software, with several external dependencies and auxiliary scripts, as depicted in Fig. 1.

To run GET_HOMOLOGUES on chaac we need to start the corresponding conda environment

# Activate conda's get_homologues environment
conda activate get_homologues

3.1.1 Checking that all dependencies are in place

To make sure that get_homlogues.pl is in PATH and that the kit is complete type the following command

get_homologues.pl -v

# 0. make sure get_homlogues.pl in PATH and that the kit is complete
/export/apps/get_homologues/get_homologues.pl version 26022020

Program written by Bruno Contreras-Moreira (1) and Pablo Vinuesa (2).

 1: http://www.eead.csic.es/compbio (Estacion Experimental Aula Dei/CSIC/Fundacion ARAID, Spain)
 2: http://www.ccg.unam.mx/~vinuesa (Center for Genomic Sciences, UNAM, Mexico)

Primary citations:

 Contreras-Moreira B, Vinuesa P (2013) GET_HOMOLOGUES, a versatile software package for scalable and
 robust microbial pangenome analysis. Appl Environ Microbiol 79(24):7696-701. (PubMed:24096415)

 Vinuesa P and Contreras-Moreira B (2015) Robust Identification of Orthologues and Paralogues for 
 Microbial Pan-Genomics Using GET_HOMOLOGUES: A Case Study of pIncA/C Plasmids. In Bacterial Pangenomics,
 Methods in Molecular Biology Volume 1231, 203-232, edited by A Mengoni, M Galardini and M Fondi.

This software employs code, binaries and data from different authors, please cite them accordingly:
 OrthoMCL v1.4 (www.orthomcl.org , PubMed:12952885)
 COGtriangles v2.1 (sourceforge.net/projects/cogtriangles , PubMed=20439257)
 NCBI Blast-2.2 (blast.ncbi.nlm.nih.gov , PubMed=9254694,20003500)
 Bioperl v1.5.2 (www.bioperl.org , PubMed=12368254)
 HMMER 3.1b2 (hmmer.org)
 Pfam (http://pfam.xfam.org , PubMed=26673716)
 Diamond v0.8.25 (https://github.com/bbuchfink/diamond, PubMed=25402007)

Checking required binaries and data sources, all set in phyTools.pm :
        EXE_BLASTP : OK (path:/export/apps/get_homologues-x86_64-20200226/bin/ncbi-blast-2.8.1+/bin/blastp)
        EXE_BLASTN : OK (path:/export/apps/get_homologues-x86_64-20200226/bin/ncbi-blast-2.8.1+/bin/blastn)
      EXE_FORMATDB : OK (path:/export/apps/get_homologues-x86_64-20200226/bin/ncbi-blast-2.8.1+/bin/makeblastdb)
           EXE_MCL : OK (path:/export/apps/get_homologues-x86_64-20200226//bin/mcl-14-137/src/shmcl/mcl)
      EXE_MAKEHASH : OK (path:/export/apps/get_homologues-x86_64-20200226//bin/COGsoft/COGmakehash/COGmakehash )
     EXE_READBLAST : OK (path:/export/apps/get_homologues-x86_64-20200226//bin/COGsoft/COGreadblast/COGreadblast )
        EXE_COGLSE : OK (path:/export/apps/get_homologues-x86_64-20200226//bin/COGsoft/COGlse/COGlse )
        EXE_COGTRI : OK (path:/export/apps/get_homologues-x86_64-20200226//bin/COGsoft/COGtriangles/COGtriangles )
       EXE_HMMPFAM : OK (/export/apps/get_homologues-x86_64-20200226//bin/hmmer-3.1b2/binaries/hmmscan --noali --acc --cut_ga  /export/apps/get_homologues-x86_64-20200226/db/Pfam-A.hmm)
         EXE_DMNDP : OK (path:/export/apps/get_homologues-x86_64-20200226/bin/diamond-0.8.25/diamond blastp)
         EXE_DMNFT : OK (path:/export/apps/get_homologues-x86_64-20200226/bin/diamond-0.8.25/diamond makedb)
        EXE_INPARA : OK (path:/export/apps/get_homologues-x86_64-20200226/_cluster_makeInparalog.pl)
         EXE_ORTHO : OK (path:/export/apps/get_homologues-x86_64-20200226/_cluster_makeOrtholog.pl)
         EXE_HOMOL : OK (path:/export/apps/get_homologues-x86_64-20200226/_cluster_makeHomolog.pl)
    EXE_SPLITBLAST : OK (path:/export/apps/get_homologues-x86_64-20200226/_split_blast.pl)
  EXE_SPLITHMMPFAM : OK (path:/export/apps/get_homologues-x86_64-20200226/_split_hmmscan.pl)

3.1.2 Running the main \(script\) get_homologues.pl

The main pipeline depicted above (Fig. 1) is run with the core \(script\) get_homologues.pl.

The options of the \(script\) are displayed by simply calling it with any of the following commands

  • get_homologues.pl
  • get_homologues.pl -h
usage: /export/apps/get_homologues/get_homologues.pl [options]

-h this message
-v print version, credits and checks installation
-d directory with input FASTA files ( .faa / .fna ),           (overrides -i,
   GenBank files ( .gbk ), 1 per genome, or a subdirectory      use of pre-clustered sequences
   ( subdir.clusters / subdir_ ) with pre-clustered sequences   ignores -c, -g)
   ( .faa / .fna ); allows for new files to be added later;    
   creates output folder named 'directory_homologues'
-i input amino acid FASTA file with [taxon names] in headers,  (required unless -d is set)
   creates output folder named 'file_homologues'

Optional parameters:
-o only run BLAST/Pfam searches and exit                       (useful to pre-compute searches)
-c report genome composition analysis                          (follows order in -I file if enforced,
                                                                ignores -r,-t,-e)
-R set random seed for genome composition analysis             (optional, requires -c, example -R 1234,
                                                                required for mixing -c with -c -a runs)
-m runmode [local|cluster]                                     (default local)
-n nb of threads for BLAST/HMMER/MCL in 'local' runmode        (default=2)
-I file with .faa/.gbk files in -d to be included              (takes all by default, requires -d)

Algorithms instead of default bidirectional best-hits (BDBH):
-G use COGtriangle algorithm (COGS, PubMed=20439257)           (requires 3+ genomes|taxa)
-M use orthoMCL algorithm (OMCL, PubMed=12952885)

Options that control sequence similarity searches:
-X use diamond instead of blastp                               (optional, set threads with -n)
-C min %coverage in BLAST pairwise alignments                  (range [1-100],default=75)
-E max E-value                                                 (default=1e-05,max=0.01)
-D require equal Pfam domain composition                       (best with -m cluster or -n threads)
   when defining similarity-based orthology
-S min %sequence identity in BLAST query/subj pairs            (range [1-100],default=1 [BDBH|OMCL])
-N min BLAST neighborhood correlation PubMed=18475320          (range [0,1],default=0 [BDBH|OMCL])
-b compile core-genome with minimum BLAST searches             (ignores -c [BDBH])

Options that control clustering:
-t report sequence clusters including at least t taxa          (default t=numberOfTaxa,
                                                                t=0 reports all clusters [OMCL|COGS])
-a report clusters of sequence features in GenBank files       (requires -d and .gbk files,
   instead of default 'CDS' GenBank features                    example -a 'tRNA,rRNA',
                                                                NOTE: uses blastn instead of blastp,
                                                                ignores -g,-D)
-g report clusters of intergenic sequences flanked by ORFs     (requires -d and .gbk files)
   in addition to default 'CDS' clusters
-f filter by %length difference within clusters                (range [1-100], by default sequence
                                                                length is not checked)
-r reference proteome .faa/.gbk file                           (by default takes file with
                                                                least sequences; with BDBH sets
                                                                first taxa to start adding genes)
-e exclude clusters with inparalogues                          (by default inparalogues are
                                                                included)
-x allow sequences in multiple COG clusters                    (by default sequences are allocated
                                                                to single clusters [COGS])
-F orthoMCL inflation value                                    (range [1-5], default=1.5 [OMCL])
-A calculate average identity of clustered sequences,          (optional, creates tab-separated matrix,
 by default uses blastp results but can use blastn with -a      recommended with -t 0 [OMCL|COGS])
-P calculate percentage of conserved proteins (POCP),          (optional, creates tab-separated matrix,
 by default uses blastp results but can use blastn with -a      recommended with -t 0 [OMCL|COGS])
-z add soft-core to genome composition analysis                (optional, requires -c [OMCL|COGS])

 This program uses BLAST (and optionally HMMER/Pfam) to define clusters of 'orthologous'
 genomic sequences and pan/core-genome gene sets. Several algorithms are available
 and search parameters are customizable. It is designed to process (in a HPC computer
 cluster) files contained in a directory (-d), so that new .faa/.gbk files can be added
 while conserving previous BLAST results. In general the program will try to re-use
 previous results when run with the same input directory.

For the details of each option, please have a look a the extensive get_homologues manual. Here we can review only some of the most important ones.

3.2 Running GET_HOMOLOGUES with different clustering algorithms to compute consensus core- and pan-genomes

GET_HOMOLOGUES allows users to run 3 different clustering algorithms (\(BDBH\), \(COGtriangles\) and \(OMCL\)) to define homologous gene families, as defined in Table 1.

  • Table 1. Overview of the three clustering algorithms implemented in get_homologues
name option comment
BDBH default Starting from a reference genome, keep adding genomes stepwise, while storing the sequence clusters that result of merging the best bidirectional best hits
COGS -G Merges triangles of inter-genomic symmetrical best matches, as described in PubMed=20439257
OMCL -M OrthoMCL v1.4, uses the Markov Cluster Algorithm to group sequences, with inflation (-F) controlling cluster granularity, as described in PubMed=12952885

To learn all details around these algorithms, please read the online GET_HOMOLOGUES manual

3.2.1 The sequence data for get_homologues

The only required option is either -i, used to choose an input file, or preferably -d instead, which indicates an input folder.

In previous versions only files with extensions .fa[a] and .gb[k] were considered when parsing the -d directory. Currently, GZIP- or BZIP2-compressed input files are also accepted!

By using .faa input files in theory you might only calculate clusters of protein sequences. In contrast, the advantage of using .gbk files is that you obtain both nucleotide and protein clusters. If both types of input files are combined, only protein clusters will be produced. However, if each input .faa file has a twin .fna file in place, containing the corresponding nucleotide sequences in the same order, the program will attempt to produce the corresponding clusters of nucleotide sequences. The possible input file combinations are summarized in Table 1:

  • Table 2: Valid input file combinations.
input file extensions output clusters
.gbk amino acid + DNA sequence
.faa amino acid sequence
.gbk & .faa amino acid sequence
.faa & .fna amino acid + DNA sequence
.gbk & .faa & .fna amino acid + DNA sequence

In summary, the use of an input folder or directory (-d) containing GenBank files is recommended, as it allows for new files to be added there in the future, reducing the computing required for updated analyses. For instance, if a user does a first analysis with 5 input genomes today, it is possible to check how the resulting clusters would change when adding an extra 10 genomes tomorrow, by copying these new 10 .faa / .gbk input files to the pre-existing -d folder, so that all previous BLAST searches are re-used.

To run \(get\_homologues.pl\ -d\) you need to be in the directory on top of the one holding the GenBank files. In our case this is the one saved in \(\$top\_dir\), holding the \(pIncAC\) directory with the *gbk files.

3.2.2 Computing BDBH clusters with GET_HOMOLOGUES (default clustering mode)

The BDBH algorithm implemented in GET_HOMOLOGUES is a heuristic version that does NOT cluster the results of the all-against-all blast data, just those against the smallest genome (default) or a (\(-R\ refGenome\)), as shown in Fig. 2. Therefore it is not suitable for computing pan-genomes. Use it only for core-genome calculations!

  • Figure 2. GET_HOMOLOGUE’s implementation of the BDBH algorithm using a reference genome


Let’s instruct get_homologues.pl to compute the BDBH clusters for the 12 genomes. Make sure you are in \(\$top\_dir\) and issue the following command, which will compute a core-genome based on the BDBH (default) clustering algorithm, excluding inparaloges (-e)

  • get_homologues.pl -d pIncAC -t 12 -e -n 2 #

Note: the command above is equivalent to get_homologues.pl -d pIncAC -e -n 2, as get_homologues.pl will by default use -t = number_of_genomes available in the \(\$gbk\_dir\)

# 1. run GET_HOMOLOGUES to compute the set of core-genome clusters using the BDBH, 
# get_homologues.pl -d pIncAC &> log.BDBH_def &
get_homologues.pl -d pIncAC -t 12 -e -n 2  # BDBH clusters containing sequences for the 12 genomes (-t 12) 
                                           # excluding those with inparalogues (-e): 
                                           # use 2 processors for the job
                                           # This will take ~ 3 minutes

# version 26022020
# results_directory=/home/vinuesa/pangenomics/pIncAC_homologues
# parameters: MAXEVALUEBLASTSEARCH=0.01 MAXPFAMSEQS=250 BATCHSIZE=100 KEEPSCNDHSPS=1
# diamond job:0

# checking input files...
# Salmonella_enterica_33676_pIncAC.gbk 185
# Salmonella_enterica_YU39_pIncAC.gbk 178
# Salmonella_enterica_pAM04528.gbk 180
# Salmonella_enterica_sv_Kentucky.gbk 164
# _Aeromonas_hydrophila_plasmid_pRA1_NC_012885.gbk 158
# _Escherichia_coli_UMNK88_plasmid_pUMNK88_NC_017645.gbk 173
# _Escherichia_coli_plasmid_pAPEC1990_61_NC_019066.gbk 172
# _Escherichia_coli_plasmid_pAR060302_NC_012692.gbk 189
# _Escherichia_coli_plasmid_pNDM-1_Dok01_NC_018994.gbk 224
# _Escherichia_coli_plasmid_pPG010208_NC_019065.gbk 148
# _Escherichia_coli_plasmid_peH4H_NC_012690.gbk 173
# _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk 137

# 12 genomes, 2081 sequences

# taxa considered = 12 sequences = 2081 residues = 545474 MIN_BITSCORE_SIM = 16.9

# mask=KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_ (_algBDBH)

# running makeblastdb with /home/vinuesa/pangenomics/pIncAC_homologues/Salmonella_enterica_33676_pIncAC.gbk.fasta

# running makeblastdb with /home/vinuesa/pangenomics/pIncAC_homologues/Salmonella_enterica_YU39_pIncAC.gbk.fasta

# ...

# running BLAST searches ...
# done
# ...

# concatenating and sorting BLAST/DIAMOND results...
# sorting _Salmonella_enterica_33676_pIncAC.gbk results (0.097MB)
# sorting _Salmonella_enterica_YU39_pIncAC.gbk results (0.091MB)
# sorting _Salmonella_enterica_pAM04528.gbk results (0.1MB)
# clustering inparalogues in _Escherichia_coli_plasmid_peH4H_NC_012690.gbk
# ...
# done

# parsing blast result! (/home/vinuesa/cursos/pangenomics_PDCBq_Nov20/test_sequences/pIncAC_homologues/tmp/all.blast , 1.1MB)
# parsing file finished

# creating indexes, this might take some time (lines=2.25e+04) ...

# construct_taxa_indexes: number of taxa found = 12
# number of file addresses/BLAST queries = 2.1e+03

# clustering orthologous sequences

# clustering inparalogues in _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk (reference)
# 0 sequences

# clustering inparalogues in Salmonella_enterica_33676_pIncAC.gbk
# 4 sequences

# finding BDBHs between _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk and Salmonella_enterica_33676_pIncAC.gbk
# 59 sequences

# clustering inparalogues in Salmonella_enterica_YU39_pIncAC.gbk
# 3 sequences

# finding BDBHs between _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk and Salmonella_enterica_YU39_pIncAC.gbk
# 84 sequences

# ...

# looking for valid ORF clusters (n_of_taxa=12)...


# number_of_clusters = 33
# cluster_list = pIncAC_homologues/KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_.cluster_list
# cluster_directory = pIncAC_homologues/KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_

# runtime: 218 wallclock secs ( 4.40 usr  0.75 sys + 174.83 cusr 17.86 csys = 197.84 CPU)
# RAM use: 33.7 MB

The results of the previous and following get_homologue.pl runs can be found in the pIncAC_homologues/ directory, which includes the all-against-all blast results.

ls -1d *

pIncAC
pIncAC_homologues

We will explore its contents later on.

Let’s now compute the pan-genome clusters with the COGtriangles and OMCL algorithms, which will use the blastp results already available in pIncAC_homologues/ from the previous BDBH run.

3.2.3 Clustering sequences with the COGtriangles algorithm (\(-G\))

Run with \(-t\ 0\) to compute clusters of all sizes, which is required for the pan-genome analysis.

  • get_homologues.pl -d pIncAC -G -t 0
get_homologues.pl -d pIncAC -G -t 0  # COGtriangles, computing clusters of all sizes (-t 0)
                                     # finds 408 clusters in 14 wallclock secs, as it recycles
                                     # the blast results of the previous run.
                                     
                                     
# version 26022020
# results_directory=/home/vinuesa/pangenomics/pIncAC_homologues
# parameters: MAXEVALUEBLASTSEARCH=0.01 MAXPFAMSEQS=250 BATCHSIZE=100 KEEPSCNDHSPS=1
# diamond job:0

# checking input files...
# Salmonella_enterica_33676_pIncAC.gbk 185
# Salmonella_enterica_YU39_pIncAC.gbk 178
# Salmonella_enterica_pAM04528.gbk 180
# Salmonella_enterica_sv_Kentucky.gbk 164
# _Aeromonas_hydrophila_plasmid_pRA1_NC_012885.gbk 158
# _Escherichia_coli_UMNK88_plasmid_pUMNK88_NC_017645.gbk 173
# _Escherichia_coli_plasmid_pAPEC1990_61_NC_019066.gbk 172
# _Escherichia_coli_plasmid_pAR060302_NC_012692.gbk 189
# _Escherichia_coli_plasmid_pNDM-1_Dok01_NC_018994.gbk 224
# _Escherichia_coli_plasmid_pPG010208_NC_019065.gbk 148
# _Escherichia_coli_plasmid_peH4H_NC_012690.gbk 173
# _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk 137

# 12 genomes, 2081 sequences

# taxa considered = 12 sequences = 2081 residues = 545474 MIN_BITSCORE_SIM = 16.9

# mask=KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_ (_algCOG)

# skipped genome parsing (pIncAC_homologues/tmp/selected.genomes)


# running BLAST searches ...
# done


# parsing file (COG)
# parsing file (COG) finished

# creating indexes, this might take some time (lines=2.25e+04) ...

# construct_taxa_indexes: number of taxa found = 12
# number of file addresses/BLAST queries = 2.1e+03

# clustering orthologous sequences
# checking lineage-specific expansions
# making COGs
# prunning COGs
# done

# add_unmatched_singletons : 0 sequences, 0 taxa


# looking for valid ORF clusters (n_of_taxa=0)...


# number_of_clusters = 408
# cluster_list = pIncAC_homologues/KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_.cluster_list
# cluster_directory = pIncAC_homologues/KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_

# runtime: 15 wallclock secs ( 0.81 usr  0.14 sys +  2.46 cusr  0.58 csys =  3.99 CPU)
# RAM use: 28.3 MB                             

3.2.4 Clustering sequences with the OMCL algorithm (\(-M\)) and compute pan-genome composition (\(-c\))

Run with \(-t\ 0\) to compute clusters of all sizes, which is required for the pan-genome analysis

Option \(-c\) for generating pan-genome composition tables to be used for plotting the core decay and pan-genome growth curveas as a function of increasing number of genomes.

  • get_homologues.pl -d pIncAC/ -M -t 0 -c
get_homologues.pl -d pIncAC/ -M -t 0 -c     # OMCL, finds 393 clusters in 5 wallclock secs.

# ...
# identifying inparalogs in _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk
# 0 sequences

# running MCL (inflation=1.5) ...
# running MCL finished

# find_OMCL_clusters: parsing clusters (/home/vinuesa/pangenomics/pIncAC_homologues/tmp/all_ortho.mcl)


# adding _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk: core=32 pan=336

# pan-genome (number of genes, can be plotted with plot_pancore_matrix.pl)
# file=pIncAC_homologues/pan_genome_algOMCL.tab
genomes mean    stddev  |   samples
0   167 22  |   172 170 183 137 164 217 170 169 137 154 
1   215 22  |   215 206 184 203 239 262 229 208 200 201 
2   242 15  |   246 245 236 247 258 269 242 243 216 220 
3   273 22  |   274 281 292 282 301 288 270 262 217 263 
4   291 21  |   295 322 294 288 301 312 307 268 251 268 
5   307 15  |   300 329 302 307 303 317 315 315 270 309 
6   316 17  |   302 335 307 307 303 336 334 323 282 330 
7   317 18  |   303 336 307 312 303 340 335 324 282 331 
8   324 17  |   340 336 317 320 309 341 335 324 282 331 
9   330 10  |   343 337 336 321 312 341 335 331 314 331 
10  334 9   |   343 345 336 321 331 341 343 332 317 331 
11  341 4   |   348 346 339 340 338 341 344 340 335 336 

# core-genome (number of genes, can be plotted with plot_pancore_matrix.pl)
# file=pIncAC_homologues/core_genome_algOMCL.tab
genomes mean    stddev  |   samples
0   167 22  |   172 170 183 137 164 217 170 169 137 154 
1   112 22  |   123 106 165 97  134 116 89  90  94  109 
2   93  11  |   90  83  105 94  117 81  83  88  92  95  
3   75  8   |   74  62  83  81  74  74  62  72  91  75  
4   65  13  |   55  58  78  57  72  53  43  69  89  74  
5   57  13  |   53  57  75  49  69  53  35  42  69  72  
6   49  10  |   53  42  54  49  61  41  35  34  66  51  
7   44  8   |   53  35  46  43  61  40  35  34  41  51  
8   42  6   |   51  34  44  41  45  40  35  34  40  51  
9   38  5   |   50  34  33  41  40  39  34  34  35  43  
10  36  4   |   42  32  33  40  32  39  32  34  34  42  
11  32  0   |   32  32  32  32  32  32  32  32  32  32  


# add_unmatched_singletons : 117 sequences, 12 taxa


# looking for valid ORF clusters (n_of_taxa=0)...


# number_of_clusters = 393
# cluster_list = pIncAC_homologues/KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_.cluster_list
# cluster_directory = pIncAC_homologues/KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_

# runtime:  3 wallclock secs ( 1.82 usr  0.13 sys +  1.79 cusr  0.26 csys =  4.00 CPU)
# RAM use: 30.3 MB

3.2.4.1 Plotting core- and pan-genome decay/growth curves with plot_pancore_matrix.pl

Remember that for the OMCL clustering we ran the following code:

get_homologues.pl -d pIncAC/ -M -t 0 -c

The \(-c\) flag tells get_homologues.pl to compute the core- and pan-genome composition data by randomly sampling 10 times pairs of genomes to compute the observed numbers of common and new genes. The results are written to the core_genome_algOMCL.tab and pan_genome_algOMCL.tab files, found in the pIncAC_homologues results directory. These tables are used as input for plot_pancore_matrix.pl, which renders the graphics and fits different functions to the data.

  • The plot_pancore_matrix.pl options
    • plot_pancore_matrix.pl -h
[options]: 
-h   this message
-i   input .tab data file                   (required, generated by get_homologues.pl)
-f   type of fit [pan|core_Tettelin|core_Willenbrock|core_both]
                                            (default: core_Tettelin, PubMed:16172379,18088402)
-F   font scale for fitted formulas         (optional, default=0.8)
-a   save snapshots for animations in dir   (optional, example: -a animation)
  • Let us cd into pIncAC_homologues/, locate the core_genome_algOMCL.tab, and pan_genome_algOMCL.tab files, and compute the graphs with plot_pancore_matrix.pl
cd pIncAC_homologues/

# save the absolute path to this directory in the blast_dir variable
blast_dir=$(pwd)

# these are the relevant input files: core_genome_algOMCL.tab and pan_genome_algOMCL.tab
ls *.tab

# plot core-decay curves (Tettelin & Willenbrock fits)
plot_pancore_matrix.pl -i core_genome_algOMCL.tab -f core_both 

# plot pan-genome growth curves (Tettelin's exponential fit)
plot_pancore_matrix.pl -i pan_genome_algOMCL.tab -f pan

# if you logged in with ssh -X, you may use eog (Eye of GNOME) or firefox to display the png files
#  Note: the & at the end of the command runs it in the background
eog core_genome_algOMCL.tab_core_both.png &
eog pan_genome_algOMCL.tab_pan.png &

Figure 3. The core-genome decay curves for 12 pIncA/C plasmids

Figure 4. The pan-genome growth curve for 12 pIncA/C plasmids

Based on the output shown above, what type of pan-genome (closed or open) do you think best represents this collection of pIncA/C plasmids?

3.3 Configuring a batch script to run multiple get_homologues.pl analyses in non-interactive mode for large datasets

Before exploring the contents of the get_homologues.pl runs stored in the [pIncAC]_homologues directory, a brief note on how to setup multiple sequential get_homologues.pl runs in non-interactive mode. This is the best option to run large datasets which may take hours or days to finish.

This is easy. just type the get_homologues.pl invocations that you would like to run into a text file (\(script\)), which i have named run_get_hom.batch holding the code shown below:

  • Contents of the run_get_hom.batch \(script\)
    get_homologues.pl -d pIncAC -t 12 -e -n 2 &> log.BDBH
    get_homologues.pl -d pIncAC -G -t 0 &> log.COG
    get_homologues.pl -d pIncAC/ -M -t 0 -c &> log.OMCL
    

The run_get_hom.batch \(script\) would run exactly the same analyses that we have run interactively, but in a non-interactive mode.

Note that the output that each get_homologues.pl invocation would normally print to STDOUT (the screen) is redirected to a file called log.[ALGORITHM], making use of the “&>” syntax which instructs to redirect STDOUT and STDERR to the file indicated after the symbols.

3.3.1 Running a batch command \(script\) with \(nohup\) in the background

Now we can call the batch script in the following manner:

  • nohup bash run_get_hom.batch &> /dev/null &

The \(nohup\) (no hang-up) command combined with the strange-looking “&> /dev/null &” tells the server to run the script in the background and to make the run independent of the login session. This means that after issuing the command, the script runs detached from our terminal. We can shutdown our computer and the process will continue executing on the server.

After some time, we can check the results of the runs by listing the files and directories as shown:
ls -1
log.BDBH
log.COG
log.OMCL
pIncAC
pIncAC_homologues
run_get_hom.batch

Another useful command to know is tail -f log.BDBH, which can be called on any of the logfiles while they are being written to disk, showing their last 10 lines as they’re appended to the tail of the file in real time. Nice!

3.4 Explore the results of get_homologues.pl runs stored in the [pIncAC]_homologues directory

As previously mentioned, GET_HOMOLOGUES writes the results in the **[pIncAC]_homologues/ directory**, stored in the \(\$blast\_dir\) variable.

Therein you will find the following files and directories

  • Regular files
    • genome-specific CDSs: .gbk.fna
    • genome-specific proteomes: .gbk.faa
    • genome-specific proteomes wit simplified header for blastp analyses: .gbk.fasta
    • genome-specific blastp databases: .gbk.fasta.phr .gbk.fasta.pin .gbk.fasta.psq
    • pair-wise (all-against-all) blastp results as gzip compressed files: genome1.gbk.fasta_genome2.gbk.fasta.blast.gz
    • pan-genome composition tables (when using option \(-c\): core_genome_algOMCL.tab pan_genome_algOMCL.tab
  • Directories
    • tmp: which holds diverse log-files, combined and sorted blast-tables, genome indices …
    • run specific directories like KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_, holding the core and|or pan-genome clusters at the DNA (.fna) and protein (.faa) levels

The run-specific output directories are named using the following components to make them unique for each run configuration:

BuchaphCc_f0_alltaxa_algBDBH_e0_
|         |  |       |       |   
|         |  |       |       -e option was not used (inparalogues are in)
|         |  |       the clustering algorithm is BDBD (default)
|         |  all clusters contain at least 1 sequence from each taxa (default -t behavior)
|         -f option not used (no length filtering)
reference proteome

Let’s explore the results directory pIncAC_homologues/ using basic \(Shell\) commands:

# assumes that we are in $blast_dir (the results directory pIncAC_homologues/)
cd $blast_dir

# list
ls | less

ls *faa *fna *fasta
ls *faa *fna *fasta | wc

head -5 Salmonella_enterica_YU39_pIncAC.gbk.f*a

ls Salmonella_enterica_YU39_pIncAC.gbk.f*a.*

# find the subdirectories holding the run-specific clusters of homologous sequences (gene families)
find . -type d

# exlore results (clusters) in the KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_
#   which hods the BDBH results for the 12 genomes (-t 12), without inparalogues (-e)
ls

ls *fna | wc -l
ls *faa | wc -l

grep -c '>' *fna
grep -c '>' *faa

3.5 Computing a consensus core-genome from the BDBH, COGtriangles and OMCL clustering solutions using compare_clusters.pl

GET_HOMOLOGUES is unique among available pangenomics software in many respects. One of them is its capacity to compute consensus core-genome clusters from the solutions generated by the three clustering algorithms it implements, and consensus pan-genome clusters from COGtriangles and OMCL clustering results, as demonstrated below.

  • The compare_clusters.pl options
    • compare_clusters.pl -h
[options]: 
-h   this message
-d   comma-separated names of cluster directories                  (min 1 required, example -d dir1,dir2)
-o   output directory                                              (required, intersection cluster files are copied there)
-n   use nucleotide sequence .fna clusters                         (optional, uses .faa protein sequences by default)
-r   take first cluster dir as reference set, which might contain  (optional, by default cluster dirs are expected
     a single representative sequence per cluster                   to be derived from the same taxa; overrides -t,-I)
-s   use only clusters with syntenic genes                         (optional, parses neighbours in FASTA headers)
-t   use only clusters with single-copy orthologues from taxa >= t (optional, default takes all intersection clusters; example -t 10)
-I   produce clusters with single-copy seqs from ALL taxa in file  (optional, example -I include_list; overrides -t)
-m   produce intersection pangenome matrices                       (optional, ideally expects cluster directories generated
                                                                    with get_homologues.pl -t 0)
-x   produce cluster report in OrthoXML format                     (optional)
-T   produce parsimony-based pangenomic tree
  • Compute consensus core-genome clusters using compare_clusters.pl
# Compute consensus core-genome clusters using compare_clusters.pl of the GET_HOMOLOGUES suite.
# Note that we will run the compare_clusters.pl script twice, once with the -n flag 
# (to compute the consesus clusters at the DNA level, producing *.fna files)
# and a second instance without the flag, to get the protein clusters (*.faa files)

# assumes we are in $blast_dir
cd $blast_dir

# find the subdirectories holding the run-specific results
find . -type d

# Compute the core genome; use -t 12 to keep only clusters containing a copy of the gene in each of the 12 genomes analyzed
compare_clusters.pl -d ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_,\
./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_,\
./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_ -o core_BCM -t 12 -n

# repeat command, without -n, to compute the clusters at the protein level (*faa files)
compare_clusters.pl -d ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_,\
./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_,\
./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_ -o core_BCM -t 12

# Lets explore the core_BCM directory
cd core_BCM

ls *fna | wc -l
ls *faa | wc -l

grep -c '>' *faa


# confirm that all 31 clusters contain only one sequence from each plasmid/proteome
grep '>' *faa | cut -d\| -f2,3 | sort | uniq -c

# display the Venn diagram of the consensus core-genome clusters
ls *svg

eog venn_t12.svg &

Figure 5. Venn diagram depicting the consensus core-genome of 31 genes for the 12 input pIncA/C sequences.

3.6 Computing a consensus pan-genome with GET_HOMOLOGUES from the COGtriangles and OMCL clusters and visualization of the pangenome’s structure with phylogenies and graphs

To compute a consensus pan-genome, only the \(COGtriangles\) and \(OMCL\) algorithms should be used! as they don’t require a reference sequence.

Run compare_clusters.pl with the \(-t\ 0\) option, to consider clusters of all occupancy sizes (from singletons to multigene families), as demonstrated below.

# Compute consensus pan-genome clusters and matrix using compare_clusters.pl of the 
# GET_HOMOLOGUES suite.
# Note that we run the compare_clusters.pl script twice, once with the -n and -m flags 
# (to compute the consesus clusters at the DNA level, *.fna files
# and the pan-genome matrix) and a second instance without these flags, 
# to get the protein clusters (*.faa files). 
# Note also that we exclude the directory holding the BDBH clusters, 
# as these are not suitable to compute a proper pan-genome matrix, since the 
# BDBH clusters always contain the reference sequence, missing those that do not contain it.

cd $blast_dir

# get the consensus COG-OMCL pan-genome clusters with -t 0 (all sizes) and -m to write pan-genome matrix
compare_clusters.pl -d ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_,\
./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_ -o pan_CM -m -n -T

# repeat without -m and -n, to generate the protein clustes
compare_clusters.pl -d ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_,\
./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_ -o pan_CM

# Explore the pan-genome clusters
cd pan_CM

# how many clusters
ls *faa | wc -l

# print a sorted list of cluster sizes (gene contents of each cluster)
grep -c '>' *faa | sed 's/:/\t/' | sort -nk2
grep -c '>' *faa | sed 's/:/\t/' | sort -nrk2 > pan-genome_cluster_compositon_stats.tsv

# lets explore the structure of the pan-genome
grep -c '>' *faa | sed 's/:/\t/' | cut -f2 | sort -n | uniq -c | awk 'BEGIN{OFS="\t"; print "size\tcount"}{print $2,$1}'


# Lets explore the structure of the pan-genome parsimony tree pangenome_matrix_t0.phylip.ph

figtree pangenome_matrix_t0.phylip.ph &



                    /---+  Escherichia coli plasmid pAPEC1990 61 NC 019066 gbk...                             
                 /--+                                                                                         
                 |  \+  Escherichia coli UMNK88 plasmid pUMNK88 NC 017645 gbk...                              
                 |                                                                                            
                /+  /---+  Escherichia coli plasmid pPG010208 NC 019065 gbk...                                
                ||  |                                                                                         
                |\--+      /------------+  Aeromonas hydrophila plasmid pRA1 NC 012885 gbk...                 
                |   |      |                                                                                  
                |   \------+        /------------+  Klebsiella pneumoniae plasmid pNDM KN NC 019153 gbk...    
                |          |  /-----+                                                                         
 /--------------+          \--+     \--------------+  Escherichia coli plasmid pNDM 1 Dok01 NC 018994 gbk...  
 |              |             |                                                                               
 |              |             \-------+ Salmonella enterica sv Kentucky gbk...                                
 |              |                                                                                             
 |              |     /---------+  Escherichia coli plasmid peH4H NC 012690 gbk...                            
 |              |  /--+                                                                                       
=+              \--+  \  Escherichia coli plasmid pAR060302 NC 012692 gbk...                                  
 |                 |                                                                                          
 |                 \+ Salmonella enterica pAM04528 gbk...                                                     
 |                                                                                                            
 +-------+ Salmonella enterica YU39 pIncAC gbk...                                                             
 |                                                                                                            
 \--------------+ Salmonella enterica 33676 pIncAC gbk...                                                     
                                                                                                              
 |------|------|-----|------|------|------|-----|---                                                          
 0     25     50    75    100    125    150   175                                                             
 substitutions/site



# display the Venn diagram of the consensus pan-genome clusters
ls *svg
eog venn_t0.svg &

Figure 6. Venn diagram depicting the consensus pan-genome clusters for the 12 input pIncA/C sequences contains 373 cluster. The corresponding pan-genome matrix is provided in the test_sequences/pan_genome/ directory

3.7 Graphical displays of the core and pan-genome structure with the auxiliary \(script\) parse_pangenome_matrix.pl

Graphical analysis of pangenome’s structure and parsing pan-genome tables to identify lineage-specific genes or family expansions can be conveniently done with the auxiliary \(script\) parse_pangenome_matrix.pl, as shown below.

  • parse_pangenome_matrix.pl options
    • parse_pangenome_matrix.pl -h
parse_pangenome_matrix.pl 

[options]: 
-h   this message
-m   input pangenome matrix .tab                               (required, made by compare_clusters.pl)
-s   report cloud,shell,soft core and core clusters            (optional, creates plot if R is installed)
-l   list taxa names present in clusters reported in -m matrix (optional, recommended before using -p option)
-x   produce matrix of intersection pangenome clusters         (optional, requires -s)
-I   use only taxon names included in file                     (optional, ignores -A,-g,-e)
-A   file with taxon names (.faa,.gbk,.nucl files) of group A  (optional, example -A clade_list_pathogenic.txt)
-B   file with taxon names (.faa,.gbk,.nucl files) of group B  (optional, example -B clade_list_symbiotic.txt)
-a   find genes/clusters which are absent in B                 (optional, requires -B)
-g   find genes/clusters present in A which are absent in B    (optional, requires -A & -B)
-e   find gene family expansions in A with respect to B        (optional, requires -A & -B)
-S   skip clusters with occupancy <S                           (optional, requires -x/-a/-g, example -S 2)
-P   percentage of genomes that must comply presence/absence   (optional, default=100, requires -g)
-p   plot pangenes on the genome map of this group A taxon     (optional, example -p 'Escherichia coli K12',
                                                                requires -g, -A & -B, only works with clusters
                                                                derived from input files in GenBank format)

3.7.1 Plotting cloud, shell, soft core and core clusters with parse_pangenome_matrix.pl (-m pangenome_matrix_t0.tab -s)

Use option -s to report and graphically depict cloud, shell, soft core and core clusters

  • parse_pangenome_matrix.pl -m pangenome_matrix_t0.tab -s
parse_pangenome_matrix.pl -m pangenome_matrix_t0.tab -s

# /export/apps/get_homologues/parse_pangenome_matrix.pl -m pangenome_matrix_t0.tab -I  -A  -B  -a 0 -g 0 -e 0 -p  -s 1 -l 0 -x 0 -P 100 -S 0

# matrix contains 373 clusters and 12 taxa

# cloud size: 183 list: pangenome_matrix_t0__cloud_list.txt
# shell size: 130 list: pangenome_matrix_t0__shell_list.txt
# soft core size: 60 list: pangenome_matrix_t0__softcore_list.txt
# core size: 31 (included in soft core) list: pangenome_matrix_t0__core_list.txt

# using default colors, defined in %COLORS

# globals controlling R plots: $YLIMRATIO=1.2

# shell bar plots: pangenome_matrix_t0__shell.png , pangenome_matrix_t0__shell.pdf
# shell circle plots: pangenome_matrix_t0__shell_circle.png , pangenome_matrix_t0__shell_circle.pdf

# pan-genome size estimates (Snipen mixture model PMID:19691844): pangenome_matrix_t0__shell_estimates.tab
Core.size Pan.size BIC LogLikelihood
2 components 31 375 2897.63753894041 -1439.93640184074
3 components 24 436 1648.86318314942 -809.627645525601
4 components 21 507 1626.68849010487 -792.618720583683  <== best BIC VALUE!
5 components 18 508 1638.26140215658 -792.483598189895
6 components 0 508 1649.91033432187 -792.386485852895
7 components 6 508 1661.76417317612 -792.391826860376
8 components 6 508 1673.59937473543 -792.387849220388
9 components 10 507 1685.47784762243 -792.405507244245
10 components 20 505 1697.57598545494 -792.532997740852
Sample 31 373 NA NA


# occupancy stats:
    cloud   shell   soft_core   core
Salmonella_enterica_33676_pIncAC.gbk    52  46  59  31
Salmonella_enterica_YU39_pIncAC.gbk 43  56  60  31
Salmonella_enterica_pAM04528.gbk    5   92  59  31
Salmonella_enterica_sv_Kentucky.gbk 15  78  58  31
_Aeromonas_hydrophila_plasmid_pRA1_NC_012885.gbk    27  68  53  31
_Escherichia_coli_UMNK88_plasmid_pUMNK88_NC_017645.gbk  5   90  60  31
_Escherichia_coli_plasmid_pAPEC1990_61_NC_019066.gbk    4   88  60  31
_Escherichia_coli_plasmid_pAR060302_NC_012692.gbk   3   105 60  31
_Escherichia_coli_plasmid_pNDM-1_Dok01_NC_018994.gbk    51  98  60  31
_Escherichia_coli_plasmid_pPG010208_NC_019065.gbk   9   70  59  31
_Escherichia_coli_plasmid_peH4H_NC_012690.gbk   4   86  53  31
_Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk    28  51  50  31


# The command produces also lists of genes in each pan-genome "compartment"
ls *list.txt


# >>>>>> Textual exploration of the pangenome_matrix_t0__*_list.txt
#  inspect the pangenome_matrix_t0__*_list.txt for the presence plasmid replication and mobilization genes
grep -Ei 'mob|tra|rep' pangenome_matrix_t0*txt | egrep -v 'transpo|trans' 

# 9.5 inspect the pangenome_matrix_t0__*_list.txt for the presence of 
# beta-lactamase or tetracycline resistance genes
grep -Ei 'bla|NDM|tet|efflux|resist|aminogly|sul|aad|RND' *txt


# 10. Display the barplot depicting the pan-genome structure of 12 pIncA/C plasmids
eog pangenome_matrix_t0__shell.png &

Figure 7. Barplot of the pan-genome structure of 12 pIncA/C plasmids

3.8 Identifying lineage-specific genes using parse_pangenome_matrix.pl

The final GET_HOMOLOGUES feature I’m gonna show is how to parse the pan-genome matrix to identify lineage-specific genes and family expansions.

As shown below on the pan-genome parsimony tree we computed previously with compare_clusters.pl -T, there are two plasmids labeled as pNDM. These contain the “New Dehli metallo-betaclactamase” NDM1, which confers resistance to essentially all beta-lactams, including carbepenems, except aztreonam.


                    /---+  Escherichia coli plasmid pAPEC1990 61 NC 019066 gbk...                             
                 /--+                                                                                         
                 |  \+  Escherichia coli UMNK88 plasmid pUMNK88 NC 017645 gbk...                              
                 |                                                                                            
                /+  /---+  Escherichia coli plasmid pPG010208 NC 019065 gbk...                                
                ||  |                                                                                         
                |\--+      /------------+  Aeromonas hydrophila plasmid pRA1 NC 012885 gbk...                 
                |   |      |                                                                                  
                |   \------+        /------------+  Klebsiella pneumoniae plasmid pNDM KN NC 019153 gbk...   <== 
                |          |  /-----+                                                                         
 /--------------+          \--+     \--------------+  Escherichia coli plasmid pNDM 1 Dok01 NC 018994 gbk... <==
 |              |             |                                                                               
 |              |             \-------+ Salmonella enterica sv Kentucky gbk...                                
 |              |                                                                                             
 |              |     /---------+  Escherichia coli plasmid peH4H NC 012690 gbk...                            
 |              |  /--+                                                                                       
=+              \--+  \  Escherichia coli plasmid pAR060302 NC 012692 gbk...                                  
 |                 |                                                                                          
 |                 \+ Salmonella enterica pAM04528 gbk...                                                     
 |                                                                                                            
 +-------+ Salmonella enterica YU39 pIncAC gbk...                                                             
 |                                                                                                            
 \--------------+ Salmonella enterica 33676 pIncAC gbk...                                                     
                                                                                                              
 |------|------|-----|------|------|------|-----|---                                                          
 0     25     50    75    100    125    150   175                                                             
 substitutions/site

As seen on the tree, the two pNDMs form a monophyletic group.

Let’s now find out what unique genes are carried specifically by these two plasmids.

# Detection of lineage-specific genes using parse_pangenome_matrix.pl
#    assumes that you are still in /home/vinuesa/pangenomics/pIncAC_homologues/pan_CM

# find the clusters containing the NDM-1 genes
grep -l blaNDM-1 *faa

# find the plasmids that hold these genes containing the NDM-1 genes
for f in $(grep -l blaNDM-1 *faa); do echo "### $f"; grep '>' $f | cut -d\| -f4; done


# generate the lists of genomes to be compared for lineage specific genes (in list A vs. B)
panGmat_dir=$(pwd)
cd ${gbk_dir}
ls *gbk | grep pNDM > listA_pNDB
ls *gbk | grep -v pNDM > listB_nonNDB
cd $panGmat_dir

# >>> parse the pangenome matrix file to find the listA-specific genes
parse_pangenome_matrix.pl -A $gbk_dir/listA_pNDB -B $gbk_dir/listB_nonNDB -g -m pangenome_matrix_t0.tab


# /export/apps/get_homologues/parse_pangenome_matrix.pl -m pangenome_matrix_t0.tab -I  -A /home/vinuesa/pangenomics/pIncAC/listA_pNDB -B $/pangenomics/pIncAC/listB_nonNDB -a 0 -g 1 -e 0 -p  -s 0 -l 0 -x 0 -P 100 -S 0

# matrix contains 373 clusters and 12 taxa

# taxa included in group A = 2

# taxa included in group B = 10


# finding genes present in A which are absent in B ...
# file with genes present in set A and absent in B (19): pangenome_matrix_t0__pangenes_list.txt


# explore the file with genes present in set A and absent in B (19): pangenome_matrix_t0__pangenes_list.txt
#  i.e. present only in the NDM-1 expressing plasmids
cat pangenome_matrix_t0__pangenes_list.txt

# /export/apps/get_homologues/parse_pangenome_matrix.pl -m pangenome_matrix_t0.tab -A /home/vinuesa/pangenomics/pIncAC/listA_pNDB -B /home/vinuesa/pangenomics/pIncAC/listB_nonNDB -g 1 -e 0 -p  -P 100 -S 0
# genes present in set A and absent in B (19):
2020_hypothetical_protein.fna
2021_type_I_restriction-m...fna
2022_anticodon_nuclease.fna
2023_restriction_modifica...fna
2024_ATP-dependent_DNA_he...fna
2025_DNA_replication_and_...fna
2026_type_I_site-specific...fna
2027_Tn7-like_transpositi...fna
2028_Tn7-like_transpositi...fna
2029_Tn7-like_transpositi...fna
2030_Tn7-like_transpositi...fna
2033_truncated_Rhs_family...fna
2036_GroES.fna
2037_CutA1.fna
2038_oxidoreductase_domai...fna
2039_phosphoribosylanthra...fna
2040_bleMBL.fna
2041_NDM-1.fna
2043_RmtB.fna

3.8.1 Mapping the pNDM-specific genes on a plasmid map

As seen from the output above, the cluster numbers are consecutive, indicating that the pNDM genes are part of a larger gene cluster. This can be visually confirmed by mapping the specific genes on the map of one of the “listA” genomes

  • The parse_pangenome_matrix.pl options, again
    • parse_pangenome_matrix.pl
[options]: 
-h   this message
-m   input pangenome matrix .tab                               (required, made by compare_clusters.pl)
-s   report cloud,shell,soft core and core clusters            (optional, creates plot if R is installed)
-l   list taxa names present in clusters reported in -m matrix (optional, recommended before using -p option)
-x   produce matrix of intersection pangenome clusters         (optional, requires -s)
-I   use only taxon names included in file                     (optional, ignores -A,-g,-e)
-A   file with taxon names (.faa,.gbk,.nucl files) of group A  (optional, example -A clade_list_pathogenic.txt)
-B   file with taxon names (.faa,.gbk,.nucl files) of group B  (optional, example -B clade_list_symbiotic.txt)
-a   find genes/clusters which are absent in B                 (optional, requires -B)
-g   find genes/clusters present in A which are absent in B    (optional, requires -A & -B)
-e   find gene family expansions in A with respect to B        (optional, requires -A & -B)
-S   skip clusters with occupancy <S                           (optional, requires -x/-a/-g, example -S 2)
-P   percentage of genomes that must comply presence/absence   (optional, default=100, requires -g)
-p   plot pangenes on the genome map of this group A taxon     (optional, example -p 'Escherichia coli K12',
                                                                requires -g, -A & -B, only works with clusters
                                                                derived from input files in GenBank format)

Let’s use option \(-p\) to plot pangenes on the genome map of _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk, a representative of “group A” genomes saved in listA_pNDB.

  • parse_pangenome_matrix.pl -A $gbk_dir/listA_pNDB -B $gbk_dir/listB_nonNDB -g -m pangenome_matrix_t0.tab -p _Klebsiella_pneumoniae_plasmid
#  plot the positions of listA-specific genes on one of the genomes of this group
#      use option -l to list taxa names present in clusters reported in -m matrix
parse_pangenome_matrix.pl -l -m pangenome_matrix_t0.tab
parse_pangenome_matrix.pl -A $gbk_dir/listA_pNDB -B $gbk_dir/listB_nonNDB -g -m pangenome_matrix_t0.tab -p "Klebsiella pneumoniae"

# visualize the mapping of the pNDM-specific genes on _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk, as a representative of "group A"
eog pangenome_matrix_t0_Klebsiellapneumoniae_NC_019153.png &
  • Figure 8. Mapping the pNDM-specific genes on _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk, as a representative of “group A”

After localizing the genomic region of interest, we can plot it with add hoc tools like snapgene-viewer

  • the full pNDM-KN_NC_019153 map

  • The unique region, spanning region between nucleotides 79,000 .. 111,500 of pNDM-KN_NC_019153

  • and a zoom-in into the blaNDM-1 region

3.9 Deactivate the get_homologues conda environment

Before leaving, run:

# Deactivate the get_homologues conda environment
conda deactivate

to deactivate the get_homologues conda environment.

That’s it, cheers.


4 Estimating core- and pan-genome maximum-likelihood phylogenies with the GET_PHYLOMARKERS suite

GET_PHYLOMARKERS can be used to estimate maximum-likelihood phylogenies from core- and pan-genome datasets computed by GET_HOMOLOGUES.

The main usage of the GET_PHYLOMARKERS pipeline is to estimate core-genome phylogenies with selected optimal markers.

To do so, it implements a series of sequential filters (Fig. 8 and explained below) to identify markers with optimal attributes for phylogenomic inference, that satisfy the following criteria:

  1. Codon (gene) alignments without significant signals of homologous recombination (intragene mosaicism)
  2. Gene trees without anomalous branch-lengths or topologies, as expected under the multispecies-coalescent null model
  3. Compute bipartition support values under best-fitting substitution models to eliminate poorly-resolved trees




4.1 GET_PHYLOMARKERS usage

The GET_PHYLOMARKERS package implements many analysis possibilities. Here we will review only some of them. Visit the GET_PHYLOMARKERS manual

4.1.1 Usage synopsis

  1. The pipeline (Fig. 8) is run by executing the main script run_get_phylomarkers_pipeline.sh inside a folder containing twin *.fna and *.faa FASTA files for orthologous single-copy CDSs and translation products (Fig. 8).
  2. There are two runmodes: -R 1 (for phylogenetics) and -R 2 (for population genetics).
  3. The pipeline can be run on two molecular types: DNA or protein sequences (-t DNA|PROT). The latter is intended for the analysis of more divergent genome sequences, typically above the genus level.
  4. From version 2.0 onward, GET_PHYLOMARKERS uses either the FastTree (FT) or IQ-TREE (IQT) fast ML tree search algorithms, controlled with the -A [F|I] option [default: I], respectively.
  5. As of version 2.0 (January 22, 2018), GET_PHYLOMARKERS uses IQ-TREE version 1.6.1 (released Dec. 28th, 2017) which implements a fast search option, which almost matches the speed of FastTree, but retaining the accuracy of IQ-TREE 1.5.* Model-selection is performed with the -fast flag to for maximal speed both for gene- and species tree searches.
  6. The global molecular-clock hypothesis can be evaluated for DNA (codon) alignments (-R 1 -t DNA -K). It is not yet implemented for protein sequences.
  7. GET_PHYLOMARKERS can compute basic descriptive statistics of population DNA polymorphisms and neutrality tests when run in population genetics mode (-R 2).

4.1.2 GET_PHYLOMARKERS \(scripts\)

  • Shell scripts
    • estimate_pangenome_phylogenies.sh
    • run_get_phylomarkers_pipeline.sh
    • run_parallel_molecClock_test_with_paup.sh
    • run_pexec_cmmds.sh
  • R scripts
    • compute_suppValStas_and_RF-dist.R
    • install_R_deps.R
    • run_kdetrees.R
  • Perl scripts
    • add_labels2tree.pl
    • add_nos2fasta_header.pl
    • concat_alignments.pl
    • convert_aln_format_batch_bp.pl
    • pal2nal.pl
    • popGen_summStats.pl
    • remove_uninformative_sites_from_aln.pl
    • rename.pl
    • run_parallel_cmmds.pl

4.1.3 Options and parameters to control de behaviour of estimate_pangenome_phylogenies.sh

estimate_pangenome_phylogenies.sh is the core \(script\) of the GET_PHYLOMARKERS distribution. It implements many sophisticated phylogenetic analysis features using state-of-the-art methods and software. It is highly configurable, as reflected in the many options that an advanced user can set to fine-tune its behaviour.

  • run_get_phylomarkers_pipeline.sh options can be displayed with any of the following equivalent commands
    • run_get_phylomarkers_pipeline.sh
    • run_get_phylomarkers_pipeline.sh -h
   run_get_phylomarkers_pipeline.sh v.2.2.8.1_16Jul2019 OPTIONS:

   REQUIRED:
    -R <integer> RUNMODE
       1 select optimal markers for phylogenetics/phylogenomics (genomes from different species)
       2 select optimal markers for population genetics (genomes from the same species)
    -t <string> type of input sequences: DNA|PROT

   OPTIONAL:
     -h flag, print this short help notes
     -H flag, print extensive Help and details about search modes and models
     -A <string> Algorithm for tree searching: <F|I> [FastTree|IQ-TREE]                            [default:I]
     -c <integer> NCBI codontable number (1-23) for pal2nal.pl to generate codon alignment         [default:11]
     -C flag to print codontables
     -D flag to print debugging messages; please use if you encounter problems executing the code  [default: 0]
     -e <integer> select gene trees with at least (min. = 4) external branches  [default: 4]
     -f <string> GET_HOMOLOGUES cluster format <STD|EST>                                           [default: STD]
     -k <real> kde stringency (0.7-1.6 are reasonable values; less is more stringent)              [default: 1.5]
     -K flag to run molecular clock test on codon alignments                                       [default: 0]
     -l <integer> max. spr length (7-12 are reasonable values)                                     [default: 10]
     -m <real> min. average support value (0.7-0.8 are reasonable values) for trees to be selected [default: 0.7]
     -M <string> base Model for clock test (use one of: GTR|TrN|HKY|K2P|F81); uses +G in all cases [default: GTR]
     -n <integer> number of cores/threads to use                                                   [default: all cores]
     -N <integer> number of IQ-TREE searches to run [only active with -T high]                     [default: 5]
     -q <real> quantile (0.95|0.99) of Chi-square distribution for computing molec. clock p-value  [default: 0.99]
     -r <string> root method (midpoint|outgroup)                                                   [default: midpoint]
     -s <integer> number of spr rounds (4-20 are reasonable values) for FastTree tree searching    [default: 4]
     -S <string> quoted 'comma-separated list' of base models to be evaluated by IQ-TREE 
                 when estimating the species tree from the concatenated supermatrix  (see -H for details). 
         If no -S is passed, then sinlge default models are used, as shown below
              <'JC,F81,K2P,HKY,TrN,TNe,K3P,K81u,TPM2,TPM2u,TPM3,TPM3u,
          TIM,TIMe,TIM2,TIM2e,TIM3,TIM3e,TVM,TVMe,SYM,GTR'>              for DNA alignments    [default: GTR]
              <'BLOSUM62,cpREV,Dayhoff,DCMut,FLU,HIVb,HIVw,JTT,JTTDCMut,LG,
                mtART,mtMAM,mtREV,mtZOA,Poisson,PMB,rtREV,VT,WAG'>           for PROT alignments   [default: LG]
     -T <string> tree search Thoroughness: high|medium|low|lowest (see -H for details)             [default: medium]
     -v flag, print version and exit
     -V flag, print software versions

   INVOCATION EXAMPLES:
     1. default on DNA sequences (uses IQ-TREE evaluating a subset of models specified in the detailed help)
          run_get_phylomarkers_pipeline.sh -R 1 -t DNA
     2. thorough FastTree searching and molecular clock analysis on DNA sequences using 10 cores:
          run_get_phylomarkers_pipeline.sh -R 1 -t DNA -A F -k 1.2 -m 0.7 -s 20 -l 12 -T high -K -M HKY -q 0.95 -n 10
     3. FastTree searching on a huge protein dataset for fast inspection
          run_get_phylomarkers_pipeline.sh -R 1 -A F -t PROT -m 0.6 -k 1.0 -T lowest
     4. To run the pipeline on a remote server, we recommend using the nohup command upfront, as shown below:
        nohup run_get_phylomarkers_pipeline.sh -R 1 -t DNA -S 'TNe,TVM,TVMe,GTR' -k 1.0 -m 0.75 -T high -N 5 &> /dev/null &
     5. Run in population-genetics mode (generates a table with descritive statistics for DNA-polymorphisms 
          and the results of diverse neutrality test)
      run_get_phylomarkers_pipeline.sh -R 2 -t DNA

   NOTES
     1: run from within the directory holding core gene clusters generated by get_homologues.pl -e or
          compare_clusters.pl with -t no_genome_gbk files (core genome: all clusters with a single gene copy per genome)
     2: If you encounter any problems, please run the script with the -D -V flags added at the end of the command line,
          redirect STOUT to a file and send us the output, so that we can better diagnose the problem.
      e.g. run_get_phylomarkers_pipeline.sh -R 1 -t DNA -k 1.0 -m 0.7 -s 8 -l 10 -T high -K -D &> ERROR.log
      


For the details, please visit the manual

4.2 Searching for the best core-genome phylogeny of the pIncA/C plasmids with GET_PHYLOMARKERS

# assumes that you are within test_sequences/pIncAC_homologues/ or in test_sequences/core_BCM
cd core_BCM

# 1. make sure we have the same nuber of faa and fna cluster files; 
#    Note that GET_PHYLOMARKERS will check that for you automatically ;)
find . -name '*.faa' | wc -l  # 31
find . -name '*.fna' | wc -l  # 31

# 2.1 view the help menu of the main script run_get_phylomarkers_pipeline.sh
run_get_phylomarkers_pipeline.sh -h

# 2.2 run the pipeline under default values. See the manual for the many additional options
run_get_phylomarkers_pipeline.sh -R 1 -t DNA -k 1.0 # takes 55 seconds on the above-mentioned machine

4.2.1 Analyzing the pipeline’s output

GET_PHYLOMARKERS prints informative progress messages to screen and to a logfile. Compare them with the pipeline’s flowchart shown in Fig. 8 to make sure you understand what computations are performed at each step.

  • It first reports that the input .*fna and .faa files contain the same number of sequences and a single instance of eack genome, as expected for orthologous clusters
  • next it will perform the multiple sequence alignments in parallel for the 31 consensus loci to generate codon-alignments
  • the phi(w) test is the first filter in the pipeline, identifying recombinant sequences. Note that some of the loci lack of sufficient polymorphisms for the test to work, as signaled by the warning messages. There is no evidence for recombinant loci in the test set
  • parallel IQ-TREE gene tree searches follow
  • the kdetrees test identifies “outlier phylogenies”, one outlier identified along with 5 loci that produce “trees” with < 5 branches
  • average support values of gene trees are computed for the remaining 25 loci to discard poorly resovled ones: 10 in this dataset
  • a supermatrix is generated from the 15 alignments passing the previous filters
  • IQ-TREE searches for the best model and tree for the supermatrix, reporting the best model and lnL score found
  • the script finally does some cleanup in the different working directories

4.2.1.1 The PhiPac directory holds the results of the recombination tests


# >>> explore the directories:
ls
cd get_phylomarkers_run_AIR1tDNA_k1.5_m0.7_Tmedium_30Jan18 # holds the compressed input 
                                                           # FASTAS and alignments

# >>> cd into PhiPac/ dir, which holds the results of the recombination tests.
cd PhiPack 

# have a peak a the file
head Phi_results_30Jan18.tsv
## ./1961_hypothetical_protein_cdnAln_Phi.log   1.00e+00    1.00e+00
## ./1962_DNA_topoisomerase_II_cdnAln_Phi.log   1.00e+00    1.00e+00
## ./1964_hypothetical_protein_cdnAln_Phi.log   1.00e+00    1.00e+00
## ./1966_site-specific_DNA_me_cdnAln_Phi.log   1.00e+00    1.00e+00
## ./1967_hypothetical_protein_cdnAln_Phi.log   1   1   TOO_FEW_INFORMATIVE_SITES
## ./1968_hypothetical_protein_cdnAln_Phi.log   1.00e+00    1.00e+00
## ./1969_putative_type_I_rest_cdnAln_Phi.log   1   1   TOO_FEW_INFORMATIVE_SITES
## ./1970_KorB_cdnAln_Phi.log   1.00e+00    1.00e+00
## ./1971_ParA_cdnAln_Phi.log   1.00e+00    1.00e+00
## ./1984_hypothetical_protein_cdnAln_Phi.log   1.00e+00    1.00e+00

4.2.1.2 The non_recomb_cdn_alns/non_recomb_FAA_alns directories hold the results of phylogenetic tests on gene trees: kdetrees, model-selection and signal

Depending on the type of the input sequences (-t DNA|PROT), the results of the different phylogenetic filters performed on gene trees will live either in the non_recomb_cdn_alns (-t DNA) or in the non_recomb_FAA_alns (-t PROT) directories.

# >>> cd into the directory holding the non-recombinant sequences, 
# which contains the results of several phylogenetic attributes of the loci
cd non_recomb_cdn_alns/

# 1.1 to identify the kde-test outlier loci see the contents of kde_outliers, in this case only 1 "outlier" locus was detected
ls kde_outliers

# 1.2 the details of the kdetrees test are provided in tsv-formatted file kde_dfr_file_all_gene_trees.tre.tab
head kde_dfr_file_all_gene_trees.tre.tab
## file kde_topo_dens   kde_topo_test   kde_bl_dens kde_bl_test kde_bl_topo_test
## 1961_hypothetical_protein_cdnAln.fasta.treefile  0.288751935193142   ok  63.2675110841703    ok  ok
## 1962_DNA_topoisomerase_II_cdnAln.fasta.treefile  0.252919761356219   ok  28.7241410392773    ok  ok
## 1964_hypothetical_protein_cdnAln.fasta.treefile  0.287751455109637   ok  17.2677657886815    ok  ok
## 1966_site-specific_DNA_me_cdnAln.fasta.treefile  0.240400359789596   outlier 27.0113341481227    ok  outlier
## 1967_hypothetical_protein_cdnAln.fasta.treefile  0.252639796140819   ok  95.8112601171774    ok  ok
## 1968_hypothetical_protein_cdnAln.fasta.treefile  0.297536609843425   ok  96.2903342166101    ok  ok
## 1969_putative_type_I_rest_cdnAln.fasta.treefile  0.314498361683638   ok  95.2603170008101    ok  ok
## 1970_KorB_cdnAln.fasta.treefile  0.331685645962565   ok  79.0447022830493    ok  ok
## 1971_ParA_cdnAln.fasta.treefile  0.334991727858742   ok  55.4211725899592    ok  ok

# 2.1 have a look at the counts of best-fitting models selected
cat IQT_best_model_counts_for_gene_trees.tsv
## model    count
## HKY+F    2
## HKY+F+G4 1
## K2P  26
## K2P+G4   1
## TIMe 1

# 2.2 and locus-specific stats are provied in IQT_DNA_gene_tree_Tmedium_stats.tsv
head  IQT_DNA_gene_tree_Tmedium_stats.tsv
## alignment    wc_secs CPU_secs    lnL model   s_type
## ./1961_hypothetical_protein_cdnAln.fasta.log 0.164   0.112   -1107.994    K2P    IQTdnaTmedium
## ./1962_DNA_topoisomerase_II_cdnAln.fasta.log 0.251   0.184   -4124.482    HKY+F+G4   IQTdnaTmedium
## ./1964_hypothetical_protein_cdnAln.fasta.log 0.284   0.172   -1225.145    K2P+G4 IQTdnaTmedium
## ./1966_site-specific_DNA_me_cdnAln.fasta.log 0.204   0.108   -2183.089    K2P    IQTdnaTmedium
## ./1967_hypothetical_protein_cdnAln.fasta.log 0.089   0.072   -870.935     K2P    IQTdnaTmedium
## ./1968_hypothetical_protein_cdnAln.fasta.log 0.099   0.076   -1179.667    K2P    IQTdnaTmedium
## ./1969_putative_type_I_rest_cdnAln.fasta.log 0.064   0.060   -1111.931    K2P    IQTdnaTmedium
## ./1970_KorB_cdnAln.fasta.log 0.130   0.116   -2112.818    TIMe   IQTdnaTmedium
## ./1971_ParA_cdnAln.fasta.log 0.129   0.072   -1157.621    K2P    IQTdnaTmedium

# 3. graphical summaries of the results of the kdetrees and the distribuions of support-values are provided as svg files.
ls -1 *svg
## dotplot_and_bxplot_kdeDensity_dist_dissim_topo_TRUE.svg
## parallel_bxplots_kdeDensity_dist_dissim_topo_TRUE-FALSE.svg
## scatterplot_for_gene_tree_support_values.svg

Figure 9 below depicts the results of the non-parametric kdetrees test, run at the default stringency level of k = 1.5. As depicted on the graph, only one outlier is detected based on the topology (lower panel). Note that this and following graphs plot the results of a small (toy) dataset. They become more useful with larger ones.

IMPORTANT NOTE: to visualize the figures, you need to access the corresponding files from your local host, as the Docker container does not provide a graphical environment with visualization tools! Just open a second terminal (or use your system’s file browser) to access the working directory and display the figures.

Figure 10 depicts a scatterplot and a histogram summarizing the distribution of mean SH-support values for the 25 gene trees that reached this point in the pipeline.

4.2.1.3 The top_N_markers_geNperc directory holds the supermatrix and species tree estimated from the concatenated top-ranking markers and other phylogenetic analyses performed on them

Finally we will inspect the contents of the top_15_markers_ge70perc/ directory, which holds the top-scoring markers that passed the above-mentioned filters (Fig. 1). In this directory you will find the supermatrix resulting from their concatenation and the “species-tree” estimated from it under the best-fitting model identified by ModelFinder/IQ-TREE. It contains also the resutls of RF-distances of gene trees to species-tree and the global molecular clock test results if the main script was invoked with -R 1 -t DNA -K

# >>> inspecting the contents of the top_15_markers_ge70perc/ directory
cd top_15_markers_ge70perc

# The concatenation coordinates for the supermatrix are saved in concatenation_coordinates.txt
cat concatenation_coordinates.txt

## concatenation coordinates:
## 1961_hypothetical_protein_cdnAln.fasta = 1-615
## 1967_hypothetical_protein_cdnAln.fasta = 616-1113
## 1968_hypothetical_protein_cdnAln.fasta = 1114-1785
## 1970_KorB_cdnAln.fasta = 1786-2961
## 1971_ParA_cdnAln.fasta = 2962-3642
## 1984_hypothetical_protein_cdnAln.fasta = 3643-3975
## 1989_hypothetical_protein_cdnAln.fasta = 3976-4830
## 1990_hypothetical_protein_cdnAln.fasta = 4831-5346
## 1994_DSBA_oxidoreductase_cdnAln.fasta = 5347-6204
## 1995_putative_signal_pept_cdnAln.fasta = 6205-7110
## 1996_hypothetical_protein_cdnAln.fasta = 7111-7962
## 1997_hypothetical_protein_cdnAln.fasta = 7963-8403
## 1998_hypothetical_protein_cdnAln.fasta = 8404-8922
## 1999_hypothetical_protein_cdnAln.fasta = 8923-9381
## 2012_TraF_cdnAln.fasta = 9382-10395

# some phylogenetic properties of the markers are summarized in sorted_aggregated_support_values4loci.tab
# graphical analysis of RF-distances of gene-trees to the species-tree is found in 
#  scatterplot_RF-dist_of_gene_trees2concat_phylo.svg

# The final tree concat_nonRecomb_KdeFilt_iqtree_GTR+F+ASC_ed.sptree can be conveniently 
#  visualized and edited with figtree
figtree concat_nonRecomb_KdeFilt_iqtree_GTR+F+ASC_ed.sptree &

Figure 11 depicts the RF-distances of the 15 top-scoring markers to the ML tree inferred under the GTR+F+ASC substitution model from the concatenation of these loci (shown in Fig. 12)

Figure 12 depicts the best ML “species tree” inferred under the GTR+F+ASC substitution model from the supermatrix of 15 top-scoring markers. The nodes are colored according to the legend. The values on the left correspond to approximate Bayes branch support values and those to the right to the UFBoot values described in the manual.

4.3 Estimating the ML pan-genome phylogeny of the pIncA/C plasmids with GET_PHYLOMARKERS

# assumes that you are within test_sequences/pIncAC_homologues/, or in test_sequences/pan_genome
cd pan_CM

# 1. find the pangenome_matrix
ls pangenome_matrix*


# 2. run estimate_pangenome_phylogenies.sh launching 5 independent iqtree searches, 
#    fitting binary (BIN) models
estimate_pangenome_phylogenies.sh -f pangenome_matrix_t0.fasta -r 5

4.3.1 Inspecting and visualizing the output of the ML pan-genome tree

The estimate_pangenome_phylogenies.sh reports its progress to the screen, informing that: - it created and moved into subdir iqtree_PGM_5_runs, which holds the results of the analysis - the best-fitting binary model was GTR2+FO - after determining the best model, it performs 10 independent IQ-TREE searches - Best IQ-TREE run was: abayes_UFBboot_run5.log:BEST SCORE FOUND : -1897.260 - wrote file best_PGM_IQT_abayes_UFBboot_run9_GTR2+FO.treefile in pan_CM/iqtree_PGM_5_runs/iqtree_10_runs

Lets have a look at the tree search profile

cd iqtree_PGM_10_runs/iqtree_10_runs

cat sorted_lnL_scores_IQ-TREE_searches.out
## abayes_UFBboot_run3.log:BEST SCORE FOUND : -1897.260
## abayes_UFBboot_run2.log:BEST SCORE FOUND : -1897.260
## abayes_UFBboot_run4.log:BEST SCORE FOUND : -1897.260
## abayes_UFBboot_run1.log:BEST SCORE FOUND : -1897.260
## abayes_UFBboot_run5.log:BEST SCORE FOUND : -1897.260

No wonder, in such a small dataset, all searches find exactly the same best tree. However, this will be less likely the case as the number of sequences in the dataset increases, as the size of the tree-spece increases factorialy with each new sequence.

figtree best_PGM_IQT_abayes_UFBboot_run5_GTR2+FO.treefile &

Figure 13 displays the best ML pan-genome tree, again using figtree. The nodes are colored according to the legend. The first value corresponds to approximate Bayes branch support values and second ones to the UFBoot values described in the manual.

4.4 Estimating the pan-genome phylogeny of the pIncA/C plasmids under the parsimony criterion with GET_PHYLOMARKERS

# assumes that you are within test_sequences/pIncAC_homologues/ or test_sequences/pan_genome
cd pan_CM

# 1. find the pangenome_matrix
ls pangenome_matrix*


# 2. run estimate_pangenome_phylogenies.sh launching 10 independent iqtree searches on 
#     all available cores (-n 4 in this case), fitting binary (BIN) models

estimate_pangenome_phylogenies.sh -c PARS -R 3 -i pangenome_matrix_t0.phylip -n 4 -b 25 -j 10 -t 1

4.4.1 Inspect and visualize the output of the parsimony pan-genome tree

After changing into the boot_pars/ dir we can edit and visualize the most parsimonious tree found among 100 pars (from the PHYLIP package) searches split on 4 cores (-n 4) with 10 taxon jumples searches using again figtree. See the manual for the details.

cd boot_pars
figtree full_pars_tree_rooted_withBoot_ed.ph &

Figure 14 displays the most parsimonious pan-genome tree, again using figtree. The nodes are colored according to the legend. The nodes are colored according to the legend, which represents standard bootstrap support values computed by seqboot from the PHYLIP package.

If you are running the tutorials from a Docker image instance, do you remember how to exit the container? It’s with exit

That’s it, enjoy!

5 Developers

The code is developed and maintained by Pablo Vinuesa at CCG-UNAM, Mexico and Bruno Contreras-Moreira at EEAD-CSIC, Spain. It is released to the public domain under the GNU GPLv3 license.

6 Citations

7 Source

8 Docker images for GET_HOMOLOGUES and GET_PHYLOMARKERS

9 Documentation

10 Acknowledgements

10.1 Personal

We thank Alfredo J. Hernández and Víctor del Moral at CCG-UNAM for technical support.

10.2 Funding

We gratefully acknowledge the funding provided by DGAPA-PAPIIT/UNAM (grants IN201806-2, IN211814, IN206318 and IN209321) and CONACyT-Mexico (grants P1-60071, 179133, FC-2015-2-879 and A1-S-11242) to Pablo Vinuesa, as well as the Fundación ARAID,Consejo Superior de Investigaciones Científicas (grant 200720I038 and Spanish MINECO (AGL2013-48756-R) to Bruno Contreras-Moreira.