1 About the tutorials
- 1.1 ASSUMPTIONS
  - 1.1.1 RELATED RESOURCES
- 1.2 AUTHOR
  - 1.2.1 SOURCE REPOSITORY
  - 1.2.2 LICENSE
2 Multiple sequence alignment (MSA) using the CLUSTAL family of programs
3 PRELIMINARIES - setting up your working directory
4 Clustalw: background and exercises
5 Multiple sequence alignment with clustal${\Omega}$
6 Profile hidden Markov models, and HMMER3
7 Recommended readings and references

1 About the tutorials

The following tutorials teach how to run clustalw2 and clustal-omega from the command line to generate standard, profile-profile, sequence-profile, and HMM-assisted multiple sequence alignments (MSAs).

It will also demonstrate how to build a profile hidden Markov models (HMMs) from an MSA, calibrate, and use it to search for distant homologs with high sensitivity and specificity, with programs from the HMMER3 suite.

All exercises will be performed in a Linux environment. Therefore, this tutorial will also introduce some useful AWK, Bash, and Perl idioms and scripts to analyze the results efficiently from the command line.

1.1 ASSUMPTIONS

This tutorial expects that the user has access to a Unix/Linux environment with the standard Bash shell installed, as well as the clustalw2, clustal-omega and HMMER3 programs installed, and found in the $PATH. These programs are freely available for download from the clustal.org and HMMER.org servers.

1.2 AUTHOR

These tutorials were designed and written by Pablo Vinuesa, CCG-UNAM, Cuernavaca, Mexico (X - @pvinmex), for the International Workshops on Bioinformatics, TIB2024, my course on comparative genomics taught at the Bachelor’s Program in Genome Science, LCG-UNAM, and the Workshop on Genome Sciences, Facultad de Ciencias, UNAM. Although these courses are taught in Spanish, I’ve written this and the other TIB2024 tutorials available from my GitHub repository in English, hoping that they will be helpful to a broader readership.

1.2.1 SOURCE REPOSITORY

This and multiple other tutorials are available through the following TIB-filoinfo GitHub repository, under the licensing terms provided below.

1.2.2 LICENSE

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0

2 Multiple sequence alignment (MSA) using the CLUSTAL family of programs

Clustal.org Webpage.

Multiple sequence alignments (MSAs) are essential in most bioinformatics analyses that involve comparing homologous sequences. The exact or exhaustive approach to computing an optimal alignment has a computational complexity of $O(L^N)$ for $N$ sequences of length $L$, making it prohibitive for even small numbers of sequences (Sievers et al. 2011). Therefore, all MSA algorithms implement some sort of heuristics (see Katho K. ed. 2021), such as the progressive multiple sequence alignment algorithm first implemented in the the popular Clustal family programs.

Clustal programs are used for multiple sequence alignment of DNA and protein sequences. The software and its algorithms have gone through several iterations, with $Clustal{\Omega}$ (Omega) being the latest version, published in of 2011 (Sievers et al. 2011). It is available as standalone software, via a web interface, and through a Web server hosted by the European Bioinformatics Institute.

Clustal has been an important bioinformatic software, with two of its academic publications ranking among the top 100 most cited papers of all time.

Here we will demonstrate the use of the current command-line versions: clustalw2/clustalX and clustal-omega to perform:

Standard multiple sequence alignments of protein sequences
Profile to profile (p2p), and sequence to profile (s2p) alignments
Build CDS alignments guided by the alignment of the the corresponding translation products (proteins)
Use a profile hidden Markov model to guide the alignment of protein sequences using $clustal-omega$

Both $Clustalw2$ and $Clustal{\Omega}$ can read multiple sequence formats, including CLUSTAL and FASTA, and write the computed alignments in even more formats, including STOCKHOLM, NEXUS, and PHYLIP.

3 PRELIMINARIES - setting up your working directory

The following code snippets will set up the working directory for the tutorials, and fetch or generate symbolic links to the required sequences and scripts.

# i) Change directory into your working directory and make the following new directory to hold the sequence data for the exercises
if [ ! -d sesion3_clustal_y_hmmer ]; then
    mkdir sesion3_clustal_y_hmmer && cd sesion3_clustal_y_hmmer
else
   cd sesion3_clustal_y_hmmer
fi 

# ii) use wget on the command line to fetch the required files from my TIB-filoinfo GitHub repo
wget -c https://github.com/vinuesa/TIB-filoinfo/raw/master/docs/sesion4_alineamientos/sequences_for_alingment.tgz

if [ -s sequences_for_alingment.tgz ] 
then
     if file sequences_for_alingment.tgz | grep gzip &> /dev/null
     then 
         echo "unpacking sequences_for_aligment.tgz ..."
         tar -xzf sequences_for_alingment.tgz 
     else 
         echo "ERROR: This is not a gzip file"
     fi
else
    echo "ERROR: sequences_for_aligment.tgz not found"
fi

# iii) Download the required scripts to your ~/bin directory
base_url=https://raw.githubusercontent.com/vinuesa/TIB-filoinfo/refs/heads/master

if [ -d ~/bin ]; then
    cd ~/bin
else
    mkdir ~/bin && cd ~/bin
fi

# This is a Bash array holding the list of scripts to be downloaded
scripts=(col_sumStats.sh split_fasta.pl translate_fastas.pl prot2cdnAln.pl convert_aln_format_batch_bp.pl fasta_toolkit.awk extract_N-or_C-terminal_regions_and_compute_counts.pl select_sequences_by_ID.pl)

# Iterate over the array elements, fetching each script with 'wget -c' and making them executable with 'chmod 755'
for f in "${scripts[@]}"
do
    wget -c "$base_url"/"$f"   
    [ -s "$f" ] && chmod 755 "$f"
done 

# Change directory back to the previous working directory
cd -

# list the directory contents
pwd && ls

# make a symlink to the Human_Ras_Superfam_proteins_Wennerberg_JCS05.faa sequences and Ras_fam_PF00071.hmm
ln -s /export/space3/users/vinuesa/DBs/sequences/Human_Ras_Superfam_proteins_Wennerberg_JCS05.faa .
ln -s /export/space3/users/vinuesa/DBs/Pfam/Ras_fam_PF00071.hmm .

We are ready to start the hands-on tutorials on running clustal and hmmer from the command line.

4 Clustalw: background and exercises

Clustalw implements a fast version of the progressive multiple sequence alignment algorithm (sketched below), featuring a fast and simple method for making “guide trees.” These are computed using fast clustering methods (UPGMA or NJ) of a distance matrix, which is calculated from the sequences using either exhaustive (dynamic programming) or heuristic (k-mer distances) methods.

Schematic representation of the progressive MSA algorithm.

The resulting guide trees are used to define the order of alignment during the later progressive alignment phase. The general idea is to start with alignments of just two sequences, usually the closest ones in the data set. The alignment is then built up by aligning sequences to sequences (s-s), alignments (profiles) with each other (p2p alignments), or sequences to alignments (s2p), according to the topology of the guide tree.

Definng the order of pair-wise s-s, s-p, and p-p alignments used by the progressive MSA construction algorithm with the aid of a guide tree.

The complexity of the guide tree construction is usually $O(N^2)$ for $N$ sequences because all $N$ sequences have to be compared to each other. Earlier versions of Clustal used fast word based alignments for these comparisons, which made it memory efficient and fast enough to work on personal computers with up to a few hundred sequences (Sievers & Higgins, 2018).

Clustalw can be freely obtained from the Clustalw2 webpage. Clustal 2 comes in two flavors: the command-line version Clustal W, and the graphical version Clustal X. Precompiled executables for Linux, Mac OS X, and Windows are available.

Clustalw2 Webpage.

4.1 Basic Clustalw2 command-line options

Here we are going to use only the command-line version Clustalw.

Note that:

the $clustalw$ and $clustalw2$ calls are equivalent, and execute the same clustalw version 2.1 binary
the alignment algorithms in $Clustalw$ and $Clustalx$ are exactly the same
only $Clustalx$ has a graphical user interface

Below is a compilation of some of the most frequently used $clustalw$ command-line options:

#======================================================================================
# Basic clustalw usage and options
#--------------------------------------------------------------------------------------

# explore the manuals
man clustalw

clustalw -help
clustalw -fullhelp

# minimal clustal command line call
clustalw -infile=fasta_no_alineado -outfile=archivo_cluwaln.afa

# clustalw call with modified parameters
clustalw -infile=archivo.fasta -align -pwmatrix=blosum -pwgapopen=12
-pwgapext=0.2 -matrix=blosum -gapopen=12 -gapext=0.2 -outorder=aligned 
-convert -outfile=archivo_aln1.phy -output=<(phylip|clustal|fasta|nexus)>

4.2 Standard and customized clustalw alignment of prokaryotic and eukaryotic GDP (Glycerol-3-phosphate dehydrogenase [NAD(+)], cytoplasmic) protein sequences: the impact of alignment parameters

The next lines show how to align the the provided FASTA files with GDP sequences from prokaryotes and eukaryotes, writing the resulting multiple sequence alignments (MSAs) in FASTA or Clustal formats.

# 1. default clustalw alignments
clustalw -infile=4_GDP_procar_ualn.faa -outfile=4_GDP_procar_cluAln.faa -output=fasta
clustalw -infile=6_GDP_eucar_ualn.faa -outfile=6_GDP_eucar_cluAln.faa -output=fasta

# 2. This clustalw call uses ~ 10% higher gap-open and gap extension penalties, both for the pairwise and multiple sequence alignment blocks
clustalw -infile=6_GDP_eucar_ualn.faa -outfile=6_GDP_eucar_cluAln2.aln -output=clustal -pwgapopen=13 -pwgapext=0.22 -gapopen=13 -gapext=0.22

4.3 Profile-profile alignment with clustalw, and writing output in Clustal’s native clustal format

# 1. compute a profile to profile alignment of the default alignments
clustalw -profile -profile1=6_GDP_eucar_cluAln.faa -profile2=4_GDP_procar_cluAln.faa -outfile=10_GDP_eucar_prokar_prof2prof.aln -output=clustal

# 2. compute a profile to profile alignment using the eukaryotic 6_GDP_eucar_cluAln2.aln alignment
clustalw -profile -profile1=6_GDP_eucar_cluAln2.aln -profile2=4_GDP_procar_cluAln.faa -outfile=10_GDP_eucar_prokar_prof2prof2.aln -output=clustal

4.3.1 The clustal format

The Clustal format can be conveniently visualized directly in the terminal.

The following command will allow us to compare the N-terminal regions of both profile alignments.

# display the 10_GDP_eucar_prokar.aln MSA written in clustal format
head -15 10_GDP_eucar_prokar_prof2prof.aln 10_GDP_eucar_prokar_prof2prof2.aln

==> 10_GDP_eucar_prokar_prof2prof.aln <==
CLUSTAL 2.1 multiple sequence alignment


gpdhuman        ---------------MASKK----------------VCIVGSGNWGSAIAKIVGGNAAQL
gpdarabit       ----------------AGKK----------------VCIVGSGDWGSAIAKIVGGNAAQL
gpdamouse       ----------------AGKK----------------VCIVGSGNWGSAIAKIVGSNAGRL
gpdadrome       ---------------MADKVN---------------VCIVGSGNWGSAIAKIVGANAAAL
gpdacaeel       ---------------MSPKK----------------VTIIGSGNWGSAIARIVGSTTKSF
gpd1yeast       MSAAADRLNLTSGHLNAGRKRSSSSVSLKAAEKPFKVTVIGSGNWGTTIAKVVAENCKGY
gpdaecoli       -------------MNQRNAS----------------MTVIGAGSYGTALAITLARNGHEV
gpdahaein       -------------MITSQTP----------------ITVLGAGSYGTALAITFSRNGSPT
gpdpseu         -----------------------------------------MTSCCGAVAKTRWRN----
gpdabacsu       -----------------MKK----------------VTMLGAGSWGTALALVLTDNGNEV
                                                           .   ::*     .    
                                                           
==> 10_GDP_eucar_prokar_prof2prof2.aln <==
CLUSTAL 2.1 multiple sequence alignment

gpdhuman        -------------------------------MASKKVCIVGSGNWGSAIAKIVGGNAAQL
gpdarabit       --------------------------------AGKKVCIVGSGDWGSAIAKIVGGNAAQL
gpdamouse       --------------------------------AGKKVCIVGSGNWGSAIAKIVGSNAGRL
gpdadrome       ------------------------------MADKVNVCIVGSGNWGSAIAKIVGANAAAL
gpdacaeel       -------------------------------MSPKKVTIIGSGNWGSAIARIVGSTTKSF
gpd1yeast       MSAAADRLNLTSGHLNAGRKRSSSSVSLKAAEKPFKVTVIGSGNWGTTIAKVVAENCKGY
gpdaecoli       -----------------------------MNQRNASMTVIGAGSYGTALAITLARNGHEV
gpdahaein       -----------------------------MITSQTPITVLGAGSYGTALAITFSRNGSPT
gpdpseu         -----------------------------------------MTSCCGAVAKTRWRN----
gpdabacsu       ---------------------------------MKKVTMLGAGSWGTALALVLTDNGNEV
                                                           .   ::*     .    
... TRUNCATED


Note that clustal uses the following symbols below each alignment block to highlight conserved residues:
*   (asterisk)  positions that have a single and fully conserved residue
:   (colon)     conserved: conservation between groups of strongly similar properties (score > 0.5 on the PAM 250 matrix)
.   (period)    semi-conserved: conservation between groups of weakly similar properties (score ≤ 0.5 on the PAM 250 matrix)
    (blank)     non-conserved

Notice the long N-terminal extension of the yeast sequence.

What do you think about the two MSAs, which one do you think is more likely correct, and why?

4.3.2 Using seaview to visualize and edit MSAs

Seaview is a multi-platform, graphical user interface for multiple sequence alignment and molecular phylogeny.

Seaview reads and writes various file formats (NEXUS, MSF, CLUSTAL, FASTA, PHYLIP, MASE, Newick) of DNA and protein sequences and of phylogenetic trees.
Seaview drives programs muscle or Clustal Omega for multiple sequence alignment, and also allows to use any external alignment algorithm able to read and write FASTA-formatted files.
Seaview drives the Gblocks program to select blocks of evolutionarily conserved sites.
Seaview computes phylogenetic trees by
- parsimony, using PHYLIP’s dnapars/protpars algorithm,
- distance, with NJ or BioNJ algorithms on a variety of evolutionary distances,
- maximum likelihood, driving program PhyML 3.1.
Seaview can use the Transfer Bootstrap Expectation method to compute the bootstrap support of PhyML and distance trees.
Seaview uses the Treerecs method to reconcile gene and species trees.
SeaView prints and draws phylogenetic trees on screen, SVG, PDF or PostScript files.
SeaView allows to download sequences from EMBL/GenBank/UniProt using the Internet.

Seaview can be run from the command line, and the options can be found with:

# display the help menu
seaview -h

or here: seaview command manual

To visualize an alignment, simply call $seaview$ with the alignment file name as single argument on the command line:

seaview 10_GDP_eucar_prokar_prof2prof2.aln &

Visualization of an MSA with seaview.

4.4 Multiple sequence alignment of protein-coding sequences (CDSs)

We will first attempt to perform a default MSA of nine full-length leuA sequences from diverse Bacillales using $clustalw$

Lets firs count the sequences, and have a look at them

# count sequences
grep -c '^>' leuA-Bacillales.fas

# display file
cat leuA-Bacillales.fas

9
>B_anthracis_Ames 30261500
atgaagcaaattttgttcatggatacgacgctacgtgatggcgaacaatcaccaggagtaaatttaaatgaacaagagaa
attgcaaattgcaaggcaattagagagactagggattcatgtcatggaagctggctttgcagcagcttccgaaggggatt
ttcaatcggtaaagcgcatcgcaaatacaattcaaaatgctacagttatgagtttagcaagagcgaaagaaagtgatatt
cgaagagcatatgaagctgtgaaaggtgcggtatctcctcgtttacacgtatttttagctacgagtgatattcatatgaa
gtataaactttgtatgtcaaaagaagatgtattagatagtatttatcgatcggttacacttgggaaatcgttatttccga
cagtacaattttcagcagaagatgcaacaagaacagcaagagattttttagctgaagcggtagaagttgcgattcgtgca
ggagcgaatgtaattaatattccagatacagttgggtatacgaatccagaagaatattatgctctttttaaatatttaca
agaatccgttccttcatatgaaaaagcaattttctcttgtcattgtcatgatgatcttgggatggcggtagcaaattcat
tagctgcagttgaaggcggagcgttgcaagtagaaggaacaattaatgggattggagaaagagcagggaatgcggcgtta
gaagaggttgcagttgctcttcatattaggaaagatttctataaagcagagccttctatgacgctaaaagaaattaaagc
gacaagtacattagtaagcagattaacgggtatggttgtatcgaaaaataaagcgattgttggagcaaatgcattcgctc
atgaatcaggcattcatcaagatggtgtattaaaagaagtgacaacatatgaaattattgaaccagcgctagtaggcgaa
tctcaaaatctatttgtacttggaaaacattctgggcgtcacgcgtttacagaaaaaatgaaagaacttggctatgaatt
tacaaacgacgagcgggatgcagtatttgaagcatttaaaaaattagctgatcgtaagaaggaaattactgaggaagatt
tacgagcacttatgcttggtgaagcagcatttgcggctcagcaatataacattacgcaattgcaagtacatttcgtatca
aatagtacacaatgtgcgacagttgtactaaaagatgaggaaggaaatgtattcgaagatgcagctactggctctggtag
tattgaagcgatttataatgcaatccaaagaattttaggattggaatgtgaattggcggattatcgcatacaatctatta
cacaaggtcaagatgcactggctcatgttcatgttgaattaaaagaaggagcacatcaagtatcaggttttggtgttgca
caagacgtattggaagcgtcggcaagagcatatgtccacgcagcggggaaattgaaatcttttatccagcttgtgaaata
a
... truncated

Do you think the sequence shown above might be a complete CDS? Why?

4.4.1 Default alignment of CDSs (the incorrect way of doing it!)

Align the leuA sequences under default parameters:

# generate the MSA
clustalw -infile=leuA-Bacillales.fas -outfile=leuA-Bacillales.aln

# display the MSA
cat leuA-Bacillales.aln

... truncated
B_anthracis_Ames                TGTTGAATTAAAAGAA--GGA-GCACATCAAGTATCAGGTTTTGGTGTTG
B_thuringiensis_konkukian       TGTTGAATTAAAAGAA--GGA-GCACATCAAGTATCAGGTTTTGGTGTTG
B_cereus_ATCC14579              TGTTGAATTAAAAGAA--GGT-AGTCACCAAATATCAGGTTTCGGAGTTG
B_licheniformis_ATCC_14580      CGTCAGGGTG-ATGAT--CGATGGAAAAGAATCGGGAGGACGCGGGGTTG
B_subtilis                      TGTAAGAGTT-TTGGT--GAACGGAAAAGAATCAGCAGGGCGGGGCATAG
B_halodurans                    TGTTCAAATGGATTAT--CAA-GGAGTCGTCTCAAGTGGACGCGGCACGG
O_iheyensis                     TGTTCAAATT-ATTAT--TAATGGAGAAACAATCAATGGTAGAGGGTCTG
L_innocua                       TGTCGTCATTAAAGATGATAATGGCACTGAATTCCATGGTATCGGTATCG
L_monocytogenes_4b_F2365        TGTCGTCATTAAAAATGACAAAGGCGCTGTGTTCCATGGTATCGGTATTG
                                 **     *                            **    **    *

B_anthracis_Ames                CACAAGACGTATTGGAAGCGTCGGCAAGAGCATATGTCCACGCAGCGGGG
B_thuringiensis_konkukian       CACAAGACGTATTGGAAGCGTCGGCAAGAGCATATGTCCACGCAGCGGGG
B_cereus_ATCC14579              CGCAAGACGTATTGGAAGCGTCGGCGAGAGCTTATGTCCATGCAGCAGGG
B_licheniformis_ATCC_14580      CCCAGGATGTTCTCGAAGCATCTGCCAAAGCCTACTTGAATGCTGTGAAC
B_subtilis                      CGCAAGACGTATTAGAAGCATCAGCGAAAGCCTATTTGAACGCAGTAAAC
B_halodurans                    CCCATGATGTGTTAGAAGCGTCAGCAAAAGCCTATTTAAACGCAGTTAAT
O_iheyensis                     CTCAGGATGTGCTTCAAGCGTCTGCAGTAGCATTCATTCAGGCAGTAAAT
L_innocua                       ATTTTGATGTTTTAACAGCTAGTGCTAAAGCTTATTTGCAAGCATCTGGA
L_monocytogenes_4b_F2365        ATTTTGACGTTTTAACTGCTAGTGCTAAAGCTTATTTGCAAGCATCAGGC
                                     ** **  *    **    **   *** *   *  * **       

B_anthracis_Ames                AAATTGAAATCTTTTATCCAGCTTGTGAAATAA-----------------
B_thuringiensis_konkukian       AAATTGAAATCTTTTATCCAGCTTGTGAAATAA-----------------
B_cereus_ATCC14579              AAATTAAAATCTCTTATAGCGCTTGTGAAATAA-----------------
B_licheniformis_ATCC_14580      CGCTATCTCGTATTGAAGACGAATACGGAAGGATTGTCAAAACAGGCGGC
B_subtilis                      CGTCAATTGGTTTTCCAGTCGAATATGAGCGGATTGAAAAACCACACAGC
B_halodurans                    CG--AACGATTAATCGAAAGAAATATGCAGAAGTTTATGGATCGAA----
O_iheyensis                     CGTTATTACGT--TCAAAAGAAA-----GCAAATC--TAACTCGACTTGT
L_innocua                       AAAAGCAAATCTACAAGTAAGCA----AGCTGATTTCGAGGAGGTAAAGT
L_monocytogenes_4b_F2365        AAAAGCAAAACCGCCAGTAAGCA----AGCTGATTTCGAGGAGGTAAAAT
... truncated

Carefully scrutinize the MSA shown above. What do you think about the alignment’s quality? Provide your comments.

4.4.2 The correct way of aligning CDSs

The correct way of aligning CDSs requires their translation, and alignment at the protein level, as shown in the following figure:

Strategy to peform CDS alignments correctly.

4.4.2.1 Tranlate CDSs to proteins with the aid of translate_fastas.pl

Here we assume that all CDSs in the FASTA file are in the same +1 frame.

translate_fastas.pl -e fas -t 11

Lets visualize the translated products:

cat leuA-Bacillales_translated.faa

>B_anthracis_Ames 30261500
MKQILFMDTTLRDGEQSPGVNLNEQEKLQIARQLERLGIHVMEAGFAAASEGDFQSVKRI
ANTIQNATVMSLARAKESDIRRAYEAVKGAVSPRLHVFLATSDIHMKYKLCMSKEDVLDS
IYRSVTLGKSLFPTVQFSAEDATRTARDFLAEAVEVAIRAGANVINIPDTVGYTNPEEYY
ALFKYLQESVPSYEKAIFSCHCHDDLGMAVANSLAAVEGGALQVEGTINGIGERAGNAAL
EEVAVALHIRKDFYKAEPSMTLKEIKATSTLVSRLTGMVVSKNKAIVGANAFAHESGIHQ
DGVLKEVTTYEIIEPALVGESQNLFVLGKHSGRHAFTEKMKELGYEFTNDERDAVFEAFK
KLADRKKEITEEDLRALMLGEAAFAAQQYNITQLQVHFVSNSTQCATVVLKDEEGNVFED
AATGSGSIEAIYNAIQRILGLECELADYRIQSITQGQDALAHVHVELKEGAHQVSGFGVA
QDVLEASARAYVHAAGKLKSFIQLVK*
>B_cereus_ATCC14579 30019549
LFMDTTLRDGEQSPGVNLNEQEKLQIARQLERLGIHVMEAGFAAASEGDFQSVKRIANTI
QNATVMSLARAKESDIRRAYEAVKGAVSPRLHVFLATSDIHMKYKLCMSKEDVLNSIHRS
VTLGKSLFPTVQFSAEDATRTSRAFLAEAVEIAIRAGANVINIPDTVGYTNPEEYYSLFE
YLQESVPSYEKAIFSCHCHDDLGMAVANSLAAVEGGALQVEGTINGIGERAGNAALEEVA
VALHIRKDFYKAEPSMTLKEIKATSILVSRLTGMVVPKNKAIVGANAFAHESGIHQDGVL
KEVTTYEIIEPALIGESQNLFVLGKHSGRHAFTEKMKELGYEFTNEERDAAFEAFKKLAD
RKKEITEEDLRALMLGEAAFAAQQYNITQLQVHFVSNSTQCATVVLKDEEGNVYEDAATG
SGSIEAIYNAIQRILGLECELADYRIQSITQGQDALAHVHVELKEGSHQISGFGVAQDVL
EASARAYVHAAGKLKSLIALVK*
... truncated

Note that the stop codons were translated as ’*’ symbols. These are not amino acids, and should be removed. This can be easily achieved with a simple $sed$ one-liner:

# use sed to replace the terminal * symbols by nothing
sed 's/\*$//' Bacillales_translated.faa > k && mv k Bacillales_translated.faa

# visualize the edited translation products
cat leuA-Bacillales_translated.faa

>B_anthracis_Ames 30261500
MKQILFMDTTLRDGEQSPGVNLNEQEKLQIARQLERLGIHVMEAGFAAASEGDFQSVKRI
ANTIQNATVMSLARAKESDIRRAYEAVKGAVSPRLHVFLATSDIHMKYKLCMSKEDVLDS
IYRSVTLGKSLFPTVQFSAEDATRTARDFLAEAVEVAIRAGANVINIPDTVGYTNPEEYY
ALFKYLQESVPSYEKAIFSCHCHDDLGMAVANSLAAVEGGALQVEGTINGIGERAGNAAL
EEVAVALHIRKDFYKAEPSMTLKEIKATSTLVSRLTGMVVSKNKAIVGANAFAHESGIHQ
DGVLKEVTTYEIIEPALVGESQNLFVLGKHSGRHAFTEKMKELGYEFTNDERDAVFEAFK
KLADRKKEITEEDLRALMLGEAAFAAQQYNITQLQVHFVSNSTQCATVVLKDEEGNVFED
AATGSGSIEAIYNAIQRILGLECELADYRIQSITQGQDALAHVHVELKEGAHQVSGFGVA
QDVLEASARAYVHAAGKLKSFIQLVK
... truncated

4.4.2.2 Align the translated products keeping the original input order in the output MSA

Note that it is critical to conserve the original ordering of the sequences from the input file in the output MSA using the $-OUTORDER=input$ option.

# print clustalw help
clustalw -help

# run clustalw
clustalw -infile=leuA-Bacillales_translated.faa -outfile=leuA-Bacillales_translated_clu.faa -output=fasta -OUTORDER=input

4.4.2.3 Use the prot2cdnAln.pl Perl script to reconstruct the CDS MSA using the translation product MSA as guide alignment.

# print the script's help to STDOUT
prot2cdnAln.pl -h

# run the script
prot2cdnAln.pl leuA-Bacillales.fas leuA-Bacillales_translated_clu.faa

4.4.2.4 Compare the leuA-Bacillales_cdnaln.fas codon alignment with the default leuA-Bacillales.aln MSA

Display both the correct codon and default MSAs of the leuA CDSs:

# this one is the correct codon/CDS alignment
seaview leuA-Bacillales_cdnaln.fas &

# this one is the default, incorrect codon/CDS alignment
seaview leuA-Bacillales.aln &

Comparison of the codon and default MSAs of the leuA CDSs.

The output shown above clearly demonstrates that when aligning CDSs, we should first translate them to proteins, perform the alignment of the translation products, and finally reconstitute the CDS alignment using the protein MSA as a guide or template.

5 Multiple sequence alignment with clustal${\Omega}$

Clustal${\Omega}$ is the last incarnation of the Clustal family of MSA programs. It offers a significant increase in scalability over previous versions thanks to the implementation of novel algorithms that avoid bottlenecks of the classic progressive multiple sequence alignment algorithms (Sievers et al. 2011).

5.1 What are the differences between Clustalw and Clustal-omega?

Clustalw, and earlier Culstal versions, use fast word-based alignments (k-mers) for these comparisons, which made it memory efficient and fast enough to work on personal computers. However, as the number of sequences grows above a few thousand, the $O(N^2)$ complexity associated with the computation of pairwise distance matrices to generate “guide trees” becomes very time consuming.

$Clustal\Omega$ implements an $O(NlogN)$ method called $mBed$ (Sievers et al. 2011) that allows guide trees of hundreds of thousands of sequences to be computed by restricting the calculation of sequence alignment scores to NLog(N). This is the main algorithmic improvement providing Clustal Omega with the capacity and scalability required for current genomic data sets.
The second main development in $Clustal\Omega$ was the replacement of the conventional dynamic programming, and profile alignment, with an alignment engine provided by $HHalign$ capable of aligning profile hidden Markov models (HMMs). HMM-HMM alignments had been shown to have very high accuracy for profile HMM alignment (Söding 2005). The use of $HHalign$ provides $Clustal\Omega$ with greatly increased accuracy when compared to earlier Clustal programs, as measured on structure based alignment benchmarks.

This is the paper describing Clustal${\Omega}$, along with its abstract: Sievers et al. 2011. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol. 2011 Oct 11:7:539. doi: 10.1038/msb.2011.75

ABSTRACT Multiple sequence alignments are fundamental to many sequence analysis methods. Most alignments are computed using the progressive alignment heuristic. These methods are starting to become a bottleneck in some analysis pipelines when faced with data sets of the size of many thousands of sequences. Some methods allow computation of larger data sets while sacrificing quality, and others produce high-quality alignments, but scale badly with the number of sequences. In this paper, we describe a new program called Clustal Omega, which can align virtually any number of protein sequences quickly and that delivers accurate alignments. The accuracy of the package on smaller test cases is similar to that of the high-quality aligners. On larger data sets, Clustal Omega outperforms other packages in terms of execution time and quality. Clustal Omega also has powerful features for adding sequences to and exploiting information in existing alignments, making use of the vast amount of precomputed information in public databases like Pfam.

$The Clustal\Omega Webpage.$

The $Clustal\Omega$ Webpage.

5.2 Default alignment of multiple FASTA files with clustal-omega using a for loop

We will repeat the MSA of the two FASTA files containing four bacterial and six eukaryotic GDP protein sequences using $clustalo$, called from a Bash $for$ loop to align all *ualn.faa files found in the current directory.

# call clustalo within a for loop to align all *ualn.faa files found in the working directory
for file in *ualn.faa; do clustalo -i $file -o ${file%.*}_cluOaln.phy --outfmt phy --output-order input-order --threads 4; done

5.3 Profile to profile (p2p) alignment

This $clustalo$ call performs a profile-profile alignment of the two input alignments.

clustalo --profile1 4_GDP_procar_ualn_cluOaln.phy --profile2 6_GDP_eucar_ualn_cluOaln.phy --infmt phy --outfmt clu -o 10_GDP_eucar_prokar_cluOp2p.aln --threads 4

5.4 Sequences to profile (s2p) alignment

This $clustalo$ call performs a sequence (4_GDP_procar_ualn.faa) to profile (6_GDP_eucar_ualn_cluOaln.phy) alignment.

clustalo --profile1 6_GDP_eucar_ualn_cluOaln.phy -i 4_GDP_procar_ualn.faa -o 10_GDP_eucar_prokar_cluOs2p.aln --outfmt clu --threads 4

5.5 Comparison of the p2p and s2p 10_GDP_eucar_proka alignments

The $paste$ command allows us to conveniently display the two MSAs side by side for their comparision.

paste 10_GDP_eucar_prokar_cluOp2p.aln 10_GDP_eucar_prokar_cluOs2p.aln

$Pasted Clustal\Omega p2p vs. s2p GDP alignments in clustal format.$

Pasted $Clustal\Omega$ p2p vs. s2p GDP alignments in clustal format.

Several differences can be seen between the p2p and s2p GDP alignments shown above. Which one do you thinks is the best?

To gain evidence in favor for one or the other, we will perform a new $clustalo$ alignment on the full 10 GDP sequence set using iterations and Kimura correction.

5.6 Alignment of the 10 GDP sequences with clustal-omega using iterations and Kimura correction

Iterations can be used in an attempt to improve the alignment by a process of repeated refinement (Sievers & Higgins, 2014).

Iterative refinement can be accomplished in two ways:

guide-tree refinement and/or
refinement using the multiple alignment output as an external profile HMM.

This process is illustrated in the bottom part of Figure shown below (Sievers & Higgins, 2014).

$Clustal\Omega default and iteration algorithm steps.$

$Clustal\Omega$ default and iteration algorithm steps.

The second iteration scheme uses a profile HMM, derived from an alignment of the input sequences, as an external profile. The first alignment gives an outline of the positions where residues are more likely to align in a second alignment round.

Both iteration methods can be carried out a number of times, separately or in combination. The default is to combine them, as will be shown in the protocol below.

# 1. combine the source FASTA files into a single one named 10_GDP_procar_eukar_ualn.fas
cat 4_GDP_procar_ualn.faa 6_GDP_eucar_ualn.faa > 10_GDP_procar_eukar_ualn.fas

# 2. Perform clustalo MSA with two iterations and using Kimura-corrected distances
clustalo -i 10_GDP_procar_eukar_ualn.fas --iter 2 --use-kimura -o 10_GDP_eucar_prokar_cluoAln2iter.aln --outfmt clu --threads 4

Clustal Omega will first construct a guide tree and align the sequences in the order specified by the guide tree. This alignment is then used in two ways:

A new guide tree is constructed, based on distances derived from the aligned sequences
The alignment is converted into a profile HMM

During the second round of alignment (the first iteration), sequences are aligned to the internally constructed profile HMM, pseudo-count information is transferred to the unaligned sequences, and the ‘enhanced’ sequences are aligned to each other in the order specified by the newly constructed guide tree. The final alignment is written in clustal format to 10_GDP_eucar_prokar_cluoAln2iter.aln .

The first iteration will be followed by another guide-tree construction and HMM pseudo-count transfer.

While the alignment accuracy is expected to be higher for the iterated alignments, so is also the response time. Every round of iteration can take up to three times longer than the initial alignment. For example, if the initial alignment took 1 min, then an alignment that was iterated twice might take up to 7 min; that is, 1 min for the initial alignment, and then two times 3 min for the iterations.

Unfortunately, experience has shown that the alignment accuracy cannot be improved indefinitely by iteration. Good improvements are often achieved with one or two iterations (Sievers & Higgins, 2014).

Let us compare the standard and iterated alignments of the 10_GDP_eucar_prokar sequences:

paste 10_GDP_eucar_prokar_cluoAln2iter.aln 10_GDP_eucar_prokar_cluOp2p.aln

$Clustal\Omega p2p vs. iteration alignment of 10_GDP_eucar_prokar sequences with Kimura distances.$

$Clustal\Omega$ p2p vs. iteration alignment of 10_GDP_eucar_prokar sequences with Kimura distances.

As shown above, there are minimal differences in the p2p vs. the iteration alignment with Kimura (corrected) distances performed on the full set of 10 GDP sequences. Can you spot them?

Based on the high similarity of both $Clustal\Omega$ alignments, we might conclude that the p2p and the full alignment with two iterations are better than the s2p alignment presented in a previous section.

5.7 CDS alignment of leuA-Bacillales.fas using $clustalo$

As discussed previously, CDSs should be aligned using their aligned translation products as a guide. Here is how to do it using $clustal\Omega$:

# 1. Translate the leuA-Bacillales.fas file with the aid of the translate_fastas.pl Perl script
translate_fastas.pl -e fas -t 11 

# 2. Remove the trailing *'s resulting from the translation of stop codons present in the DNA sequences
sed 's/\*$//' Bacillales_translated.faa > k && mv k Bacillales_translated.faa

# 3. Align the tranlation products keeping the input order
clustalo -i leuA-Bacillales_translated.faa -o leuA-Bacillales_translated_cluo.faa --output-order input-order --threads 4

# 4 Use the prot2cdnAln.pl Perl script to generate the codon alignment using the protein alignment as a guide
prot2cdnAln.pl leuA-Bacillales.fas leuA-Bacillales_translated_cluo.faa

5.7.1 Visualize the codon alignment with seaview

Finally, we will visualize the resulting codon alignment of the leuA CDSs at the DNA and protein levels.

seaview  leuA-Bacillales_cdnaln.fas &

$Clustal\Omega codon alignment of leuA CDSs.$

$Clustal\Omega$ codon alignment of leuA CDSs.

$Clustal\Omega codon alignment of leuA CDSs viewed as proteins.$

$Clustal\Omega$ codon alignment of leuA CDSs viewed as proteins.

5.8 Multiple sequence alignment using an external profile HMM

As previously introduced, a key feature of $Clustal\Omega$ is that it can take an external profile hidden Markov model of a protein family, and use it to aid in the alignment highly divergent members.

As mentioned by Sievers & Higgins (2014), the main problem with progressive multiple sequence alignment is that alignments that were made during the early stages of the MSA cannot be changed at a later stage. This has been summarized by the phrase “once a gap, always a gap”. Therefore, it would be very advantageous if the algorithm could be provided with some degree of ‘foresight’ to know in advance which alignment columns are conserved and important in the alignment (Sievers & Higgins, 2014). Some foresight can be communicated to $Clustal\Omega$ in the form of an external, previously contstructed profile hidden Markov model (profile HMM) of the protein family to be aligned.

The use of profile HMMs is particularly natural within $Clustal\Omega$, as its main alignment algorithm is based on alignment of profile HMMs (Söding, 2005; Sievers 2011). A profile HMM can be supplied to $Clustal\Omega$ as an ‘external profile’. The alignment of the unaligned residues will be guided by the distribution of residues and gaps in the HMM. Residues of the unaligned sequences will be aligned to the positions within the HMM that best fit the signature, and this is used to align those residues more accurately.

Obviously, the accuracy of the HMM will affect the quality of the HMM-guided MSA. However, experience has shown that external profile HMM information is rarely detrimental (Sievers et al., 2011). Use of HMMs as external profiles is therefore very robust. Its use is most helpful in the alignment of highly diverged sequences with low average pairwise sequence similarity. Such scenarios frequently lead to early misalignment and/or gaps, which can be partly avoided using background information provided by the HMM.

This procedure can be used if an existing multiple alignment of a set of sequences that has been carefully maintained is available. It can then be converted to a HMM and used to help align new sequences. Alternatively, HMMs can be downloaded from alignment databases such as PFAM. This procedure is also used to iterate the multiple alignment process, as described in the previous section.

5.8.1 Alignment of 146 human Ras GTPases using the PF00071 HMM

As a final exercise, we will align the classic set of 146 human Ras GTPases compiled by Wennerberg et al. (2005) with the aid of the PFAM00071.hmm model for the Ras superfamily, fetched from PFAM. Note that on the chaac server the alignment takes ~5 seconds to complete using 4 cores, which is almost twice the time it takes to align under the default $clustalo$ parameterization, as demonstrated below:

# default clustalo alignment
time clustalo -i Human_Ras_Superfam_proteins_Wennerberg_JCS05.faa -o Human_Ras_Superfam_proteins_Wennerberg_JCS05_cluo.afa --outfmt fasta --threads 4

# clustalo using the Ras_fam_PF00071.hmm as HMM input
time clustalo -i Human_Ras_Superfam_proteins_Wennerberg_JCS05.faa --hmm-in Ras_fam_PF00071.hmm -o Human_Ras_Superfam_proteins_Wennerberg_JCS05_PF00071.afa --outfmt fasta --threads 4

# default clustalo alignment
real    0m2.912s
user    0m7.995s
sys 0m0.136s

# clustalo using the Ras_fam_PF00071.hmm as HMM input
real    0m5.032s  <== just ~ 5 seconds with 4 threads
user    0m14.801s
sys 0m0.151s

Below you’ll see a screenshot of the central part of the HMM-guided alignment, the region modeled by the PFAM00071 profile, for a fraction of the 146 sequences, and displayed with $seaview$.

# visualize the PF00071.hmm-assisted alignment
seaview Human_Ras_Superfam_proteins_Wennerberg_JCS05_PF00071.afa &

Ras superfamily clustal-omega alignment using PFAM00071 as external profile HMM.

The alignment shown above was used to compute the maximum likelihood tree for the Ras superfamily presented below (Vinuesa, unpubl.)

Maximum likelihood phylogeny of 146 human Ras superfamily members. Vinuesa unpubl.

In the HMMER section that follows, we will focus on the Rab family of small GTPases, the largest one within the Ras superfamily.

5.8.2 Motif and logo analysis of highly conserved alignment regions of the Ras superfamily and their extraction using $seaview$

$Seaview$ can be used to easily extract certain alignment regions or columns under ‘Sites > Create Set > OK’. Then left-click on the white ‘X’ label immediately adjacent to the first conserved alignment column that you would like to keep, followed by a right click on the preceding ‘X’. This will exclude the N-terminal region flanking your region of interest (ROI). Next, move to C-terminal part of your ROI and repeat the left- and right-clicks to exclude the non-desired C-terminal region. Then type CTRL-A to select all sequences and save them under ‘Save > Save selection’. Note that you can un-select particular sequences before saving them to disk by left-clicking on their IDs.

The extracted alignment blocks can then be used to compute sequence logos and simple profiles that summarize the residue frequency distribution in such blocks using ad hoc Perl and R scripts, or Web-servers such as WebLogo 3.

Below the logo and motif representations of the first highly conserved (G1) region of Ras GTPases, including non-gaped flanking columns.

Alignment and sequence logo of the highly conserved G1 signature region of human Ras GTPases.

A simpler, less informative way to represent these highly conserved signatures is by a simple sequence motif summarizing the pattern, as shown below:

# sequence motif for the highly conserved G1 region of Ras GTPases with max. degeneracy set to 4
xxxxxxxx[GAE]xxxx[GAV][KT][TSGK]xxxxxxxxx

5.9 Clustal exercise

Perform the following tasks, documenting the commands used, and the results obtained, in a fashion similar to that used in previous sections of this tutorial.

Important note: all $clustalo$ alignments should be run with four threads using option –threads 4

5.9.1 16S rDNA sequence to profile alignment

Get the file 16S_problema.fna used in the BLASTN tutorial, and use the $fasta\_toolkit.awk$ script to determine their sequence lengths.

Generate a new 16S_problema_len_filt.fna file containing only sequences >= 450 bp.
How many sequences are contained the original file? How many are found in 16S_problema_len_filt.fna?

Use $lustal\Omega$ in sequence-to-profile alignment mode to align the 16S_problema_len_filt.fna sequences to the 16S_reference_profile.fst profile, which you can find on chaac here: /export/space3/users/vinuesa/DBs/profiles

5.9.2 Alignment of rpoB CDS sequences from Enterobacteriaceae

Explore and optimize the FASTA header (reducing redundancy, for optimal information content) for the rpoB_enterics.fasta file, which you can get from /export/space3/users/vinuesa/DBs/sequences on chaac.
translate the sequences and align them at the protein level
Make a [codon alignment]#(CDS_alignment) using the helper scripts provided in this tutorial for this task.

5.9.3 Finding highly conserved motifs in Ras GTPases and their representation as sequence logos

Read the Wennenberger et al. 2015 short review paper on the Ras GTPase superfamily J Cell Sci. 2005 Mar 1;118(Pt 5):843-6. doi:10.1242/jcs.01660..

What is the functional role of the conserved G motifs of Ras GTPases?
Use $seaview$ to explore the two alignments that we generated for the Human_Ras_Superfam_proteins_Wennerberg_JCS05.faa set, locate the five G motifs described in the paper, and extract them as ‘regions’ along with any non-gaped columns flanking them, saving each of these regions to a file.
Use the WebLogo app to create sequence logos for each of the five G motifs and non-gaped flanking columns. Compile all five logos into a single image, properly label each panel, provide a short but informative description, and save the final figure in PDF format.

6 Profile hidden Markov models, and HMMER3

Pairwise sequence comparison methods such as BLASTP and FASTA assume that all amino acid positions are equally important. However, it is well known that different residues in a functional sequence are subject to varying selective pressures.

We have clearly observed these patterns of variable conservation across columns in the multiple sequence alignments that we have computed. Some positions are more conserved than others, and some regions seem to tolerate insertions and deletions more than other regions.

Therefore, it would be desirable to use position-specific information from multiple alignments when searching databases for homologous sequences, or aligning highly divergent sequences.

Profile HMMs turn a multiple sequence alignment into a position-specific scoring system suitable for searching databases for remotely homologous sequences Eddy, (1998), and guiding the calculation of MSAs. Profile HMMs integrate the positions-specific scores for each amino acid along with its insertion and deletion scores, and transition scores between the match (M1-MX), insertion (I0-IX) and deletion (D1-DX) states, as summarized in the Figure shown below:

Steps for converting a MSA into a profile hidden Markov model.(https://www.ebi.ac.uk/training/online/courses/pfam-creating-protein-families/what-are-profile-hidden-markov-models-hmms/ ).

HMMER is a software package that implements many programs to build and use profile hidden Markov models (profile HMMs) for different bioinformatic tasks, including searching sequence databases for remote sequence homologs, aligning multiple sequences, and scanning a sequence for the presence of certain domains modeled as HMMs (Eddy, 2011).

hmmer.org

6.1 HMMER3 tutorial overview

The HMMER v3.4 package contains 18 programs, plus a set of auxiliary tools.

The programs $hmmbuild$, $hmmsearch$, $hmmscan$, and $hmmalign$ provide the core functionality for protein domain analysis and annotation pipelines, for instance using profile databases like Pfam.

This tutorial will focus on these four programs, used for the the following tasks:

Compute a profile HMM using $hmmbuild$ from a given reference (seed) alignment
- Calibrate the new HMM using Swiss-Prot proteins to determine a suitable score cutoff
Find Rab homologs in the Acanthamoeba castellanii proteome using $hmmsearch$ with the cutoff scores determined during the calibration phase
Extract the $hmmsearch$ hits and align them to a reference profile HMM using $hmmalign$
Given a query protein, interrogate a HMM library using $hmmscan$ to determine/annotate the domain composition of the protein

6.2 Preliminaries: preparing the working directory

Save the following paths in variables for easy access to the sequences and databases required for running the exercises on the chaac server.

Navigate into the sesion3_clustal_y_hmmer directory, and from therein execute the following lines of code:

# 1. Change dir into your desired location on chaac and make a new directory for this exercise 
# i) Change directory into your working directory and make the following new directory to hold the sequence data for the exercises
if [ ! -d HMMER ]; then
    mkdir HMMER && cd HMMER
else
   cd HMMER
fi 

# 2. save the following paths in variables for easy access to the sequences and databases
base_dir=/export/space3/users/vinuesa/DBs
sp_db="${base_dir}"/Swiss-Prot
pfam_db="${base_dir}"/Pfam
euk_proteomes="${base_dir}"/euk_proteomes
sequences="${base_dir}"/sequences

# make symlinks from your working directory to the sequence data to avoid copying large files
ln -s  $euk_proteomes/*faa .
ln -s $sequences/Rab_SPhyd94.faa .

6.3 Build a Rab family-specific HMM, calibrate the model using swiss-prot reference proteins, and find all small Rab GTPases encoded in the Acanthamoeba castellanii proteome

This exercise will focus on Rab GTPases from the free-living amoeba Acanthamoeba castellanii (Rivera et al. 2024), the largest family within the superfamily of Ras GTPases (Wennenberger et al. 2005) presented in an earlier section.

The goal is to construct a Rab GTPase-specific HMM, and calibrate the model to identify all bona fide Rab GTPase homologs in a given proteome with high specificity and sensitivity.

This exercise will guide you through some of the basic steps commonly performed to generate a custom HMM to search for homologs, which typically involve:

Computing an alignment from a collection of bona fide homologs representative of the evolutionary diversity of your target family (seed sequences), in this case Rab GTPases from human, yeast, and Dictyostelium
Building an HMM from the seed alignment using $hmmbuild$
Searching a database of curated sequences like Swiss-Prot with $hmmsearch$ to define a suitable score cutoff value
Searching the proteome(s) of your interest with $hmmsearch$, using the previously defined cutoff value to select bona fide hits
Extracting the hit sequences for further downstream analyses
Adding the new homologs to the collection of seed sequences, and constructing a new HMM based on the full alignment

6.3.1 What are Rab GTPases

Rab proteins constitute the largest family within the Ras superfamily, as shown previously on the tree of the Ras superfamily, with approximately 70 members identified in humans. Rab GTPases are the primary regulators of the vesicular trafficking pathways involved in transporting the vast array of cellular cargo across membrane organelles (Stenmark, 2009). The Rabs function as molecular switches in regulating the formation, transport, docking, and fusion of transport vesicles during membrane transport, being active when bound with GTP. In order to be able to associate with organellar membranes, Rabs get prenylated (lipidated) at their C-termini. Below a scheme depicting the cellular locations of specific Rabs.

Rab GTPases as regulator of intracellular membrane trafficking; https://www.mbi.nus.edu.sg/mbinfo/what-are-rab-gtpases/

6.3.2 Align 94 bona-fide Rab GTPases from human, yeast and dicty

The Rab_SPhyd94.faa FASTA file contains Swiss-Prot entries for bona fide Rab GTPases from human, yeast and Dictyostelium.

Before proceeding with the alignment, let us get some notion about the sequences in the Rab_SPhyd94.faa FASTA file by obtaining the following information:

6.3.2.1 Sequence length distribution of proteins in the Rab_SPhyd94.faa FASTA file

We will obtain the following information: - Number of sequences in the file - Summary statistics of sequence lengths

# Number of sequences in the file 
grep -c '^>' Rab_SPhyd94.faa

echo '=='

# Summary statistics of sequence lengths; Note requires fasta_toolkit.awk v0.6
fasta_toolkit.awk -R 6 Rab_SPhyd94.faa | cut -f2 | col_sumStats.sh

94
==
Min: 194
Mean: 220.457
Median: 213.5
Mode: 201,203
Max: 417

As summarized above, and confirmed by the histogram shown below, most canonical Rab GTPases are between 200 and 250 amino acids long.

# R code to plot the length histogram with summary statistics shown below 
# https://rdrr.io/cran/Hmisc/man/Overview.html
options(Hverbose=FALSE)
library(Hmisc)

# read data
dfr <- read.table(file="Rab_SPhyd94_lengths.tsv", header = FALSE, sep = "\t")

# explore the structure of the object dfr
str(dfr)

# compute summary statistics
max_len=max(dfr$V2)
min_len=min(dfr$V2)
max_len_extended=max_len+(max_len*0.2)
avg <- round(mean(dfr$V2),1)
med <- round(median(dfr$V2),1)

# plot using the base system
titl=paste0("Length density:\nmean=", avg, "; median=", med, "; min=", min_len, "; max=", max_len)
hist(dfr$V2, breaks = "Freedman-Diaconis", xlim = c(0, max_len_extended), probability=TRUE, xlab="length", main=titl)
minor.tick(nx=2, tick.ratio=0.5) # from Hmisc package
abline(v = median(dfr$V2), col = "blue", lwd = 3)
abline(v = mean(dfr$V2), col = "red", lwd = 3)
lines(density(dfr$V2), col = 'green', lwd = 3)

Histogram for sequence lengths of 94 canonical Rab GTPases stored in Rab_SPhyd94.faa

6.3.2.2 Confirmation that the reference Rab sequences contain at least one cysteine among the five C-terminar residues, required as prenylation sites

As a final check of our reference Rab GTPases, let us extract the final five C-terminal residues of each sequence to verify that this region contains at least one cysteine residue, used as prenylation (lipidation) site, required by Rab proteins to insert into the membranes of the vesicles that they associate with to control their intracellular trafficking.

We will make use of the ad hoc Perl script extract_N-or_C-terminal_regions_and_compute_counts.pl designed for this task. The script will extract the indicated number of C-terminal residues, count the instances of a user-specified amino acid in this region, and compute its over-all amino acid frequencies.

# print the script's help
extract_N-or_C-terminal_regions_and_compute_counts.pl

Usage for extract_N-or_C-terminal_regions_and_compute_counts.pl v0.1_2024-11-01:

extract_N-or_C-terminal_regions_and_compute_counts.pl --fasta  --output  --residues  --region  --count_residue 

Options:
    --fasta|-f          Input multi-FASTA file of protein sequences
    --output|-o         Output file prefix (for generated FASTA and tabular files)
    --residues|-n       Number of N-terminal or C-terminal residues to extract
    --region|-r         Region to extract: 'N' for N-terminal or 'C' for C-terminal
    --count_residue|-a  Residue to count in the extracted region (e.g., M for methionine, C for cysteine)
    --help              Show this help message

AIM:
   Extracts a specified number of N- or C-terminal residues from each protein sequence in a multi-FASTA file,
     computes basic overall residue frequencies in the extracted region, and counts instances of a focal amino acid. 
   This is useful, for example, to verify and analyze the distribution of cysteines found among
     the five C-terminal residues of Rab GTPases

OUTPUT:
    multi-FASTA with the extracted N- or C-terminal regions
    Three-col table with the seq_id, sequence, counts for user-defined residue in the extracted sequence
    Matrix with overall proportions of each amino acid at each site of the extracted sequences

# Extract the five C-terminal residues of each sequence, and count the number of cysteines (C's) in that region
extract_N-or_C-terminal_regions_and_compute_counts.pl -f Rab_SPhyd94.faa -o Rab_SPhyd94_5CtCs -n 5 -r C -a C

# explore the header of the resulting Rab_SPhyd94_5CtCs_Cterm.tsv file
cat Rab_SPhyd94_5CtCs_Cterm.tsv | head -5

# print any sequences do not have at least one C among its five terminal residues on its C-terminus
awk -F "\t" '$3 == 0' Rab_SPhyd94_5CtCs_Cterm.tsv

sp|P51159|RB27A_HUMAN   GACGC   2
sp|Q9H082|RB33B_HUMAN   MTCWC   2
sp|P20336|RAB3A_HUMAN   QDCAC   2
sp|P20337|RAB3B_HUMAN   QNCSC   2
sp|Q96E17|RAB3C_HUMAN   PNCAC   2

As expected, all sequences have at least one cysteine among their five C-terminal residues.

We can use the Rab_SPhyd94_5CtCs_Cterm.fasta file generated by the script to generate a sequence logo with WebLogo3, as shown below:

Sequence logo for the five C-terminal residues of 94 canonical Rab GTPases stored in Rab_SPhyd94.faa

As shown by the sequence logo above, cysteines are the most prevalent amino acids among the last three residues at the C-termini of canonical Rab GTPases.

6.3.2.3 Computing a multiple sequence alignment of the Rab reference sequences using $clustal\Omega$.

After having verified that we have bona fide Rab GTPases in our FASTA file, let us align the sequences using $clustal\Omega$ with the following command:

clustalo -i Rab_SPhyd94.faa -o Rab_SPhyd94_cluoAln.afa --threads 4

6.3.3 Build the profile HMM with $hmmbuild$

We will use the multiple sequence alignment of reference Rab GTPases computed in the previous step to build a profile HMM from this seed alignment with $hmmbuild$. The following line generates the Rab.hmm profile hidden Markov model from the input alignment Rab_SPhyd94_cluoAln.afa

hmmbuild -h
# hmmbuild :: profile HMM construction from multiple sequence alignments
# HMMER 3.4 (Aug 2023); http://hmmer.org/
# Copyright (C) 2023 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Usage: hmmbuild [-options] <hmmfile_out> <msafile>
# ... truncated
hmmbuild Rab.hmm Rab_SPhyd94_cluoAln.afa

6.3.3.1 Exploring HMMER’s profile model files

Let us inspect the Rab.hmm file generated by $hmmbuild$:

head -25 Rab.hmm

HMMER3/f [3.4 | Aug 2023]
NAME  Rab_SPhyd94_cluoAln
LENG  344
ALPH  amino
RF    no
MM    no
CONS  yes
CS    no
MAP   yes
DATE  Sun Oct 27 18:43:28 2024
NSEQ  94
EFFN  3.892761
CKSUM 2257599252
STATS LOCAL MSV      -11.4380  0.70024
STATS LOCAL VITERBI  -12.1821  0.70024
STATS LOCAL FORWARD   -5.3353  0.70024
HMM          A        C        D        E        F        G        H        I        K        L        M        N        P        Q        R        S        T        V        W        Y   
            m->m     m->i     m->d     i->m     i->i     d->m     d->d
  COMPO   2.50654  4.17120  2.93335  2.68440  3.33488  2.88766  3.80488  2.86641  2.56945  2.55827  3.70354  3.03682  3.48961  3.03847  2.92385  2.53989  2.73769  2.67286  4.52173  3.54948
          2.68618  4.42225  2.77520  2.73124  3.46354  2.40513  3.72495  3.29354  2.67741  2.69355  4.24649  2.90347  2.73740  3.18147  2.89801  2.37887  2.77520  2.98519  4.58477  3.61503
          0.10610  2.44147  4.29506  0.48460  0.95697  0.00000        *
      1   2.72646  4.32781  3.85243  3.37720  3.17928  3.69271  4.20027  2.11813  3.17844  1.78667  2.53735  3.68219  4.14144  3.55450  3.43575  3.07210  3.02843  1.90272  5.00825  3.74883      2 l - - -
          2.68618  4.42225  2.77519  2.73123  3.46354  2.40513  3.72494  3.29354  2.67741  2.69355  4.24690  2.90347  2.73739  3.18146  2.89801  2.37887  2.77519  2.98518  4.58477  3.61503
          0.04261  3.57272  4.29506  0.61958  0.77255  0.48576  0.95510
      2   2.36474  4.18722  3.26296  2.91549  3.49111  3.14356  3.88857  2.38634  2.80576  2.55101  3.61050  3.21077  3.75399  3.21637  3.09572  2.16970  2.74516  2.43597  5.02404  3.68037      3 s - - -
          2.68618  4.42225  2.77519  2.73123  3.46354  2.40513  3.72494  3.29354  2.67741  2.69355  4.24690  2.90347  2.73739  3.18146  2.89801  2.37887  2.77519  2.98518  4.58477  3.61503
          0.04261  3.57272  4.29506  0.61958  0.77255  0.48576  0.95510
... truncated
    343   2.81986  4.34111  3.91257  3.53906  3.21715  3.73397  4.32616  1.48544  3.35834  1.89221  3.22770  3.81877  4.21260  3.74827  3.58392  3.22433  3.13444  1.91249  5.03377  3.73193    609 i - - -
          2.68618  4.42225  2.77519  2.73123  3.46354  2.40513  3.72494  3.29354  2.67741  2.69355  4.24690  2.90347  2.73739  3.18146  2.89801  2.37887  2.77519  2.98518  4.58477  3.61503
          0.04086  3.61383  4.33618  0.61958  0.77255  0.58636  0.81272
    344   2.22260  4.12679  3.00636  2.82925  3.81081  2.86240  3.93575  3.30649  2.88075  3.05674  4.02473  3.08679  3.58300  3.25901  3.16161  1.57168  2.68606  2.90530  5.21115  3.89650    610 s - - -
          2.68618  4.42225  2.77519  2.73123  3.46354  2.40513  3.72494  3.29354  2.67741  2.69355  4.24690  2.90347  2.73739  3.18146  2.89801  2.37887  2.77519  2.98518  4.58477  3.61503
          0.02769  3.60066        *  0.61958  0.77255  0.00000        *
//

Each profile HMM file starts with a format version identifier (here, HMMER3/f), and ends with // on a line by itself, which allows multiple profiles to be concatenated.

The format is divided into two regions: 1. The first region contains textual information and miscellaneous parameters in a roughly tag-value scheme. This section ends with a line beginning with the key-word HMM. 2. The second region is a tabular, white space-limited format for the main model parameters.

All probability parameters are all stored as negative natural log probabilities with five digits of precision to the right of the decimal point, rounded. For example, a probability of 0.25 is stored as −log 0.25 = 1.38629. The special case of a zero probability is stored as ’*’.

Main model section

The first two lines in the main model section are atypical. They contain information for the core model’s BEGIN node.

The remainder of the model has three lines per node, for M nodes (where M is the number of match states, as given by the LENG line). These three lines are (K is the alphabet size in residues): * Match emission line * Insert emission line * State transition line, where m = match, d = delete, i = insert.

Finally, the last line of the format is the “//” record separator.

For the details, consult the HMMER3.4 User guide.

6.3.4 HMM calibration

After computing the HMM for the Rab family from the seed alignment, we need to calibrate the model searching for bona fide homologs in the manually-curated Swiss-Prot database to define a proper score cutoff value.

We will use $hmmsearch$ for this search task, using our Rab.hmm model to interrogate the Swiss-Prot database of manually curated proteins.

hmmsearch -h
# hmmsearch :: search profile(s) against a sequence database
# HMMER 3.4 (Aug 2023); http://hmmer.org/
# Copyright (C) 2023 Howard Hughes Medical Institu6.6e-47  165.5te.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Usage: hmmsearch [options] <hmmfile> <seqdb>

# We will use the following options; use hmmsearch -h to get the full documentation 
# --tblout <f>     : save parseable table of per-sequence hits to file <f>
# --noali          : don't output alignments, so output is smaller
# --notextw        : unlimit ASCII text output line width
# ... truncated

hmmsearch --tblout sprot_Rab_hits_hmmsearch.out --noali --notextw Rab.hmm "${sp_db}"/uniprot_sprot.fasta

Explore with $head$ and $tail$ the structure of the output file sprot_Rab_hits_hmmsearch.out generated by $hmmsearch$:

head sprot_Rab_hits_hmmsearch.out && echo '==' && tail -15 sprot_Rab_hits_hmmsearch.out

#                                                                --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name         accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
# ------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
sp|Q8WXH6|RB40A_HUMAN -          Rab_SPhyd94_cluoAln  -           8.9e-110  372.1   0.0    1e-109  371.9   0.0   1.0   1   0   0   1   1   1   1 Ras-related protein Rab-40A OS=Homo sapiens GN=RAB40A PE=2 SV=2
sp|P0C0E4|RB40L_HUMAN -          Rab_SPhyd94_cluoAln  -           1.4e-109  371.5   0.0  1.5e-109  371.3   0.0   1.0   1   0   0   1   1   1   1 Ras-related protein Rab-40A-like OS=Homo sapiens GN=RAB40AL PE=1 SV=1
sp|Q12829|RB40B_HUMAN -          Rab_SPhyd94_cluoAln  -           1.1e-108  368.5   0.0  1.3e-108  368.3   0.0   1.0   1   0   0   1   1   1   1 Ras-related protein Rab-40B OS=Homo sapiens GN=RAB40B PE=2 SV=1
sp|Q8VHQ4|RB40C_MOUSE -          Rab_SPhyd94_cluoAln  -           9.9e-105  355.5   0.0  1.1e-104  355.4   0.0   1.0   1   0   0   1   1   1   1 Ras-related protein Rab-40C OS=Mus musculus GN=Rab40c PE=2 SV=1
sp|Q96S21|RB40C_HUMAN -          Rab_SPhyd94_cluoAln  -           1.1e-104  355.4   0.0  1.2e-104  355.3   0.0   1.0   1   0   0   1   1   1   1 Ras-related protein Rab-40C OS=Homo sapiens GN=RAB40C PE=1 SV=1
sp|Q8VHP8|RB40B_MOUSE -          Rab_SPhyd94_cluoAln  -           5.4e-104  353.1   0.0    6e-104  352.9   0.0   1.0   1   0   0   1   1   1   1 Ras-related protein Rab-40B OS=Mus musculus GN=Rab40b PE=2 SV=1
==
sp|B2SYV2|OBG_BURPP   -          Rab_SPhyd94_cluoAln  -                8.1   10.6   0.0        15    9.6   0.0   1.4   2   0   0   2   2   1   0 GTPase Obg OS=Burkholderia phytofirmans (strain DSM 17436 / PsJN) GN=obg PE=3 SV=1
sp|Q2WBH0|MNME_MAGSA  -          Rab_SPhyd94_cluoAln  -                8.5   10.5   0.0   3.5e+02    5.2   0.0   2.1   2   0   0   2   2   2   0 tRNA modification GTPase MnmE OS=Magnetospirillum magneticum (strain AMB-1 / ATCC 700264) GN=mnmE PE=3 SV=2
sp|P27584|GPA1_SCHPO  -          Rab_SPhyd94_cluoAln  -                8.6   10.5   1.3        11   10.1   0.4   1.6   2   0   0   2   2   1   0 Guanine nucleotide-binding protein alpha-1 subunit OS=Schizosaccharomyces pombe (strain 972 / ATCC 24843) GN=gpa1 PE=3 SV=4
sp|Q9KA77|IF2_BACHD   -          Rab_SPhyd94_cluoAln  -                8.7   10.4   0.0       8.7   10.4   0.0   2.2   3   0   0   3   3   1   0 Translation initiation factor IF-2 OS=Bacillus halodurans (strain ATCC BAA-125 / DSM 18197 / FERM 7344 / JCM 9153 / C-125) GN=infB PE=3 SV=1
sp|O67679|ENGB_AQUAE  -          Rab_SPhyd94_cluoAln  -                9.2   10.4   3.3        11   10.0   0.2   2.2   3   0   0   3   3   1   0 Probable GTP-binding protein EngB OS=Aquifex aeolicus (strain VF5) GN=engB PE=3 SV=1
#
# Program:         hmmsearch
# Version:         3.4 (Aug 2023)
# Pipeline mode:   SEARCH
# Query file:      Rab.hmm
# Target file:     uniprot_sprot.fasta
# Option settings: hmmsearch --tblout sprot_Rab_hits_hmmsearch.out --noali --notextw Rab.hmm uniprot_sprot.fasta 
# Current dir:     /home/vinuesa/Cursos/LCG/ENS_y_PANGENOMICA/tema10_alineamientos_multiples/practicas/HMMER
# Date:            Thu Oct 17 10:26:25 2024
# [ok]

It resembles the BLAST tabular output. The first column, the “target name”, contains the accession numbers of the hits, and the last one presents its full description.

The tabular output generated by the hmmsearch –tblout call contains 19 fields, which can be conveniently extracted and numbered with the following one-liner:

sed -n '2p' sprot_Rab_hits_hmmsearch.out | sed 's/# //; s/ name//g; s/description of target/description_of_target/' | perl -pe 's/\h+/\n/g' | nl

     1  target
     2  accession
     3  query
     4  accession
     5  E-value
     6  score
     7  bias
     8  E-value
     9  score
    10  bias
    11  exp
    12  reg
    13  clu
    14  ov
    15  env
    16  dom
    17  rep
    18  inc
    19  description_of_target

From a statistical point of view, we should examine the columns grouped under full sequence, best 1 domain, and domain number estimation sections.

The most relevant section for our purpose is that listed under full sequence, which integrates the scores and E-values of the multiple domains that a hit may have. Below is a list explaining the meaning of each column in each of these sections.

full sequence
- E-value, The expectation value (statistical significance) of the target. This is a per query E-value
- score, The score (in bits) for this target/query comparison. It includes the biased-composition correction
- bias, The score (in bits) for this target/query comparison. It includes the biased-composition correction
The best 1 domain provides statistics only for the domain with the lowest E-value and highest scores
- E-value, The E-value if only the single best-scoring domain envelope were found in the sequence, and none of the others
- score, The bit score if only the single best-scoring domain envelope were found in the sequence, and none of the others
- bias, The null2 bias correction that was applied to the bit score of the single best-scoring domain
The domain number estimation provides information on the expected (exp)
- exp, Expected number of domains, as calculated by posterior decoding on the mean number of begin states used in the alignment ensemble
- reg, Number of discrete regions defined, as calculated by heuristics applied to posterior decoding of begin/end state positions in the alignment ensemble
- clu, Number of regions that appeared to be multidomain, and therefore were passed to stochastic traceback clustering for further resolution down to one or more envelopes. This number is often zero
- ov, For envelopes that were defined by stochastic traceback clustering, how many of them overlap other envelopes
- env, The total number of envelopes defined, both by single envelope regions and by stochastic traceback clustering into one or more envelopes per region
- dom, Number of domains defined. In general, this is the same as the number of envelopes: for each envelope, we find an MEA (maximum expected accuracy) alignment, which defines the end-points of the alignable domain
- rep, Number of domains satisfying reporting thresholds. If you’ve also saved a –domtblout file, there will be one line in it for each reported domain
- inc, Number of domains satisfying inclusion thresholds

The hits are sorted by increasing E-value of the full sequence, which correlates with decreasing score.

Note that E-values are dependent on the size of the database, while scores are not! Therefore, it is preferable to use scores as model cutoffs.

To efficiently explore the long sprot_Rab_hits_hmmsearch.out file, we will $grep$ out the lines containing the human, dicty and Arabidopsis annotations, and simplify the output with some $AWK$ code.

# identify the highest alignment and domain E-values and scores to be used as cutoffs,
#   based on the human, dicty, and Arabidopsis Rab swiss-prot annotations
grep sapiens sprot_Rab_hits_hmmsearch.out | awk '{print $1,$5,$6,$19,$20,$21,$22,$23,$24,$25}' | grep -E 'Rab|RAB' | column -t # major score drop 165.8 -> 77.9; score 165
grep Arabidopsis sprot_Rab_hits_hmmsearch.out |  awk '{print $1,$5,$6,$19,$20,$21,$22,$23,$24,$25}' | grep -E 'Rab|RAB' | column -t # 1.7e-50  177.2
grep Dictyostelium sprot_Rab_hits_hmmsearch.out |  awk '{print $1,$5,$6,$19,$20,$21,$22,$23,$24,$25}' | grep -E 'Rab|RAB' | column -t # 4.9e-45  score 159

# grep sapiens sprot_Rab_hits_hmmsearch.out | awk '{print $1,$5,$6,$19,$20,$21,$22,$23,$24,$25}' | grep -E 'Rab|RAB' | column -t | tail
sp|Q969Q5|RAB24_HUMAN  1.7e-65   226.5  Ras-related  protein      Rab-24        OS=Homo  sapiens  GN=RAB24    PE=1
sp|P51157|RAB28_HUMAN  5e-65     225.0  Ras-related  protein      Rab-28        OS=Homo  sapiens  GN=RAB28    PE=1
sp|Q7Z6P3|RAB44_HUMAN  2.2e-61   213.0  Ras-related  protein      Rab-44        OS=Homo  sapiens  GN=RAB44    PE=3
sp|Q9NX57|RAB20_HUMAN  6.2e-53   185.2  Ras-related  protein      Rab-20        OS=Homo  sapiens  GN=RAB20    PE=1
------------------------------------------------------------------------------------------------------------------------ STRICT CUTOFF, ONLY CANONICAL RABs
sp|Q9UBK7|RBL2A_HUMAN  1.4e-47   167.6  Rab-like     protein      2A            OS=Homo  sapiens  GN=RABL2A   PE=2
sp|Q9UNT1|RBL2B_HUMAN  4.8e-47   165.8  Rab-like     protein      2B            OS=Homo  sapiens  GN=RABL2B   PE=2
------------------------------------------------------------------------------------------------------------------------ RELAXED CUTOFF, including RAB-like proteins, before large score drop.
sp|Q8N4Z0|RAB42_HUMAN  2.7e-20   77.9   Putative     Ras-related  protein       Rab-42   OS=Homo  sapiens     GN=RAB42
sp|Q5HYI8|RABL3_HUMAN  3.3e-16   64.5   Rab-like     protein      3             OS=Homo  sapiens  GN=RABL3    PE=1
sp|Q3YEC7|RABL6_HUMAN  2.8e-11   48.3   Rab-like     protein      6             OS=Homo  sapiens  GN=RABL6    PE=1
sp|Q9BU20|RSG1_HUMAN   7.7e-07   33.6   REM2-        and          Rab-like      small    GTPase   1           OS=Homo

Based on the output of the $grep$ command on the human hits shown above, a reasonably relaxed sequence score to find most Rab and Rab-like homologs, without including too many false positives, would lie around 165.

However, the Arabidopsis $hmmsearch$ results shown below indicate that using a cutoff below 177.2 will retrieve Ran proteins, which are nuclear GTPases involved in nucleocytoplasmic transport, participating both to the import and the export from the nucleus of proteins and RNAs. They belong to the Ras superfamily, but are not Rab GTPases, as shown on the Ras superfamily tree.

# grep Arabidopsis sprot_Rab_hits_hmmsearch.out | awk '{print $1,$5,$6,$19,$20,$21,$22,$23,$24,$25}' | column -t 
... truncated
sp|Q9C820|RAG3D_ARATH  1e-73    253.5  Ras-related       protein             RABG3d       OS=Arabidopsis  thaliana        GN=RABG3D       PE=2
sp|Q9SJ11|RABG2_ARATH  3.9e-73  251.6  Ras-related       protein             RABG2        OS=Arabidopsis  thaliana        GN=RABG2        PE=2
sp|Q9SF92|RAC2B_ARATH  5.4e-72  247.9  Ras-related       protein             RABC2b       OS=Arabidopsis  thaliana        GN=RABC2B       PE=2
sp|Q9LW76|RAG3C_ARATH  5.9e-72  247.7  Ras-related       protein             RABG3c       OS=Arabidopsis  thaliana        GN=RABG3C       PE=2
sp|O04157|RAG3B_ARATH  8.4e-72  247.2  Ras-related       protein             RABG3b       OS=Arabidopsis  thaliana        GN=RABG3B       PE=1
sp|Q9LV79|RAH1A_ARATH  1.1e-71  246.9  Ras-related       protein             RABH1a       OS=Arabidopsis  thaliana        GN=RABH1A       PE=2
sp|Q948K8|RAG3A_ARATH  1.5e-71  246.5  Ras-related       protein             RABG3a       OS=Arabidopsis  thaliana        GN=RABG3A       PE=2
sp|Q9XI98|RAG3E_ARATH  2e-71    246.0  Ras-related       protein             RABG3e       OS=Arabidopsis  thaliana        GN=RABG3E       PE=2
sp|Q948K6|RABG1_ARATH  1.7e-65  226.5  Ras-related       protein             RABG1        OS=Arabidopsis  thaliana        GN=RABG1        PE=2
----------------------------------------------------------------------------------------------------------------------------------------------------- Strong score drop
sp|Q9SJZ5|RAA4E_ARATH  1.7e-50  177.2  Putative          Ras-related         protein      RABA4e          OS=Arabidopsis  thaliana        GN=RABA4E
----------------------------------------------------------------------------------------------------------------------------------------------------- Ran and Rac hits if score < 163
sp|Q9FLQ3|RAN4_ARATH   5e-46    162.5  GTP-binding       nuclear             protein      Ran-4           OS=Arabidopsis  thaliana        GN=RAN4
sp|P41916|RAN1_ARATH   7.8e-46  161.9  GTP-binding       nuclear             protein      Ran-1           OS=Arabidopsis  thaliana        GN=RAN1
sp|P41917|RAN2_ARATH   1.2e-45  161.3  GTP-binding       nuclear             protein      Ran-2           OS=Arabidopsis  thaliana        GN=RAN2
sp|Q8H156|RAN3_ARATH   1.4e-45  161.0  GTP-binding       nuclear             protein      Ran-3           OS=Arabidopsis  thaliana        GN=RAN3
sp|Q9XGU0|RAC9_ARATH   3.7e-45  159.6  Rac-like          GTP-binding         protein      ARAC9           OS=Arabidopsis  thaliana        GN=ARAC9
sp|Q38912|RAC3_ARATH   5.2e-45  159.1  Rac-like          GTP-binding         protein      ARAC3           OS=Arabidopsis  thaliana        GN=ARAC3
sp|O82480|RAC7_ARATH   6.4e-45  158.9  Rac-like          GTP-binding         protein      ARAC7           OS=Arabidopsis  thaliana        GN=ARAC7
sp|P92978|RAC11_ARATH  7.6e-45  158.6  Rac-like          GTP-binding         protein      ARAC11          OS=Arabidopsis  thaliana        GN=ARAC11
... truncated

Careful examination of the human and Arabidopsis outputs shown above, suggests that a score cutoff >= 185 would retrieve bona fide Rab GTPases, excluding the non-canonical, putative Rab-like proteins, and non-Rab members of the Ras superfamily.

We will evaluate the robustness and suitability of the >= 185 score cutoff by analyzing also the Dictyostelium and yeast output files:

# grep Dictyostelium sprot_Rab_hits_hmmsearch.out | awk '{print $1,$5,$6,$19,$20,$21,$22,$23,$24,$25}' | column -t
... truncated 
sp|Q54SV1|RABY_DICDI   7.7e-55  191.5  Ras-related       protein                   RabY        OS=Dictyostelium  discoideum        GN=rabY           PE=3
sp|Q55EF0|RABT2_DICDI  7.8e-55  191.5  Ras-related       protein                   RabT2       OS=Dictyostelium  discoideum        GN=rabT2          PE=3
sp|Q86JC8|RABH_DICDI   2.9e-54  189.6  Ras-related       protein                   RabH        OS=Dictyostelium  discoideum        GN=rabH           PE=3
sp|Q55EG6|RABT1_DICDI  4.4e-54  189.0  Ras-related       protein                   RabT1       OS=Dictyostelium  discoideum        GN=rabT1          PE=3
-----------------------------------------------------------------------------------------------------------------------------------------------------------/ score cutoff >= 185
sp|P32254|RASS_DICDI   1.8e-52  183.7  Ras-like          protein                   rasS        OS=Dictyostelium  discoideum        GN=rasS           PE=2
sp|Q54FK7|RABL_DICDI   6.1e-51  178.7  Ras-related       protein                   RabL        OS=Dictyostelium  discoideum        GN=rabL           PE=3
sp|P15064|RASG_DICDI   1.4e-49  174.2  Ras-like          protein                   rasG        OS=Dictyostelium  discoideum        GN=rasG           PE=1
sp|P32252|RASB_DICDI   3e-49    173.1  Ras-like          protein                   rasB        OS=Dictyostelium  discoideum        GN=rasB           PE=1
sp|P18613|RAPA_DICDI   3.9e-49  172.7  Ras-related       protein                   rapA        OS=Dictyostelium  discoideum        GN=rapA           PE=1
sp|P03967|RASD_DICDI   9e-49    171.5  Ras-like          protein                   rasD        OS=Dictyostelium  discoideum        GN=rasD           PE=2
sp|P32253|RASC_DICDI   1.1e-48  171.3  Ras-like          protein                   rasC        OS=Dictyostelium  discoideum        GN=rasC           PE=2
sp|Q55CB7|RASY_DICDI   1.2e-48  171.1  Ras-like          protein                   rasY        OS=Dictyostelium  discoideum        GN=rasY           PE=3
sp|Q55BW0|RAPC_DICDI   1.8e-48  170.5  Ras-related       protein                   rapC        OS=Dictyostelium  discoideum        GN=rapC           PE=3
sp|Q54FK5|RABN1_DICDI  9.5e-48  168.1  Ras-related       protein                   RabN1       OS=Dictyostelium  discoideum        GN=rabN1          PE=3
sp|Q55CC0|RASW_DICDI   1.1e-47  167.9  Ras-like          protein                   rasW        OS=Dictyostelium  discoideum        GN=rasW           PE=3
sp|Q55CB8|RASX_DICDI   1.5e-47  167.5  Ras-like          protein                   rasX        OS=Dictyostelium  discoideum        GN=rasX           PE=3
sp|Q55CA9|RASZ_DICDI   6.4e-47  165.4  Ras-like          protein                   rasZ        OS=Dictyostelium  discoideum        GN=rasZ           PE=3
sp|Q55CB0|RASU_DICDI   1.7e-46  164.0  Ras-like          protein                   rasU        OS=Dictyostelium  discoideum        GN=rasU           PE=3
sp|Q54FG4|RABP_DICDI   2.8e-46  163.3  Ras-related       protein                   RabP        OS=Dictyostelium  discoideum        GN=rabP           PE=3
sp|Q55CB9|RASV_DICDI   1.3e-45  161.1  Ras-like          protein                   rasV        OS=Dictyostelium  discoideum        GN=rasV           PE=3
sp|P33519|RAN_DICDI    1.3e-45  161.1  GTP-binding       nuclear                   protein     Ran               OS=Dictyostelium  discoideum        GN=ranA
... truncated

below the Sacharomyces $hmmsearch$ output. Note that the yeast Rab GTPases are labeled as YPT* proteins:

# grep Saccharom sprot_Rab_hits_hmmsearch.out |  awk '{print $1,$5,$6,$19,$20,$21,$22,$23,$24,$25}' | column -t
sp|P01123|YPT1_YEAST   3.1e-93  317.7  GTP-binding           protein             YPT1                OS=Saccharomyces  cerevisiae        (strain           ATCC
sp|P38555|YPT31_YEAST  1e-87    299.6  GTP-binding           protein             YPT31/YPT8          OS=Saccharomyces  cerevisiae        (strain           ATCC
sp|P51996|YPT32_YEAST  7.3e-87  296.8  GTP-binding           protein             YPT32/YPT11         OS=Saccharomyces  cerevisiae        (strain           ATCC
sp|P07560|SEC4_YEAST   3e-86    294.8  Ras-related           protein             SEC4                OS=Saccharomyces  cerevisiae        (strain           ATCC
sp|Q99260|YPT6_YEAST   6e-81    277.3  GTP-binding           protein             YPT6                OS=Saccharomyces  cerevisiae        (strain           ATCC
sp|P36017|VPS21_YEAST  2e-79    272.3  Vacuolar              protein             sorting-associated  protein           21                OS=Saccharomyces  cerevisiae
sp|P32939|YPT7_YEAST   5.4e-75  257.7  GTP-binding           protein             YPT7                OS=Saccharomyces  cerevisiae        (strain           ATCC
sp|P36019|YPT53_YEAST  3.2e-74  255.2  GTP-binding           protein             YPT53               OS=Saccharomyces  cerevisiae        (strain           ATCC
sp|P36018|YPT52_YEAST  7.6e-74  254.0  GTP-binding           protein             YPT52               OS=Saccharomyces  cerevisiae        (strain           ATCC
sp|P38146|YPT10_YEAST  1.6e-59  206.9  GTP-binding           protein             YPT10               OS=Saccharomyces  cerevisiae        (strain           ATCC
sp|A6ZSH6|YPT11_YEAS7  9.6e-56  194.4  GTP-binding           protein             YPT11               OS=Saccharomyces  cerevisiae        (strain           YJM789)
sp|B5VQB6|YPT11_YEAS6  1.1e-55  194.3  GTP-binding           protein             YPT11               OS=Saccharomyces  cerevisiae        (strain           AWRI1631)
sp|P48559|YPT11_YEAST  1.1e-55  194.3  GTP-binding           protein             YPT11               OS=Saccharomyces  cerevisiae        (strain           ATCC
sp|B3LPD8|YPT11_YEAS1  1.1e-55  194.2  GTP-binding           protein             YPT11               OS=Saccharomyces  cerevisiae        (strain           RM11-1a)
-------------------------------------------------------------------------------------------------------------------------------------------------------------------- / score 185 cutoff; note also large score drop
sp|P01119|RAS1_YEAST   1.9e-48  170.5  Ras-like              protein             1                   OS=Saccharomyces  cerevisiae        (strain           ATCC
sp|P32836|GSP2_YEAST   3.6e-48  169.6  GTP-binding           nuclear             protein             GSP2/CNR2         OS=Saccharomyces  cerevisiae        (strain
sp|P32835|GSP1_YEAST   4.6e-48  169.2  GTP-binding           nuclear             protein             GSP1/CNR1         OS=Saccharomyces  cerevisiae        (strain
sp|P01120|RAS2_YEAST   1.1e-47  167.9  Ras-like              protein             2                   OS=Saccharomyces  cerevisiae        (strain           ATCC
sp|P13856|RSR1_YEAST   5e-46    162.5  Ras-related           protein             RSR1                OS=Saccharomyces  cerevisiae        (strain           ATCC
sp|Q00246|RHO4_YEAST   2e-41    147.4  GTP-binding           protein             RHO4                OS=Saccharomyces  cerevisiae        (strain           ATCC
sp|P06780|RHO1_YEAST   3.2e-41  146.7  GTP-binding           protein             RHO1                OS=Saccharomyces  cerevisiae        (strain           ATCC
sp|Q00245|RHO3_YEAST   1.1e-39  141.7  GTP-binding           protein             RHO3                OS=Saccharomyces  cerevisiae        (strain           ATCC
sp|P06781|RHO2_YEAST   1.2e-37  135.0  GTP-binding           protein             RHO2                OS=Saccharomyces  cerevisiae        (strain           ATCC
sp|P19073|CDC42_YEAST  7.9e-33  119.1  Cell                  division            control             protein           42                OS=Saccharomyces  cerevisiae
... truncated

In conclusion, the score cutoff >= 185 consistently identifies canonical small Rab GTPases across highly divergent eukaryotic lineages such as Opisthokonta (which includes animals like Homo and fungi like Saccharomyces), Amoebozoa (which includes the free-living amoeba Dictyostelium), and Archaeplastida (which includes plants like Arabidopsis). These results demonstrate that careful HMM calibration ensures accurate identification of bona fide homologs across distantly related groups of organisms.

Note that despite the fact that no plant sequences were included in the seed alignment of Rab GTPases used to build the Rab.hmm model, the same cutoff consistently identifies the canonical Rab GTPases in the well annotated Arabidopsis thaliana proteome. You can confirm this by searching the results for other organisms with well-characterized proteomes, like Rattus or Bos.

The following figure from (Burki et al. 2020; The new tree of Eukaryotes. TREE.) shows the phylogenetic position of the three supergroups Opisthokonta, Amoebozoa, and Archaeplastida.

Burki et al. 2020. The new tree of Eukaryotes. TREE Volume 35, Issue 1, p43-55.

6.3.5 Identify Rab GTPases in the Acanthamoeba castellanii proteome with $hmmsearch$ using the Rab.hmm and empirically determined score cutoff

In the previous section we concluded that a score cutoff >= 185 should retrieve bona fide Rab GTPases in a $hmmsearch$ using our Rab.hmm model.

The line below shows how to display the $hmmsearch$ help, and how to set the score threshold for the output file with $-T$ in a search for canonical Rab GTPases in the poorly characterized proteome of the amoebozan Acanthamoeba castellanii. Its proteome is saved in the ACACA.faa file.

# hmmsearch -h
# hmmsearch :: search profile(s) against a sequence database
# HMMER 3.4 (Aug 2023); http://hmmer.org/
# Copyright (C) 2023 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Usage: hmmsearch [options] <hmmfile> <seqdb>
# ... truncated
#Options controlling reporting thresholds:
#  -E <x>     : report sequences <= this E-value threshold in output  [10.0]  (x>0)
#  -T <x>     : report sequences >= this score threshold in output
#  --domE <x> : report domains <= this E-value threshold in output  [10.0]  (x>0)
#  --domT <x> : report domains >= this score cutoff in output
# ... truncated
hmmsearch --tblout ACACA_Rab_hits_hmmsearch.out --noali --notextw -T 185 Rab.hmm ACACA.faa

6.3.5.1 Count the $hmmsearch$ hits, and explore their sequence E-values and scores

grep '^tr' ACACA_Rab_hits_hmmsearch.out | wc -l ## 58
grep '^tr' ACACA_Rab_hits_hmmsearch.out | awk '{print $1,$5,$6,$19,$20,$21,$22,$23,$24,$25}' | column -t

tr|L8HD31|L8HD31_ACACA  7.1e-90  301.5  Rasrelated        protein      ORAB-1,          putative         OS=Acanthamoeba  castellanii      str.
tr|L8GSP6|L8GSP6_ACACA  1.1e-88  297.7  Ras-related       protein      Rab-2A,          putative         OS=Acanthamoeba  castellanii      str.
tr|L8HJ43|L8HJ43_ACACA  2.8e-88  296.3  RAB11B            protein,     putative         OS=Acanthamoeba  castellanii      str.             Neff
tr|L8H946|L8H946_ACACA  7.5e-84  281.7  Ras-related       protein      Rab-14           OS=Acanthamoeba  castellanii      str.             Neff
tr|L8H2N2|L8H2N2_ACACA  1.6e-82  277.3  Rasrelated        protein      Rab-2B,          putative         OS=Acanthamoeba  castellanii      str.
tr|L8H896|L8H896_ACACA  2.9e-81  273.2  RAB1B,            member       RAS              oncogene         family           OS=Acanthamoeba  castellanii
tr|L8GH28|L8GH28_ACACA  5.9e-81  272.2  Ras               subfamily    protein          OS=Acanthamoeba  castellanii      str.             Neff
tr|L8GYY7|L8GYY7_ACACA  9.8e-81  271.5  Rasrelated        protein      Rab-18-like,     putative         OS=Acanthamoeba  castellanii      str.
tr|L8HGZ3|L8HGZ3_ACACA  1.4e-80  271.0  Rab8/RabEfamily   small        GTPase,          putative         OS=Acanthamoeba  castellanii      str.
tr|L8HAR7|L8HAR7_ACACA  3.4e-80  269.7  Rab               GTPase,      putative         OS=Acanthamoeba  castellanii      str.             Neff
tr|L8GIC5|L8GIC5_ACACA  5.7e-80  269.0  Ras               family       protein          OS=Acanthamoeba  castellanii      str.             Neff
tr|L8GX62|L8GX62_ACACA  1.7e-78  264.1  Ras               family       GTP-binding      protein          YPT1             OS=Acanthamoeba  castellanii
tr|L8GN15|L8GN15_ACACA  2.4e-78  263.6  GTPbinding        protein      OS=Acanthamoeba  castellanii      str.             Neff             OX=1257118
tr|L8GPH2|L8GPH2_ACACA  3.6e-78  263.0  Rab7/RabGfamily   small        GTPase           OS=Acanthamoeba  castellanii      str.             Neff
tr|L8GK29|L8GK29_ACACA  8e-78    261.9  Ras               subfamily    protein          OS=Acanthamoeba  castellanii      str.             Neff
tr|L8HIJ7|L8HIJ7_ACACA  4.1e-77  259.6  Ras-related       protein      Rab-2-B          OS=Acanthamoeba  castellanii      str.             Neff
tr|L8HCK2|L8HCK2_ACACA  8.9e-77  258.5  RAB4,             putative     OS=Acanthamoeba  castellanii      str.             Neff             OX=1257118
tr|L8HGB4|L8HGB4_ACACA  3.9e-75  253.1  RabE              family       small            GTPase           (Fragment)       OS=Acanthamoeba  castellanii
tr|L8HAC7|L8HAC7_ACACA  1.9e-73  247.6  Ras-related       protein      Rab              OS=Acanthamoeba  castellanii      str.             Neff
tr|L8GYQ2|L8GYQ2_ACACA  2.3e-73  247.3  Raslike           GTP-binding  protein          YPT1,            putative         OS=Acanthamoeba  castellanii
tr|L8GXM9|L8GXM9_ACACA  4.4e-72  243.0  GTPbinding        protein      YPTC5,           putative         OS=Acanthamoeba  castellanii      str.
tr|L8GU59|L8GU59_ACACA  7.5e-72  242.3  Ras               subfamily    protein          OS=Acanthamoeba  castellanii      str.             Neff
tr|L8GKG8|L8GKG8_ACACA  7.9e-72  242.2  Rasrelated        protein      Rab5,            putative         OS=Acanthamoeba  castellanii      str.
... truncated
tr|L8GLA9|L8GLA9_ACACA  5.5e-62  209.8  Likely            rab          family           GTPbinding       protein          OS=Acanthamoeba  castellanii
tr|L8H5G9|L8H5G9_ACACA  9.1e-62  209.1  Ras               subfamily    protein          OS=Acanthamoeba  castellanii      str.             Neff
tr|L8H0T2|L8H0T2_ACACA  1.1e-59  202.2  Ras               subfamily    protein          OS=Acanthamoeba  castellanii      str.             Neff
tr|L8GR44|L8GR44_ACACA  1.2e-59  202.2  Ras               subfamily    protein          OS=Acanthamoeba  castellanii      str.             Neff
tr|L8GX96|L8GX96_ACACA  2.9e-59  200.9  Ras-related       protein      Rab              OS=Acanthamoeba  castellanii      str.             Neff
tr|L8GU85|L8GU85_ACACA  8.9e-59  199.3  Ras               family       protein          OS=Acanthamoeba  castellanii      str.             Neff
tr|L8HD72|L8HD72_ACACA  1.3e-58  198.7  Small             rabrelated   GTPase           OS=Acanthamoeba  castellanii      str.             Neff
tr|L8GYC3|L8GYC3_ACACA  1.2e-56  192.3  Rab1,             putative     OS=Acanthamoeba  castellanii      str.             Neff             OX=1257118
tr|L8HER3|L8HER3_ACACA  1.5e-56  192.0  Ras               subfamily    protein          OS=Acanthamoeba  castellanii      str.             Neff
tr|L8GII8|L8GII8_ACACA  4.8e-56  190.3  Ras               subfamily    protein          OS=Acanthamoeba  castellanii      str.             Neff
tr|L8HB60|L8HB60_ACACA  6.7e-56  189.8  PH                domain       containing       protein          OS=Acanthamoeba  castellanii      str.
tr|L8GVE2|L8GVE2_ACACA  8.3e-56  189.5  Ras               subfamily    protein          OS=Acanthamoeba  castellanii      str.             Neff
tr|L8H1Y4|L8H1Y4_ACACA  3.2e-55  187.6  Rabtype           small        GTP-binding      protein          OS=Acanthamoeba  castellanii      str.
tr|L8HGU8|L8HGU8_ACACA  8.7e-55  186.2  Ras               family       protein          OS=Acanthamoeba  castellanii      str.             Neff

The output displayed above reveals the much poorer quality, and high inconsistency, of the protein annotations of the A. castellanii proteome, which reflects the fact that none of these proteins is yet in Swiss-Prot. However, according to our calibration, these 58 proteins are reasonable candidates for being bona fide members of the Rab GTPase family.

However, we will analyze these proteins in greater detail to identify any potential false positives. To do so, we will first need to fetch the corresponding proteins with $esl-sfetch$.

6.3.5.2 Extract the $hmmsearch$ hits from the Acanthamoeba proteome using $esl-sfetch$

$esl-sfetch$ is an auxiliary program from the HMMER3 package designed to index a proteome, and extract the hits found with $hmmsearch$ using a list of hit identifiers, as shown below:

# 1. Use an AWK one-liner to extract the accession numbers for the Acanthamoeba hmmsearch hits
awk '$0 !~ /^#/{print $1}' ACACA_Rab_hits_hmmsearch.out > ACACA_Rab_hmmsearch_hits.list

# 2. Before extracting the sequences from the Acanthamoeba proteome, we need to index it, and generate a FASTA database.
esl-sfetch --index ACACA.faa

# 3. Extract the sequences in FASTA format
esl-sfetch -f ACACA.faa ACACA_Rab_hmmsearch_hits.list > ACACA_Rab_hmmsearch_hits.faa

6.3.5.3 Validating the $hmmsearch$ hits (I): Analysis of the Acanthamoeba Rab sequence lengths with $fasta\_seqtool.awk$

$fasta\_seqtool.awk$ is a simple $AWK$ script to perform different types of analyses on FASTA sequencs. In its ‘runmode 5’, it prints the sequence lengths in its 7th output field. The output of the $fasta\_seqtool.awk$ is passed to col_sumStats.sh, which generates a summary of the length distribution of the sequences in the source FASTA file.

# compute basic sequence length summary statistics
fasta_toolkit.awk -R 5 ACACA_Rab_hmmsearch_hits.faa | sed '1d' | cut -f7 | col_sumStats.sh # version 0.5 of the script
fasta_toolkit.awk -R 6 ACACA_Rab_hmmsearch_hits.faa | cut -f2 | col_sumStats.sh # version 0.6 of the script <<< download from https://github.com/vinuesa/TIB-filoinfo

Min: 155
Mean: 259.052
Median: 216
Mode: 216
Max: 988

The output above shows that the median and mode are equal to 216, but there are also some significantly shorter and longer sequences.

Let us explore this in greater detail by printing only the rows for sequences of length > 216 in order to identify any mayor ‘jumps’ in sequence size.

# Note: we require fasta_toolkit.awk >= v0.6, available from https://github.com/vinuesa/TIB-filoinfo
fasta_toolkit.awk -R 6 ACACA_Rab_hmmsearch_hits.faa | awk -F "\t" '$2 > 216' | sort -t $'\t' -nk2,2

fasta_toolkit.awk -R 6 ACACA_Rab_hmmsearch_hits.faa | awk -F "\t" '$2 > 216' | sort -t $'\t' -nk2,2 > ACACA_Rab_hmmsearch_hits_sequence_lenghts.tsv

tr|L8GKG8|L8GKG8_ACACA Rasrelated protein Rab5, putative OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_091860 PE=4 SV=1  218
tr|L8HAC7|L8HAC7_ACACA Ras-related protein Rab OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_126680 PE=3 SV=1    220
tr|L8HIJ7|L8HIJ7_ACACA Ras-related protein Rab-2-B OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_110390 PE=4 SV=1    220
tr|L8GLA9|L8GLA9_ACACA Likely rab family GTPbinding protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_246860 PE=4 SV=1   221
tr|L8H9Y3|L8H9Y3_ACACA RAB8D, putative OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_253150 PE=4 SV=1    223
tr|L8GIC5|L8GIC5_ACACA Ras family protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_399850 PE=4 SV=1 228
tr|L8HAR7|L8HAR7_ACACA Rab GTPase, putative OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_328080 PE=4 SV=1   228
tr|L8HER3|L8HER3_ACACA Ras subfamily protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_143990 PE=4 SV=1  231
tr|L8GNH9|L8GNH9_ACACA Ras subfamily protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_076440 PE=4 SV=1  233
tr|L8GH28|L8GH28_ACACA Ras subfamily protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_256400 PE=4 SV=1  237
tr|L8HJZ6|L8HJZ6_ACACA Ras-related protein Rab OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_296250 PE=3 SV=1    238
tr|L8GVE2|L8GVE2_ACACA Ras subfamily protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_087400 PE=4 SV=1  241
tr|L8GGW7|L8GGW7_ACACA Ras subfamily protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_151500 PE=4 SV=1  244
tr|L8GJ03|L8GJ03_ACACA Ras subfamily protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_093810 PE=4 SV=1  244
tr|L8GX96|L8GX96_ACACA Ras-related protein Rab OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_089990 PE=3 SV=1    249
tr|L8GIV3|L8GIV3_ACACA Ras subfamily Rab protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_091830 PE=4 SV=1  256
tr|L8H5G9|L8H5G9_ACACA Ras subfamily protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_206060 PE=4 SV=1  259
tr|L8GK29|L8GK29_ACACA Ras subfamily protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_243420 PE=4 SV=1  264
tr|L8H339|L8H339_ACACA RAB36 member RAS oncogene family protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_301250 PE=4 SV=1   273
tr|L8H6Y7|L8H6Y7_ACACA Ras subfamily protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_181840 PE=4 SV=1  275
tr|L8GII8|L8GII8_ACACA Ras subfamily protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_151930 PE=4 SV=1  298
----------------------------------------------------------------------------------------------------------------------------------- major jump in size
tr|L8GU85|L8GU85_ACACA Ras family protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_087420 PE=4 SV=1 401
tr|L8GVW1|L8GVW1_ACACA Ras subfamily protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_057810 PE=4 SV=1  482
tr|L8GU59|L8GU59_ACACA Ras subfamily protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_090170 PE=4 SV=1  514
tr|L8GTH9|L8GTH9_ACACA Fbox domain and Ankyrin repeat containing protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_010610 PE=4 SV=1  519
tr|L8H3F2|L8H3F2_ACACA Ras subfamily protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_201200 PE=4 SV=1  526
tr|L8GVE7|L8GVE7_ACACA Ras subfamily protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_087450 PE=4 SV=1  536
tr|L8HB60|L8HB60_ACACA PH domain containing protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_228770 PE=4 SV=1   988

This output reveals that: * there is a major jump in size for sequences > 298 aa * that there are multiple sequences that are over twice as long as the modal length (mode = 216) of the distribution

6.3.5.4 Validating the $hmmsearch$ hits (II): Verifying the presence of prenylation sites among the five C-terminal residues of Acanthamoeba Rabs

We will make again use of the extract_N-or_C-terminal_regions_and_compute_counts.pl Perl script to extract the five C-terminal residues of Acanthamoeba Rabs and count the number of cysteines found among them. Bona fide Rabs are expected to contain at least one cysteine in this C-terminal region.

# run the script to extract the five C-termina residues of each sequence in ACACA_Rab_hmmsearch_hits.faa, 
# and count the instances of cysteines in that region
extract_N-or_C-terminal_regions_and_compute_counts.pl -f ACACA_Rab_hmmsearch_hits.faa -o ACACA_Rab_hmmsearch_hits_5CtCs -n 5 -r C -a C

# filter out sequences from CACA_Rab_hmmsearch_hits_5CtCs_Cterm.tsv that do not contain at least one cysteine
awk '$3 == 0' ACACA_Rab_hmmsearch_hits_5CtCs_Cterm.tsv

tr|L8GSP6|L8GSP6_ACACA  HSTPK   0
tr|L8H896|L8H896_ACACA  KSRIH   0
tr|L8HGB4|L8HGB4_ACACA  EYGIR   0
tr|L8GXM9|L8GXM9_ACACA  PFATL   0
tr|L8GSR3|L8GSR3_ACACA  RDVGP   0
tr|L8GNH9|L8GNH9_ACACA  FISVK   0
tr|L8GJ03|L8GJ03_ACACA  LPRWH   0
tr|L8H0T2|L8H0T2_ACACA  RKRRT   0
tr|L8HER3|L8HER3_ACACA  ASRHL   0
tr|L8HB60|L8HB60_ACACA  AKKEN   0

As shown above, there are 10 Acanthamoeba Rab homologs found in our $hmmsearch$ run that do not comply with the required presence of a prenylation site at their C-termini. These sequences should not be considered canonical Rab GTPases!

Let us generate a file containing the list of non-canonical Acanthamoeba Rab GTPases, and use it to find their lengths by filtering our ACACA_Rab_hmmsearch_hits_sequence_lenghts.tsv file as shown below:

awk '$3 == 0{print $1}' ACACA_Rab_hmmsearch_hits_5CtCs_Cterm.tsv > ACACA_Rab_hmmsearch_hits_non-canonical_Rabs_lacking_CtCs.list

while read -r id; do grep "$id" ACACA_Rab_hmmsearch_hits_sequence_lenghts.tsv; done < ACACA_Rab_hmmsearch_hits_non-canonical_Rabs_lacking_CtCs.list | sort -t$'\t' -nk2,2

tr|L8HGB4|L8HGB4_ACACA RabE family small GTPase (Fragment) OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_376480 PE=4 SV=1    155
tr|L8GSR3|L8GSR3_ACACA Rab11/RabAfamily small GTPase OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_216930 PE=4 SV=1  186
----------------------------------------------------------------------------------------------------------------------------------------- \ below canonical Rab GTPase size
tr|L8GXM9|L8GXM9_ACACA GTPbinding protein YPTC5, putative OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_383740 PE=4 SV=1 197
tr|L8H896|L8H896_ACACA RAB1B, member RAS oncogene family OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_384590 PE=4 SV=1  198
tr|L8GSP6|L8GSP6_ACACA Ras-related protein Rab-2A, putative OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_164790 PE=4 SV=1   200
tr|L8H0T2|L8H0T2_ACACA Ras subfamily protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_036890 PE=4 SV=1  207
tr|L8HER3|L8HER3_ACACA Ras subfamily protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_143990 PE=4 SV=1  231
tr|L8GNH9|L8GNH9_ACACA Ras subfamily protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_076440 PE=4 SV=1  233
tr|L8GJ03|L8GJ03_ACACA Ras subfamily protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_093810 PE=4 SV=1  244
----------------------------------------------------------------------------------------------------------------------------------------- / above canonical Rab GTPase size
tr|L8HB60|L8HB60_ACACA PH domain containing protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_228770 PE=4 SV=1   988

Our analyses reveal that there are 10 non-canonical Rab GTPases among our 58 $hmmsearch$ hits lacking the required prenylation sites (cysteines) at their C-termini. Three of these sequences are outliers in the length distribution of the Acanthamoeba Rabs, and outside of the length range of canonical Rab GTPases previously calculated.

Based on this evidence, we should remove the non-canonical Rab GTPases from our set of 58 $hmmsearch$ hits. This can be easily done with the following calls to select_sequences_by_ID.pl, which selects (includes|excludes) sequences listed in the –ids IDs.list file, and writes the corresponding, filtered FASTA file to disk.

select_sequences_by_ID.pl is available on my GitHub repo.

# write a new FASTA file excluding the sequences lacking the C-terminal C's
select_sequences_by_ID.pl --fasta ACACA_Rab_hmmsearch_hits.faa --ids ACACA_Rab_hmmsearch_hits_non-canonical_Rabs_lacking_CtCs.list --output ACACA_Rab_hmmsearch_hits_with_CtCs.faa --exclude

# write a new FASTA file containing only the sequences lacking C-terminal C's
select_sequences_by_ID.pl --fasta ACACA_Rab_hmmsearch_hits.faa --ids ACACA_Rab_hmmsearch_hits_non-canonical_Rabs_lacking_CtCs.list --output ACACA_Rab_hmmsearch_hits_without_CtCs.faa --include

6.4 Align the Acanthamoeba $hmmsearch$ hits using $hmmalign$ and the Rab family-specific HMM

$hmmalign$ is another core program of the HMMER3 distribution. It is used to very rapidly align (homologous) sequences to the profile HMM.

This tool is used to generate the “full alignments” in Pfam from $hmmsearch$ hits found with the HMM generated from the seed alignments. Some of these “full alignments” contain many thousands of sequences.

# hmmalign :: align sequences to a profile HMM
# HMMER 3.4 (Aug 2023); http://hmmer.org/
# Copyright (C) 2023 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Usage: hmmalign [-options] <hmmfile> <seqfile>

# --trim : trim terminal tails of nonaligned residues from alignment
hmmalign --trim -o ACACA_Rab_hmmsearch_hits_hmmaling_trimmed.sto Rab.hmm ACACA_Rab_hmmsearch_hits.faa

hmmalign --trim -o ACACA_Rab_hmmsearch_hits_with_CtCs_hmmaling_trimmed.sto Rab.hmm ACACA_Rab_hmmsearch_hits_with_CtCs.faa

# have a look a the Stockholm-formatted alignment
less  ACACA_Rab_hmmsearch_hits_with_CtCs_hmmaling_trimmed.sto

The command shown above generates an alignment of the sequence regions in ACACA_Rab_hmmsearch_hits.faa that overlap with the HMM model Rab.hmm, trimming the flanking, non-aligned residues.

6.4.1 Convert the Stockholm-formatted alignment produced by $hmmalign$ to FASTA using $esl-reformat$

As mentioned before, the HMMER3 suite provides multiple small utilities to manipulate sequences and profiles. $esl-reformat$ can be used to convert between certain formats. Here we convert the Stockholm-formatted alignment produced by $hmmalign$, which cannot be read by $seaview$, to FASTA.

esl-reformat -o ACACA_Rab_hmmsearch_hits_hmmaling_trimmed.afa afa ACACA_Rab_hmmsearch_hits_hmmaling_trimmed.sto

esl-reformat -o ACACA_Rab_hmmsearch_hits_with_CtCs_hmmaling_trimmed.afa afa ACACA_Rab_hmmsearch_hits_with_CtCs_hmmaling_trimmed.sto

6.4.2 Explore the $hmmalign$ alignment of Acanthamoeba Rab homologues with $seaview$

# explore the alignment
seaview ACACA_Rab_hmmsearch_hits_hmmaling_trimmed.afa &

$hmmalign$ alignment of Acanthamoeba Rab GTPases.

Note that $hmmalign$ can only align the regions that overlap with the HMM model. Therefore, the alignment shown above, covers only the core Rab domain. It is therefore quite notorious that several sequences seem to have deletions within otherwise highly conserved sequence blocks. This is suspicious.

In a recent publication from our group Rivera et al. (2014), we discovered that the gene models of multiple Acanthamoeba Rab GTPases are wrong, including non-existing introns that explain at least some of the observed gaps in multiple sequence alignments of A. castellanii Rab7 paralogs with the canonical RAB7A_HUMAN protein.

Figure from: Rivera et al. (2024). Phylogenomic, structural, and cell-biological analyses reveal that Stenotrophomonas maltophilia replicates in acidified Rab7A-positive vacuoles of Acanthamoeba castellanii. Microbiology Spectrum, 2024 Mar 5;12(3):e0298823.. doi: 10.1128/spectrum.02988-23

The following figure shows the prediction we made for the correct A. castellanii Rab7A protein, which is critical in membrane trafficking along the phagocytic and autophagic pathways.

6.4.3 Align the full-length Acanthamoeba Rabs using $clustalo$ and the Rab family-specific HMM

As shown in a previous section, $clustalo$ can use externally provided HMMs to aid in the multiple sequence alignment process. Therefore, we can use the Rab.hmm to align the ACACA_Rab_hmmsearch_hits.faa FASTA, generating an alignment of the full-length sequences, which is not possible with $hmmalign$.

Furthermore, the MSA produced by $hmmalign$, even after conversion to FASTA format, contains dot characters that are not acceptable in the FASTA format. Therefore, it is certainly convenient to do HMM-guided full-length MSAs with $clustalo$.

clustalo -i ACACA_Rab_hmmsearch_hits.faa --hmm-in Rab.hmm -o Rab_SPhyd94_plusACA_cluoHMM.afa --threads 4

seaview Rab_SPhyd94_plusACA_cluoHMM.afa &

Inspection of the Rab_SPhyd94_plusACA_cluoHMM.afa alignment confirms that it was performed on the full-length sequences. As seen already in the alignment produced with $hmmalign$, several sequences contain suspicious gaps in otherwise highly conserved signature regions.

From these analyses we can conclude that the A. castellanii proteome is poorly and inconsistently annotated, and that the gene models (exon-intron structure) of the genome require revision.

6.5 Annotating proteins by interrogating the Pfam library of profile HMMs with $hmmscan$

As a final exercise, we will learn how to use $hmmscan$ to search with sequences against a library of HMMs.

We will use the subset of Acanthamoeba Rab hits found by $hmmsearch$ with our Rab.hmm model that are longer than 300 residues. These proteins are notably larger than the canonical Rabs used to construct the Rab.hmm model, which are between 200-250 residues, as shown before. We want to know if these larger proteins contain additional domains, and identify them by interrogating the Pfam A library of profile HMMs for known domains.

6.5.1 Identify and extract Acanthamoeba Rab GTPases > 300 amino acids in length from the indexed ACACA.faa proteome using $esl-sfetch$

We will use again the handy $AWK$ script $fasta\_toolkit.awk$ to compute the lengths of each sequence, generating a list of IDs for the sequences that are > 300 residues long, and extract them from the previously indexed ACACA.faa proteome with $esl-sfetch$.

# print the IDs of Acanthamoeba Rab sequences that are > 300 aa long using fasta_toolkit.awk
#  and filtering its output using cut, sort and awk to generate a sorted list of sequence lengths.
# fasta_toolkit.awk -R 5 ACACA_Rab_hmmsearch_hits.faa | sed '1d' | cut -f1,7 | sort -t $'\t' -nk2,2 | awk -F"\t" '$2 > 300' | cut -d' ' -f1 # <<< v0.5
fasta_toolkit.awk -R 6 ACACA_Rab_hmmsearch_hits.faa | sort -t $'\t' -nk2,2 | awk -F"\t" '$2 > 300' | cut -d' ' -f1

# fetch the > 300 residue-long sequences from the previously indexed Acanthamoeba proteome database using esl-sfetch
# Note the use of the <() (process substitution) Bash idiom to avoid writing intermediary files
esl-sfetch -f ACACA.faa <(fasta_toolkit.awk -R 6 ACACA_Rab_hmmsearch_hits.faa | sort -t $'\t' -nk2,2 | awk -F"\t" '$2 > 300' | cut -d' ' -f1) > ACA_Rab_hits_gt300aa.faa

# the code below would produce the same result as the command shown above, but writing the list of sequences with > 300 residues to the file ACA_Rab_hit_IDs_gt300aa.list
#fasta_toolkit.awk -R 6 ACACA_Rab_hmmsearch_hits.faa | sort -t $'\t' -nk2,2 | awk -F"\t" '$2 > 300' | cut -d' ' -f1 > ACA_Rab_hit_IDs_gt300aa.list
#esl-sfetch -f ACACA.faa ACA_Rab_hit_IDs_gt300aa.list > ACA_Rab_hits_gt300aa.faa

6.5.2 Scan the Acanthamoeba Rab GTPases > 300 amino acids to annotate their Pfam domains using $hmmscan$

Rather than searching a single profile HMM against a collection of sequences, we want to annotate a protein sequences by using them as queries to search against a collection of profiles of all known domains and repeats available in Pfam.

$hmmscan$ takes as input a query file containing one or more protein sequences to annotate, and a profile database to search them against. The profile database might be Pfam, SMART, or TIGRFams, for example, or another collection of your choice, including custom-made HMMs.

A profile “database” file is just a concatenation of individual profile files. To create a database file, you can either build individual profile files and concatenate them, or you can concatenate Stockholm alignments and use $hmmbuild$ to build a profile database from them in one command.

Here we will use the Pfam A library of profile HMMs for known domains. The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs), currently (v37.0) holding 21,979 entries, grouped in 709 clans.

6.5.2.1 Compressing and indexing large HMM text flatfiles (e.g. Pfam-A.hmm) with $hmmpress$

$hmmscan$ has to read a lot of profiles in a hurry, and HMMER’s HMM text flatfiles are bulky. To accelerate this, $hmmscan$ depends on binary compression and indexing of the flatfiles. Therefore, before running $hmmscan$, you need to compress and index your profile database with the $hmmpress$ program: hmmpress your_profile_db

I compressed and indexed the Pfam-A.hmm profile database with the following command:

hmmpress Pfam-A.hmm

Remember that we can access the path to the Pfam-A database by interpolating the variable “${pfam_db}”.

6.5.2.2 Writing a $hmmscan$ parseable table of per-domain hits with `–domtblout` , and using model-specific cutoffs

When running $hmmscan$ on Pfam, I like to use two very useful options:

Saving a parseable table of per-domain hits to file with the –domtblout option
Filtering the output using model-specific thresholding encoded in the Pfam profiles

#  hmmscan -h
# hmmscan :: search sequence(s) against a profile database
# HMMER 3.4 (Aug 2023); http://hmmer.org/
# Copyright (C) 2023 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Usage: hmmscan [-options] <hmmdb> <seqfile>
#... truncated

#Options controlling output:
#  -o <f>           : direct output to file <f>, not stdout
#  --tblout <f>     : save parseable table of per-sequence hits to file <f>
#  --domtblout <f>  : save parseable table of per-domain hits to file <f>
#... truncated
#Options for model-specific thresholding:
#  --cut_ga : use profile's GA gathering cutoffs to set all thresholding
#  --cut_nc : use profile's NC noise cutoffs to set all thresholding
#  --cut_tc : use profile's TC trusted cutoffs to set all thresholding


hmmscan --tblout ACA_Rab_hits_gt300aa_hmmscan_PfamA_tbl_cut_ga.out --domtblout ACA_Rab_hits_gt300aa_hmmscan_PfamA_domtbl_cut_ga.out --noali --notextw --cut_ga "${pfam_db}"/Pfam-A.hmm ACA_Rab_hits_gt300aa.faa

The relevant section of the Pfam userman relating to the model-specific thresholding cutoffs is shown below:

  GA   Gathering threshold:   

       Search threshold to build the full alignment.

       GA lines are the thresholds in bits used in the hmmsearch
       command line.  An example GA line is shown below:

       GA   25.00 15.00;

       The order of the thresholds is sequence, domain

       The corresponding hmmsearch command line for the HMM
       would be:

       hmmsearch -T 25 --domT 15 HMM DB

       The -T option specifies the whole sequence score in bits, and
       the --domT option specifies the per-domain threshold in bits.

  NC   Noise cutoff:

       This field refers to the bit scores of the highest scoring
       match not in the full alignment.

       An example NC line is shown below

       NC   19.50 18.10;

       As with the GA line, this field contains two numbers - the
       first number in the set refers to the highest whole sequence
       score in bits of a match not in the full alignment, and the
       second number specifies the highest per-domain score in bits of
       a match not in the full alignment.  These two scores may not
       refer to the same sequence.

  TC   Trusted cutoff:

       This field refers to the bit scores of the lowest scoring match
       in the full alignment.

       An example TC line is shown below

       TC   23.00 16.10;

       As with the GA line, this field contains two numbers - the
       first set refers to the lowest whole sequence score in bits of
       a match in the full alignment, and the second number specifies
       the lowest per-domain score in bits of a match in the full
       alignment.  These two scores may not refer to the same
       sequence.

Below I’m showing the first lines of the ACA_Rab_hits_gt300aa_hmmscan_PfamA_domtbl_cut_ga.out file, which contains 22 fields.

# display first 40 lines of the file
head -40 ACA_Rab_hits_gt300aa_hmmscan_PfamA_domtbl_cut_ga.out

$hmmscan$ domtblout output table of Pfam domains found in long (>300 aa) Acanthamoeba Rab GTPases.

In protein search programs, the –domtblout option produces the domain hits table shown above. There is one line for each domain. There may be more than one domain per sequence, as is the case for all proteins shown above.

The domtblout domain table has 22 space-delimited fields, followed by a free text target sequence description, as follows:

(1) **target name**: The name of the target sequence or profile.
(2) **target accession**: Accession of the target sequence or profile, or ’-’ if none is available.
(3) **tlen**: Length of the target sequence or profile, in residues. This (together with the query length) is useful for interpreting where the domain coordinates (in subsequent columns) lie in the sequence.
(4) **query name**: Name of the query sequence or profile.
(5) accession: Accession of the target sequence or profile, or ’-’ if none is available.
(6) **qlen**: Length of the query sequence or profile, in residues.
(7) E-value: E-value of the overall sequence/profile comparison (including all domains).
(8) score: Bit score of the overall sequence/profile comparison (including all domains), inclusive of a null2 bias composition correction to the score.
(9) bias: The biased composition score correction that was applied to the bit score.
(10) #: This domain’s number (1..ndom).
(11) of: The total number of domains reported in the sequence, ndom.
(12) c-Evalue: The “conditional E-value”, a permissive measure of how reliable this particular domain may be. 
(13) **i-Evalue**: The “independent E-value”, the E-value that the sequence/profile comparison would have received if this were the only domain envelope found in it, excluding any others. This is a **stringent measure** of how reliable this particular domain may be. It uses the total number of targets in the target database.
(14) **score**: The bit score for this domain.
(15) **bias**: The biased composition (null2) score correction that was applied to the domain bit score.
(16) from (hmm coord): The start of the MEA (maximum expected accuracy) alignment of this domain with respect to the profile, numbered 1..N for a profile of N consensus positions.
(17) to (hmm coord): The end of the MEA alignment of this domain with respect to the profile, numbered 1..N for a profile of N consensus positions.
(18) **from (ali coord)**: The start of the MEA alignment of this domain with respect to the sequence, numbered 1..L for a sequence of L residues.
(19) **to (ali coord)**: The end of the MEA alignment of this domain with respect to the sequence, numbered 1..L for a sequence of L residues.
(20) **from (env coord)**: The start of the domain envelope on the sequence, numbered 1..L for a sequence of L residues. The envelope defines a subsequence for which their is substantial probability mass supporting a homologous domain, whether or not a single discrete alignment can be identified.
(21)** to (env coord)**: The end of the domain envelope on the sequence, numbered 1..L for a sequence of L residues.
(22) **acc**: The mean posterior probability of aligned residues in the MEA alignment; a measure of how reliable the overall alignment is (from 0 to 1, with 1.00 indicating a completely reliable alignment according to the model).
(23) description of target: The remainder of the line is the target’s description line, as free text.

Since the output has many columns, it is generally useful to parse the per-domain output table file to produce a less detailed but still informative output.

I like to use $AWK$ one-liners for these tasks, like the one provided below, which provides a convenient overview of key parameters.

awk 'BEGIN{OFS="\t"; print "target\taccession\ttlen\tquery_name\tqlen\tdE-value\tdScore\tDbias\ti-Evalue\tscore\tbias\thmmStart\thmmEnd\tenvStart\tenvEnd\tacc"} !/^#/ {print $1,$2,$3,$4,$6,$7,$8,$9,$13,$14,$15,$16,$17,$20,$21,$22}' ACA_Rab_hits_gt300aa_hmmscan_PfamA_domtbl_cut_ga.out | column -t

I use this simplified version of the –domtblout so often, that I’ve written a simple $Bash$ wrapper around it, called simplify_domtblout.sh. You can grab $simplify\_domtblout.sh$ from my GitHub repo.

Below the output of the command provided above, showing only the lines corresponding to the first (shortest) and last (largest) query sequences.

target      accession   tlen  query_name              qlen  dE-value  dScore  Dbias  i-Evalue  score  bias  hmmStart  hmmEnd  envStart  envEnd  acc
Ras         PF00071.27  162   tr|L8GU85|L8GU85_ACACA  401   2.3e-44   151.1   0.0    3.2e-44   150.6  0.0   1         157     189       351     0.98
Roc         PF08477.18  119   tr|L8GU85|L8GU85_ACACA  401   2.8e-28   98.8    0.0    4.1e-28   98.3   0.0   1         119     189       303     0.89
Arf         PF00025.26  174   tr|L8GU85|L8GU85_ACACA  401   1.1e-15   57.9    0.0    1.5e-15   57.4   0.0   15        171     182       350     0.88
Ank_4       PF13637.11  50    tr|L8GU85|L8GU85_ACACA  401   4.9e-12   46.0    0.4    1.8e-05   25.2   0.1   7         50      1         49      0.91
Ank_4       PF13637.11  50    tr|L8GU85|L8GU85_ACACA  401   4.9e-12   46.0    0.4    5.2e-06   26.9   0.0   1         49      33        81      0.92
Ank_2       PF12796.12  90    tr|L8GU85|L8GU85_ACACA  401   2.4e-11   44.3    0.0    4.3e-11   43.5   0.0   5         80      1         89      0.84
Ank         PF00023.35  33    tr|L8GU85|L8GU85_ACACA  401   5.6e-11   42.3    2.0    3.2e-07   30.6   0.1   1         29      28        60      0.92
F-box-like  PF12937.12  47    tr|L8GU85|L8GU85_ACACA  401   1.2e-10   41.3    0.2    2.7e-10   40.1   0.2   2         42      96        141     0.96
Ank_5       PF13857.11  56    tr|L8GU85|L8GU85_ACACA  401   8.6e-08   32.7    0.1    3.7e-07   30.7   0.1   1         48      13        68      0.89
Ank_3       PF13606.11  30    tr|L8GU85|L8GU85_ACACA  401   6.8e-06   26.3    0.2    0.00047   20.7   0.0   1         28      28        57      0.91
F-box       PF00646.38  43    tr|L8GU85|L8GU85_ACACA  401   7.1e-05   22.8    0.1    0.00013   22.0   0.1   1         42      96        139     0.94
... truncated
Ras         PF00071.27  162   tr|L8HB60|L8HB60_ACACA  988   4.2e-63   212.0   0.0    2.9e-26   92.2   0.1   2         160     18        208     0.83
Ras         PF00071.27  162   tr|L8HB60|L8HB60_ACACA  988   4.2e-63   212.0   0.0    2.7e-32   111.8  0.0   2         160     431       594     0.94
Roc         PF08477.18  119   tr|L8HB60|L8HB60_ACACA  988   9.8e-23   80.9    0.0    1.2e-08   35.4   0.0   2         119     18        153     0.68
Roc         PF08477.18  119   tr|L8HB60|L8HB60_ACACA  988   9.8e-23   80.9    0.0    5.5e-10   39.8   0.0   2         119     431       547     0.79
PH          PF00169.34  105   tr|L8HB60|L8HB60_ACACA  988   4e-21     75.8    1.3    1.2e-12   48.6   0.0   3         104     212       311     0.92
GTP_EFTU    PF00009.32  188   tr|L8HB60|L8HB60_ACACA  988   1e-20     74.4    0.0    4.9e-11   42.8   0.0   5         186     15        208     0.84
GTP_EFTU    PF00009.32  188   tr|L8HB60|L8HB60_ACACA  988   1e-20     74.4    0.0    1.1e-06   28.7   0.0   18        186     384       594     0.80
MMR_HSR1    PF01926.28  113   tr|L8HB60|L8HB60_ACACA  988   7.3e-14   52.1    0.1    0.00017   21.9   0.0   1         99      18        151     0.62
MMR_HSR1    PF01926.28  113   tr|L8HB60|L8HB60_ACACA  988   7.3e-14   52.1    0.1    7.8e-06   26.3   0.0   1         86      431       545     0.72
Arf         PF00025.26  174   tr|L8HB60|L8HB60_ACACA  988   1.5e-13   50.8    0.0    3.9e-06   26.7   0.0   16        172     422       608     0.82

As expected, both proteins (L8GU85_ACACA and L8HB60_ACACA) have strong hits to the Ras profile HMM model, and notably weaker hits to domains of other Ras superfamily members that overlap (see envStart and envEnd) with the Ras domain hit coordinates on the query.

Note that the Ras domain hit has the largest score, although the model length (tlen) is not the largest one.

However, the first (shortest) sequence (L8GU85_ACACA) contains multiple Ankyrin repeats, and an F-box like domain on its N-terminal extension.

The last (largest) sequence (L8HB60_ACACA), interestingly, seems to be a fusion protein containing two Ras domains located at positions 18-208 and 431-594 of the 988 residue-long protein, separated by a PH domain (Pleckstrin homology) located between positions 212-311. This is certainly not a canonical Rab GTPase. In addition, as we found when validating hits for prenylation sites, L8HB60 does not contain cysteines among its five C-terminal residues.

You can interrogate IntePro with a PFAM accession to find out more about its function and structure.

For example:

PF00023: (ANK) Ankyrins are multifunctional adaptors that link specific proteins to the membrane-associated, spectrin-actin cytoskeleton. This repeat-domain is a ‘membrane-binding’ domain of up to 24 repeated units, and it mediates most of the protein’s binding activities. Repeats 13-24 are especially active, with known sites of interaction for the Na/K ATPase, Cl/HCO(3) anion exchanger, voltage-gated sodium channel, clathrin heavy chain, and L1 family cell adhesion molecules. The ANK repeats are found to form a contiguous spiral stack such that ion transporters like the anion exchanger associate in a large central cavity formed by the ANK repeat spiral, while clathrin and cell adhesion molecules associate with specific regions outside this cavity.
PF12937: F-box-like domain First identified in cyclin-F as a protein-protein interaction motif, the F-box is a conserved domain that is present in numerous proteins with a bipartite structure Through the F-box, these proteins are linked to the Skp1 protein and the core of SCFs (Skp1-cullin-F-box protein ligase) complexes. SCFs complexes constitute a new class of E3 ligases. They function in combination with the E2 enzyme Cdc34 to ubiquitinate G1 cyclins, Cdk inhibitors and many other proteins, to mark them for degradation. The binding of the specific substrates by SCFs complexes is mediated by divergent protein-protein interaction motifs present in F-box proteins, like WD40 repeats, leucine rich repeats or ANK repeats.
PF00169: Pleckstrin homology (PH) domain or (PHIP) is a protein domain of approximately 120 amino acids that occurs in a wide range of proteins involved in intracellular signaling or as constituents of the cytoskeleton.

Note that each of these Pfam entries has references to the relevant papers not shown in the output presented above.

6.5.3 Display of protein structures with PyMOL

Another useful skill to to have for studying protein domains is to be able to display them in customized ways. Here I will show how to display a protein using PyMOL, a user-sponsored molecular visualization system on an open-source foundation, maintained and distributed by Schrödinger. Here is the PyMOL Wiki.

Another popular molecular graphics visualization program is RasMol, originally developed by Roger Sayle. Here is the RasMol manual.

6.5.3.1 Fetching AlphaFold structures from AlphaFoldDB

First we need to fetch them the structures for the L8GU85 and L8HB60 Acanthamoeba proteins. You can explore them on UniProt: L8GU85, L8HB60, and fetch them directly from there.

We can fetch the structures predicted by AlphaFold for both proteins from the AlphaFoldDB with the following commands:

# fetch the "shortest" protein L8GU85 and remove the HTML header section
wget -c https://alphafold.ebi.ac.uk/files/AF-L8GU85-F1-model_v4.pdb

# fetch the "largest" protein L8HB60 and remove the HTML header section
wget -c https://alphafold.ebi.ac.uk/files/AF-L8HB60-F1-model_v4.pdb

6.5.3.2 Renedering structures with PyMOL

Now let us prepare two coordinate tables for the best-scoring, non-overlapping domains found in our $hmmscan$ search against the Pfam-A profile database.

Based on the tbldomout output table shown above, I have written the best-scoring, non-overlapping domains for each protein in two separate tab-delimited files. Its contents are shown below:

# best-scoring, non-overlapping domains for L8GU85 
Ras 189 351
Ank_2   1   89
F-box-like  96  141

# best-scoring, non-overlapping domains for L8HB60 
Ras1    18  208
Ras2    431 594
PH  212 311

Here some PyMOL code to render the proteins with the corresponding domains/regions highlighted in different colors. Note that I have also highlighted the N- and C-terminal regions as colored spheres (blue and red, respectively).

Note that the command set_color color-name, [r,b,g] requires the colors to be coded as RBG values between 0 and 1.

Below the PyMOL code for rendering the domains in L8GU85, which should be typed into the PyMOL command prompt.

# load PyMOL
pymol &
  
# let's write a PyMOL pml script (logfile) that will hold all commands executed on the PyMOL prompt
#File > logfile > open > L8GU85.pml
log_open L8GU85.pml

# Load the protein structure file
load AF-L8GU85-F1-model_v4.pdb

# Render as cartoon with fancy helices
as cartoon
set cartoon_fancy_helices, 1

# Color protein backbone in gray60
color gray60

# Define and set alternating domain colors
select Ras, resi 189-351
color firebrick, Ras
select Ank_2, resi 1-89
color limon, Ank_2
select F-box-like, resi 96-141
color violetpurple, F-box-like
deselect

# Highlight N- and C- terminal residues
show spheres, resi 1
color red, resi 1  # N-terminus in red
show spheres, resi 401 
color blue, resi 401   # C-terminus in blue

# Highlight C398 (potential prenylation site) as yellow sticks
show sticks, resi 398 # show cysteine 398 in the C-terminal region as sticks
color yellow, resi 398

# Turn and resize the structure as you desire and then save as png file to disk
ray 1500
png L8GU85_with_colored_Pfam_domains.png

log_close
quit

And here the PyMOL code for rendering L8HB60:


# Load the protein structure file
load AF-L8HB60-F1-model_v4.cif

# Render as cartoon with fancy helices
as cartoon
set cartoon_fancy_helices, 1

# Color protein backbone as gray60
color gray60

# Define and set alternating domain colors
select Ras_like, resi 18-208
color orange, Ras_like
select Ras, resi 431-594
color firebrick, Ras
select PH, resi 212-311
color yellow, PH
deselect

# Highlight N- and C- terminal residues
show spheres, (resi 1)
color blue, (resi 1)  # N-terminal in blue
show spheres, (resi 988) 
color red, (resi 988)   # C-terminal in red

# Turn and resize the structure as you desire and then save as png file to disk
ray 1500
png L8HB60_with_colored_Pfam_domains.png

We can run these lines from within $PyMOL$, or save them as $PyMOL$ script (.pml) files and call them directly from $PyMOL$. Once loaded, we tweak the output a bit to set the colors as we like, as shown in the following screenshot:

Using PyMOL to render a protein structure and highlight Pfam domains in distinct colors.

Final $PyMOL$ rendering for AF-L8GU85-F1 (L8GU85 Rab, 401 aa):

PyMOL rendering of AF-L8GU85-F1 with the Ank_2 (“yellow”), F-box-like (“magenta”), and Ras (“firebrick”) Pfam domain highlighted in distinct colors.
Final $PyMOL$ rendering for the largest Acanthamoeba Rab GTPase (L8HB60, 988 aa):

PyMOL rendering of AF-L8HB60-F1 with the Ras_like (“oragne”), PH (“limon”), and Ras2 (“firebrick”) Pfam domain highlighted in distinct colors.

6.6 HMMER exercise

Perform the following tasks, documenting the commands used, and results obtained, in a fashion similar to that used throughout this tutorial

Identify, count, and extract the bona fide Rab GTPase hits for Arabidopsis thaliana and Caenorhabditis elegans found in the $hmmsearch$ run performed on the Swiss-Prot database (uniprot_sprot.fasta file)
Inspect the hits for length and presence of at least one cysteine at their C-termini, document and remove any suspicious ones, add the bona fide ones to the Rab_SPhyd94.faa file, and align the extended FASTA using $clustalo$ with two iterations and Kimura-corrected distances
Manually inspect the MSA from the previous step, remove flanking regions containing mostly gapped sites that accommodate the N- and C-terminal extensions found in some proteins, and build a new Rab-family HMM from the trimmed MSA
Calibrate the new model by interrogating the Swiss-Prot sequences. What would you suggest is a reasonable sequence score cutoff for this new HMM? Present evidence in a fashion similar to that used in previous sections of this tutorial.
Using the new model and score cutoff, how many canonical Rab GTPases are found for Danio rerio, Bos taurus, and Canis familiares when interrogating the uniprot_sprot.fasta file?

7 Recommended readings and references

7.1 Multiple sequence and HMM-HMM alignments

7.2 Clustal-omega

Clustal.org
Sievers et al. 2011. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol. 2011 Oct 11:7:539. doi: 10.1038/msb.2011.75
Sievers & Higgins 2018. Clustal Omega for making accurate alignments of many protein sequences. Protein Sci. 2018. Jan;27(1):135-145. doi: 10.1002/pro.3290
Sievers & Higgins 2014. Clustal Omega. Current Protocols in Bioinformatics

7.3 Profile hidden Markov models and HMMER3

Sean R Eddy (1998). Profile hidden Markov models. Bioinformatics 14(9):755-63
Sean R Eddy (2004). What is a hidden Markov model? Nat Biotechnol 22(10):1315-6
Söding J. (2005). Protein homology detection by HMM-HMM comparison. Bioinformatics 21(7):951-60
Sean R Eddy (2011). Accelerated Profile HMM Searches. PLoS Comput Biol; 7(10):e1002195
HMMER.org
HMMER3.4 user guide

7.4 The Ras superfamily of GTPases, and Rab GTPases in Acanthamoeba castellanii

Krister Wennerberg, Kent L Rossman, Channing J Der (2005). The Ras superfamily at a glance. J Cell Sci. 2005 Mar 1;118(Pt 5):843-6. doi:10.1242/jcs.01660.
Stenmark H. (2009). Rab GTPases as coordinators of vesicle traffic. Nature Reviews Molecular Cell Biology volume 10, 513-525.
Javier Rivera, Julio C. Valerdi-Negreros, Diana M. Vázquez-Enciso, Fulvia-Stefany Argueta-Zepeda, Pablo Vinuesa* 2024. Phylogenomic, structural, and cell-biological analyses reveal that Stenotrophomonas maltophilia replicates in acidified Rab7A-positive vacuoles of Acanthamoeba castellanii. Microbiology Spectrum Mar 5;12(3):e0298823.

Multiple sequence alignments, and profile hidden Markov models

Pablo Vinuesa, CCG-UNAM, Cuernavaca, México; @pvinmex

2024-11-25