The following tutorials teach how to run clustalw2 and clustal-omega from the command line to generate standard, profile-profile, sequence-profile, and HMM-assisted multiple sequence alignments (MSAs).
It will also demonstrate how to build a profile hidden Markov models (HMMs) from an MSA, calibrate, and use it to search for distant homologs with high sensitivity and specificity, with programs from the HMMER3 suite.
All exercises will be performed in a Linux environment. Therefore, this tutorial will also introduce some useful AWK, Bash, and Perl idioms and scripts to analyze the results efficiently from the command line.
This tutorial expects that the user has access to a Unix/Linux environment with the standard Bash shell installed, as well as the clustalw2, clustal-omega and HMMER3 programs installed, and found in the $PATH. These programs are freely available for download from the clustal.org and HMMER.org servers.
Multiple sequence alignments (MSAs) are essential in most bioinformatics analyses that involve comparing homologous sequences. The exact or exhaustive approach to computing an optimal alignment has a computational complexity of \(O(L^N)\) for \(N\) sequences of length \(L\), making it prohibitive for even small numbers of sequences (Sievers et al. 2011). Therefore, all MSA algorithms implement some sort of heuristics (see Katho K. ed. 2021), such as the progressive multiple sequence alignment algorithm first implemented in the the popular Clustal family programs.
Clustal programs are used for multiple sequence alignment of DNA and protein sequences. The software and its algorithms have gone through several iterations, with \(Clustal{\Omega}\) (Omega) being the latest version, published in of 2011 (Sievers et al. 2011). It is available as standalone software, via a web interface, and through a Web server hosted by the European Bioinformatics Institute.
Clustal has been an important bioinformatic software, with two of its academic publications ranking among the top 100 most cited papers of all time.
Here we will demonstrate the use of the current command-line versions: clustalw2/clustalX and clustal-omega to perform:
Both \(Clustalw2\) and \(Clustal{\Omega}\) can read multiple sequence formats, including CLUSTAL and FASTA, and write the computed alignments in even more formats, including STOCKHOLM, NEXUS, and PHYLIP.
The following code snippets will set up the working directory for the tutorials, and fetch or generate symbolic links to the required sequences and scripts.
# i) Change directory into your working directory and make the following new directory to hold the sequence data for the exercises
if [ ! -d sesion3_clustal_y_hmmer ]; then
mkdir sesion3_clustal_y_hmmer && cd sesion3_clustal_y_hmmer
else
cd sesion3_clustal_y_hmmer
fi
# ii) use wget on the command line to fetch the required files from my TIB-filoinfo GitHub repo
wget -c https://github.com/vinuesa/TIB-filoinfo/raw/master/docs/sesion4_alineamientos/sequences_for_alingment.tgz
if [ -s sequences_for_alingment.tgz ]
then
if file sequences_for_alingment.tgz | grep gzip &> /dev/null
then
echo "unpacking sequences_for_aligment.tgz ..."
tar -xzf sequences_for_alingment.tgz
else
echo "ERROR: This is not a gzip file"
fi
else
echo "ERROR: sequences_for_aligment.tgz not found"
fi
# iii) Download the required scripts to your ~/bin directory
base_url=https://raw.githubusercontent.com/vinuesa/TIB-filoinfo/refs/heads/master
if [ -d ~/bin ]; then
cd ~/bin
else
mkdir ~/bin && cd ~/bin
fi
# This is a Bash array holding the list of scripts to be downloaded
scripts=(col_sumStats.sh split_fasta.pl translate_fastas.pl prot2cdnAln.pl convert_aln_format_batch_bp.pl fasta_toolkit.awk extract_N-or_C-terminal_regions_and_compute_counts.pl select_sequences_by_ID.pl)
# Iterate over the array elements, fetching each script with 'wget -c' and making them executable with 'chmod 755'
for f in "${scripts[@]}"
do
wget -c "$base_url"/"$f"
[ -s "$f" ] && chmod 755 "$f"
done
# Change directory back to the previous working directory
cd -
# list the directory contents
pwd && ls
# make a symlink to the Human_Ras_Superfam_proteins_Wennerberg_JCS05.faa sequences and Ras_fam_PF00071.hmm
ln -s /export/space3/users/vinuesa/DBs/sequences/Human_Ras_Superfam_proteins_Wennerberg_JCS05.faa .
ln -s /export/space3/users/vinuesa/DBs/Pfam/Ras_fam_PF00071.hmm .
We are ready to start the hands-on tutorials on running clustal and hmmer from the command line.
Clustalw implements a fast version of the progressive multiple sequence alignment algorithm (sketched below), featuring a fast and simple method for making “guide trees.” These are computed using fast clustering methods (UPGMA or NJ) of a distance matrix, which is calculated from the sequences using either exhaustive (dynamic programming) or heuristic (k-mer distances) methods.
The resulting guide trees are used to define the order of alignment during the later progressive alignment phase. The general idea is to start with alignments of just two sequences, usually the closest ones in the data set. The alignment is then built up by aligning sequences to sequences (s-s), alignments (profiles) with each other (p2p alignments), or sequences to alignments (s2p), according to the topology of the guide tree.
The complexity of the guide tree construction is usually \(O(N^2)\) for \(N\) sequences because all \(N\) sequences have to be compared to each other. Earlier versions of Clustal used fast word based alignments for these comparisons, which made it memory efficient and fast enough to work on personal computers with up to a few hundred sequences (Sievers & Higgins, 2018).
Clustalw can be freely obtained from the Clustalw2 webpage. Clustal 2 comes in two flavors: the command-line version Clustal W, and the graphical version Clustal X. Precompiled executables for Linux, Mac OS X, and Windows are available.
Here we are going to use only the command-line version Clustalw.
Note that:
Below is a compilation of some of the most frequently used \(clustalw\) command-line options:
#====================================================================================== # Basic clustalw usage and options #-------------------------------------------------------------------------------------- # explore the manuals man clustalw clustalw -help clustalw -fullhelp # minimal clustal command line call clustalw -infile=fasta_no_alineado -outfile=archivo_cluwaln.afa # clustalw call with modified parameters clustalw -infile=archivo.fasta -align -pwmatrix=blosum -pwgapopen=12 -pwgapext=0.2 -matrix=blosum -gapopen=12 -gapext=0.2 -outorder=aligned -convert -outfile=archivo_aln1.phy -output=<(phylip|clustal|fasta|nexus)>
The next lines show how to align the the provided FASTA files with GDP sequences from prokaryotes and eukaryotes, writing the resulting multiple sequence alignments (MSAs) in FASTA or Clustal formats.
# 1. default clustalw alignments
clustalw -infile=4_GDP_procar_ualn.faa -outfile=4_GDP_procar_cluAln.faa -output=fasta
clustalw -infile=6_GDP_eucar_ualn.faa -outfile=6_GDP_eucar_cluAln.faa -output=fasta
# 2. This clustalw call uses ~ 10% higher gap-open and gap extension penalties, both for the pairwise and multiple sequence alignment blocks
clustalw -infile=6_GDP_eucar_ualn.faa -outfile=6_GDP_eucar_cluAln2.aln -output=clustal -pwgapopen=13 -pwgapext=0.22 -gapopen=13 -gapext=0.22
# 1. compute a profile to profile alignment of the default alignments
clustalw -profile -profile1=6_GDP_eucar_cluAln.faa -profile2=4_GDP_procar_cluAln.faa -outfile=10_GDP_eucar_prokar_prof2prof.aln -output=clustal
# 2. compute a profile to profile alignment using the eukaryotic 6_GDP_eucar_cluAln2.aln alignment
clustalw -profile -profile1=6_GDP_eucar_cluAln2.aln -profile2=4_GDP_procar_cluAln.faa -outfile=10_GDP_eucar_prokar_prof2prof2.aln -output=clustal
The Clustal format can be conveniently visualized directly in the terminal.
The following command will allow us to compare the N-terminal regions of both profile alignments.
# display the 10_GDP_eucar_prokar.aln MSA written in clustal format
head -15 10_GDP_eucar_prokar_prof2prof.aln 10_GDP_eucar_prokar_prof2prof2.aln
==> 10_GDP_eucar_prokar_prof2prof.aln <== CLUSTAL 2.1 multiple sequence alignment gpdhuman ---------------MASKK----------------VCIVGSGNWGSAIAKIVGGNAAQL gpdarabit ----------------AGKK----------------VCIVGSGDWGSAIAKIVGGNAAQL gpdamouse ----------------AGKK----------------VCIVGSGNWGSAIAKIVGSNAGRL gpdadrome ---------------MADKVN---------------VCIVGSGNWGSAIAKIVGANAAAL gpdacaeel ---------------MSPKK----------------VTIIGSGNWGSAIARIVGSTTKSF gpd1yeast MSAAADRLNLTSGHLNAGRKRSSSSVSLKAAEKPFKVTVIGSGNWGTTIAKVVAENCKGY gpdaecoli -------------MNQRNAS----------------MTVIGAGSYGTALAITLARNGHEV gpdahaein -------------MITSQTP----------------ITVLGAGSYGTALAITFSRNGSPT gpdpseu -----------------------------------------MTSCCGAVAKTRWRN---- gpdabacsu -----------------MKK----------------VTMLGAGSWGTALALVLTDNGNEV . ::* . ==> 10_GDP_eucar_prokar_prof2prof2.aln <== CLUSTAL 2.1 multiple sequence alignment gpdhuman -------------------------------MASKKVCIVGSGNWGSAIAKIVGGNAAQL gpdarabit --------------------------------AGKKVCIVGSGDWGSAIAKIVGGNAAQL gpdamouse --------------------------------AGKKVCIVGSGNWGSAIAKIVGSNAGRL gpdadrome ------------------------------MADKVNVCIVGSGNWGSAIAKIVGANAAAL gpdacaeel -------------------------------MSPKKVTIIGSGNWGSAIARIVGSTTKSF gpd1yeast MSAAADRLNLTSGHLNAGRKRSSSSVSLKAAEKPFKVTVIGSGNWGTTIAKVVAENCKGY gpdaecoli -----------------------------MNQRNASMTVIGAGSYGTALAITLARNGHEV gpdahaein -----------------------------MITSQTPITVLGAGSYGTALAITFSRNGSPT gpdpseu -----------------------------------------MTSCCGAVAKTRWRN---- gpdabacsu ---------------------------------MKKVTMLGAGSWGTALALVLTDNGNEV . ::* . ... TRUNCATED Note that clustal uses the following symbols below each alignment block to highlight conserved residues: * (asterisk) positions that have a single and fully conserved residue : (colon) conserved: conservation between groups of strongly similar properties (score > 0.5 on the PAM 250 matrix) . (period) semi-conserved: conservation between groups of weakly similar properties (score ≤ 0.5 on the PAM 250 matrix) (blank) non-conserved
Notice the long N-terminal extension of the yeast sequence.
What do you think about the two MSAs, which one do you think is more likely correct, and why?
Seaview is a multi-platform, graphical user interface for multiple sequence alignment and molecular phylogeny.
Seaview can be run from the command line, and the options can be found with:
or here: seaview command manual
To visualize an alignment, simply call \(seaview\) with the alignment file name as single argument on the command line:
We will first attempt to perform a default MSA of nine full-length leuA sequences from diverse Bacillales using \(clustalw\)
Lets firs count the sequences, and have a look at them
9 >B_anthracis_Ames 30261500 atgaagcaaattttgttcatggatacgacgctacgtgatggcgaacaatcaccaggagtaaatttaaatgaacaagagaa attgcaaattgcaaggcaattagagagactagggattcatgtcatggaagctggctttgcagcagcttccgaaggggatt ttcaatcggtaaagcgcatcgcaaatacaattcaaaatgctacagttatgagtttagcaagagcgaaagaaagtgatatt cgaagagcatatgaagctgtgaaaggtgcggtatctcctcgtttacacgtatttttagctacgagtgatattcatatgaa gtataaactttgtatgtcaaaagaagatgtattagatagtatttatcgatcggttacacttgggaaatcgttatttccga cagtacaattttcagcagaagatgcaacaagaacagcaagagattttttagctgaagcggtagaagttgcgattcgtgca ggagcgaatgtaattaatattccagatacagttgggtatacgaatccagaagaatattatgctctttttaaatatttaca agaatccgttccttcatatgaaaaagcaattttctcttgtcattgtcatgatgatcttgggatggcggtagcaaattcat tagctgcagttgaaggcggagcgttgcaagtagaaggaacaattaatgggattggagaaagagcagggaatgcggcgtta gaagaggttgcagttgctcttcatattaggaaagatttctataaagcagagccttctatgacgctaaaagaaattaaagc gacaagtacattagtaagcagattaacgggtatggttgtatcgaaaaataaagcgattgttggagcaaatgcattcgctc atgaatcaggcattcatcaagatggtgtattaaaagaagtgacaacatatgaaattattgaaccagcgctagtaggcgaa tctcaaaatctatttgtacttggaaaacattctgggcgtcacgcgtttacagaaaaaatgaaagaacttggctatgaatt tacaaacgacgagcgggatgcagtatttgaagcatttaaaaaattagctgatcgtaagaaggaaattactgaggaagatt tacgagcacttatgcttggtgaagcagcatttgcggctcagcaatataacattacgcaattgcaagtacatttcgtatca aatagtacacaatgtgcgacagttgtactaaaagatgaggaaggaaatgtattcgaagatgcagctactggctctggtag tattgaagcgatttataatgcaatccaaagaattttaggattggaatgtgaattggcggattatcgcatacaatctatta cacaaggtcaagatgcactggctcatgttcatgttgaattaaaagaaggagcacatcaagtatcaggttttggtgttgca caagacgtattggaagcgtcggcaagagcatatgtccacgcagcggggaaattgaaatcttttatccagcttgtgaaata a ... truncated
Do you think the sequence shown above might be a complete CDS? Why?
Align the leuA sequences under default parameters:
# generate the MSA
clustalw -infile=leuA-Bacillales.fas -outfile=leuA-Bacillales.aln
# display the MSA
cat leuA-Bacillales.aln
... truncated B_anthracis_Ames TGTTGAATTAAAAGAA--GGA-GCACATCAAGTATCAGGTTTTGGTGTTG B_thuringiensis_konkukian TGTTGAATTAAAAGAA--GGA-GCACATCAAGTATCAGGTTTTGGTGTTG B_cereus_ATCC14579 TGTTGAATTAAAAGAA--GGT-AGTCACCAAATATCAGGTTTCGGAGTTG B_licheniformis_ATCC_14580 CGTCAGGGTG-ATGAT--CGATGGAAAAGAATCGGGAGGACGCGGGGTTG B_subtilis TGTAAGAGTT-TTGGT--GAACGGAAAAGAATCAGCAGGGCGGGGCATAG B_halodurans TGTTCAAATGGATTAT--CAA-GGAGTCGTCTCAAGTGGACGCGGCACGG O_iheyensis TGTTCAAATT-ATTAT--TAATGGAGAAACAATCAATGGTAGAGGGTCTG L_innocua TGTCGTCATTAAAGATGATAATGGCACTGAATTCCATGGTATCGGTATCG L_monocytogenes_4b_F2365 TGTCGTCATTAAAAATGACAAAGGCGCTGTGTTCCATGGTATCGGTATTG ** * ** ** * B_anthracis_Ames CACAAGACGTATTGGAAGCGTCGGCAAGAGCATATGTCCACGCAGCGGGG B_thuringiensis_konkukian CACAAGACGTATTGGAAGCGTCGGCAAGAGCATATGTCCACGCAGCGGGG B_cereus_ATCC14579 CGCAAGACGTATTGGAAGCGTCGGCGAGAGCTTATGTCCATGCAGCAGGG B_licheniformis_ATCC_14580 CCCAGGATGTTCTCGAAGCATCTGCCAAAGCCTACTTGAATGCTGTGAAC B_subtilis CGCAAGACGTATTAGAAGCATCAGCGAAAGCCTATTTGAACGCAGTAAAC B_halodurans CCCATGATGTGTTAGAAGCGTCAGCAAAAGCCTATTTAAACGCAGTTAAT O_iheyensis CTCAGGATGTGCTTCAAGCGTCTGCAGTAGCATTCATTCAGGCAGTAAAT L_innocua ATTTTGATGTTTTAACAGCTAGTGCTAAAGCTTATTTGCAAGCATCTGGA L_monocytogenes_4b_F2365 ATTTTGACGTTTTAACTGCTAGTGCTAAAGCTTATTTGCAAGCATCAGGC ** ** * ** ** *** * * * ** B_anthracis_Ames AAATTGAAATCTTTTATCCAGCTTGTGAAATAA----------------- B_thuringiensis_konkukian AAATTGAAATCTTTTATCCAGCTTGTGAAATAA----------------- B_cereus_ATCC14579 AAATTAAAATCTCTTATAGCGCTTGTGAAATAA----------------- B_licheniformis_ATCC_14580 CGCTATCTCGTATTGAAGACGAATACGGAAGGATTGTCAAAACAGGCGGC B_subtilis CGTCAATTGGTTTTCCAGTCGAATATGAGCGGATTGAAAAACCACACAGC B_halodurans CG--AACGATTAATCGAAAGAAATATGCAGAAGTTTATGGATCGAA---- O_iheyensis CGTTATTACGT--TCAAAAGAAA-----GCAAATC--TAACTCGACTTGT L_innocua AAAAGCAAATCTACAAGTAAGCA----AGCTGATTTCGAGGAGGTAAAGT L_monocytogenes_4b_F2365 AAAAGCAAAACCGCCAGTAAGCA----AGCTGATTTCGAGGAGGTAAAAT ... truncated
Carefully scrutinize the MSA shown above. What do you think about the alignment’s quality? Provide your comments.
The correct way of aligning CDSs requires their translation, and alignment at the protein level, as shown in the following figure:
Here we assume that all CDSs in the FASTA file are in the same +1 frame.
Lets visualize the translated products:
>B_anthracis_Ames 30261500 MKQILFMDTTLRDGEQSPGVNLNEQEKLQIARQLERLGIHVMEAGFAAASEGDFQSVKRI ANTIQNATVMSLARAKESDIRRAYEAVKGAVSPRLHVFLATSDIHMKYKLCMSKEDVLDS IYRSVTLGKSLFPTVQFSAEDATRTARDFLAEAVEVAIRAGANVINIPDTVGYTNPEEYY ALFKYLQESVPSYEKAIFSCHCHDDLGMAVANSLAAVEGGALQVEGTINGIGERAGNAAL EEVAVALHIRKDFYKAEPSMTLKEIKATSTLVSRLTGMVVSKNKAIVGANAFAHESGIHQ DGVLKEVTTYEIIEPALVGESQNLFVLGKHSGRHAFTEKMKELGYEFTNDERDAVFEAFK KLADRKKEITEEDLRALMLGEAAFAAQQYNITQLQVHFVSNSTQCATVVLKDEEGNVFED AATGSGSIEAIYNAIQRILGLECELADYRIQSITQGQDALAHVHVELKEGAHQVSGFGVA QDVLEASARAYVHAAGKLKSFIQLVK* >B_cereus_ATCC14579 30019549 LFMDTTLRDGEQSPGVNLNEQEKLQIARQLERLGIHVMEAGFAAASEGDFQSVKRIANTI QNATVMSLARAKESDIRRAYEAVKGAVSPRLHVFLATSDIHMKYKLCMSKEDVLNSIHRS VTLGKSLFPTVQFSAEDATRTSRAFLAEAVEIAIRAGANVINIPDTVGYTNPEEYYSLFE YLQESVPSYEKAIFSCHCHDDLGMAVANSLAAVEGGALQVEGTINGIGERAGNAALEEVA VALHIRKDFYKAEPSMTLKEIKATSILVSRLTGMVVPKNKAIVGANAFAHESGIHQDGVL KEVTTYEIIEPALIGESQNLFVLGKHSGRHAFTEKMKELGYEFTNEERDAAFEAFKKLAD RKKEITEEDLRALMLGEAAFAAQQYNITQLQVHFVSNSTQCATVVLKDEEGNVYEDAATG SGSIEAIYNAIQRILGLECELADYRIQSITQGQDALAHVHVELKEGSHQISGFGVAQDVL EASARAYVHAAGKLKSLIALVK* ... truncated
Note that the stop codons were translated as ’*’ symbols. These are not amino acids, and should be removed. This can be easily achieved with a simple \(sed\) one-liner:
# use sed to replace the terminal * symbols by nothing
sed 's/\*$//' Bacillales_translated.faa > k && mv k Bacillales_translated.faa
>B_anthracis_Ames 30261500 MKQILFMDTTLRDGEQSPGVNLNEQEKLQIARQLERLGIHVMEAGFAAASEGDFQSVKRI ANTIQNATVMSLARAKESDIRRAYEAVKGAVSPRLHVFLATSDIHMKYKLCMSKEDVLDS IYRSVTLGKSLFPTVQFSAEDATRTARDFLAEAVEVAIRAGANVINIPDTVGYTNPEEYY ALFKYLQESVPSYEKAIFSCHCHDDLGMAVANSLAAVEGGALQVEGTINGIGERAGNAAL EEVAVALHIRKDFYKAEPSMTLKEIKATSTLVSRLTGMVVSKNKAIVGANAFAHESGIHQ DGVLKEVTTYEIIEPALVGESQNLFVLGKHSGRHAFTEKMKELGYEFTNDERDAVFEAFK KLADRKKEITEEDLRALMLGEAAFAAQQYNITQLQVHFVSNSTQCATVVLKDEEGNVFED AATGSGSIEAIYNAIQRILGLECELADYRIQSITQGQDALAHVHVELKEGAHQVSGFGVA QDVLEASARAYVHAAGKLKSFIQLVK ... truncated
Note that it is critical to conserve the original ordering of the sequences from the input file in the output MSA using the \(-OUTORDER=input\) option.
Display both the correct codon and default MSAs of the leuA CDSs:
# this one is the correct codon/CDS alignment
seaview leuA-Bacillales_cdnaln.fas &
# this one is the default, incorrect codon/CDS alignment
seaview leuA-Bacillales.aln &
The output shown above clearly demonstrates that when aligning CDSs, we should first translate them to proteins, perform the alignment of the translation products, and finally reconstitute the CDS alignment using the protein MSA as a guide or template.
Clustal\({\Omega}\) is the last incarnation of the Clustal family of MSA programs. It offers a significant increase in scalability over previous versions thanks to the implementation of novel algorithms that avoid bottlenecks of the classic progressive multiple sequence alignment algorithms (Sievers et al. 2011).
Clustalw, and earlier Culstal versions, use fast word-based alignments (k-mers) for these comparisons, which made it memory efficient and fast enough to work on personal computers. However, as the number of sequences grows above a few thousand, the \(O(N^2)\) complexity associated with the computation of pairwise distance matrices to generate “guide trees” becomes very time consuming.
\(Clustal\Omega\) implements an \(O(NlogN)\) method called \(mBed\) (Sievers et al. 2011) that allows guide trees of hundreds of thousands of sequences to be computed by restricting the calculation of sequence alignment scores to NLog(N). This is the main algorithmic improvement providing Clustal Omega with the capacity and scalability required for current genomic data sets.
The second main development in \(Clustal\Omega\) was the replacement of the conventional dynamic programming, and profile alignment, with an alignment engine provided by \(HHalign\) capable of aligning profile hidden Markov models (HMMs). HMM-HMM alignments had been shown to have very high accuracy for profile HMM alignment (Söding 2005). The use of \(HHalign\) provides \(Clustal\Omega\) with greatly increased accuracy when compared to earlier Clustal programs, as measured on structure based alignment benchmarks.
This is the paper describing Clustal\({\Omega}\), along with its abstract: Sievers et al. 2011. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol. 2011 Oct 11:7:539. doi: 10.1038/msb.2011.75
ABSTRACT Multiple sequence alignments are fundamental to many sequence analysis methods. Most alignments are computed using the progressive alignment heuristic. These methods are starting to become a bottleneck in some analysis pipelines when faced with data sets of the size of many thousands of sequences. Some methods allow computation of larger data sets while sacrificing quality, and others produce high-quality alignments, but scale badly with the number of sequences. In this paper, we describe a new program called Clustal Omega, which can align virtually any number of protein sequences quickly and that delivers accurate alignments. The accuracy of the package on smaller test cases is similar to that of the high-quality aligners. On larger data sets, Clustal Omega outperforms other packages in terms of execution time and quality. Clustal Omega also has powerful features for adding sequences to and exploiting information in existing alignments, making use of the vast amount of precomputed information in public databases like Pfam.
We will repeat the MSA of the two FASTA files containing four bacterial and six eukaryotic GDP protein sequences using \(clustalo\), called from a Bash \(for\) loop to align all *ualn.faa files found in the current directory.
This \(clustalo\) call performs a profile-profile alignment of the two input alignments.
This \(clustalo\) call performs a sequence (4_GDP_procar_ualn.faa) to profile (6_GDP_eucar_ualn_cluOaln.phy) alignment.
The \(paste\) command allows us to conveniently display the two MSAs side by side for their comparision.
Several differences can be seen between the p2p and s2p GDP alignments shown above. Which one do you thinks is the best?
To gain evidence in favor for one or the other, we will perform a new \(clustalo\) alignment on the full 10 GDP sequence set using iterations and Kimura correction.
Iterations can be used in an attempt to improve the alignment by a process of repeated refinement (Sievers & Higgins, 2014).
Iterative refinement can be accomplished in two ways:
This process is illustrated in the bottom part of Figure shown below (Sievers & Higgins, 2014).
The second iteration scheme uses a profile HMM, derived from an alignment of the input sequences, as an external profile. The first alignment gives an outline of the positions where residues are more likely to align in a second alignment round.
Both iteration methods can be carried out a number of times, separately or in combination. The default is to combine them, as will be shown in the protocol below.
# 1. combine the source FASTA files into a single one named 10_GDP_procar_eukar_ualn.fas
cat 4_GDP_procar_ualn.faa 6_GDP_eucar_ualn.faa > 10_GDP_procar_eukar_ualn.fas
# 2. Perform clustalo MSA with two iterations and using Kimura-corrected distances
clustalo -i 10_GDP_procar_eukar_ualn.fas --iter 2 --use-kimura -o 10_GDP_eucar_prokar_cluoAln2iter.aln --outfmt clu --threads 4
Clustal Omega will first construct a guide tree and align the sequences in the order specified by the guide tree. This alignment is then used in two ways:
During the second round of alignment (the first iteration), sequences are aligned to the internally constructed profile HMM, pseudo-count information is transferred to the unaligned sequences, and the ‘enhanced’ sequences are aligned to each other in the order specified by the newly constructed guide tree. The final alignment is written in clustal format to 10_GDP_eucar_prokar_cluoAln2iter.aln .
The first iteration will be followed by another guide-tree construction and HMM pseudo-count transfer.
While the alignment accuracy is expected to be higher for the iterated alignments, so is also the response time. Every round of iteration can take up to three times longer than the initial alignment. For example, if the initial alignment took 1 min, then an alignment that was iterated twice might take up to 7 min; that is, 1 min for the initial alignment, and then two times 3 min for the iterations.
Unfortunately, experience has shown that the alignment accuracy cannot be improved indefinitely by iteration. Good improvements are often achieved with one or two iterations (Sievers & Higgins, 2014).
Let us compare the standard and iterated alignments of the 10_GDP_eucar_prokar sequences:
As shown above, there are minimal differences in the p2p vs. the iteration alignment with Kimura (corrected) distances performed on the full set of 10 GDP sequences. Can you spot them?
Based on the high similarity of both \(Clustal\Omega\) alignments, we might conclude that the p2p and the full alignment with two iterations are better than the s2p alignment presented in a previous section.
As discussed previously, CDSs should be aligned using their aligned translation products as a guide. Here is how to do it using \(clustal\Omega\):
# 1. Translate the leuA-Bacillales.fas file with the aid of the translate_fastas.pl Perl script
translate_fastas.pl -e fas -t 11
# 2. Remove the trailing *'s resulting from the translation of stop codons present in the DNA sequences
sed 's/\*$//' Bacillales_translated.faa > k && mv k Bacillales_translated.faa
# 3. Align the tranlation products keeping the input order
clustalo -i leuA-Bacillales_translated.faa -o leuA-Bacillales_translated_cluo.faa --output-order input-order --threads 4
# 4 Use the prot2cdnAln.pl Perl script to generate the codon alignment using the protein alignment as a guide
prot2cdnAln.pl leuA-Bacillales.fas leuA-Bacillales_translated_cluo.faa
Finally, we will visualize the resulting codon alignment of the leuA CDSs at the DNA and protein levels.
As previously introduced, a key feature of \(Clustal\Omega\) is that it can take an external profile hidden Markov model of a protein family, and use it to aid in the alignment highly divergent members.
As mentioned by Sievers & Higgins (2014), the main problem with progressive multiple sequence alignment is that alignments that were made during the early stages of the MSA cannot be changed at a later stage. This has been summarized by the phrase “once a gap, always a gap”. Therefore, it would be very advantageous if the algorithm could be provided with some degree of ‘foresight’ to know in advance which alignment columns are conserved and important in the alignment (Sievers & Higgins, 2014). Some foresight can be communicated to \(Clustal\Omega\) in the form of an external, previously contstructed profile hidden Markov model (profile HMM) of the protein family to be aligned.
The use of profile HMMs is particularly natural within \(Clustal\Omega\), as its main alignment algorithm is based on alignment of profile HMMs (Söding, 2005; Sievers 2011). A profile HMM can be supplied to \(Clustal\Omega\) as an ‘external profile’. The alignment of the unaligned residues will be guided by the distribution of residues and gaps in the HMM. Residues of the unaligned sequences will be aligned to the positions within the HMM that best fit the signature, and this is used to align those residues more accurately.
Obviously, the accuracy of the HMM will affect the quality of the HMM-guided MSA. However, experience has shown that external profile HMM information is rarely detrimental (Sievers et al., 2011). Use of HMMs as external profiles is therefore very robust. Its use is most helpful in the alignment of highly diverged sequences with low average pairwise sequence similarity. Such scenarios frequently lead to early misalignment and/or gaps, which can be partly avoided using background information provided by the HMM.
This procedure can be used if an existing multiple alignment of a set of sequences that has been carefully maintained is available. It can then be converted to a HMM and used to help align new sequences. Alternatively, HMMs can be downloaded from alignment databases such as PFAM. This procedure is also used to iterate the multiple alignment process, as described in the previous section.
As a final exercise, we will align the classic set of 146 human Ras GTPases compiled by Wennerberg et al. (2005) with the aid of the PFAM00071.hmm model for the Ras superfamily, fetched from PFAM. Note that on the chaac server the alignment takes ~5 seconds to complete using 4 cores, which is almost twice the time it takes to align under the default \(clustalo\) parameterization, as demonstrated below:
# default clustalo alignment
time clustalo -i Human_Ras_Superfam_proteins_Wennerberg_JCS05.faa -o Human_Ras_Superfam_proteins_Wennerberg_JCS05_cluo.afa --outfmt fasta --threads 4
# clustalo using the Ras_fam_PF00071.hmm as HMM input
time clustalo -i Human_Ras_Superfam_proteins_Wennerberg_JCS05.faa --hmm-in Ras_fam_PF00071.hmm -o Human_Ras_Superfam_proteins_Wennerberg_JCS05_PF00071.afa --outfmt fasta --threads 4
# default clustalo alignment real 0m2.912s user 0m7.995s sys 0m0.136s # clustalo using the Ras_fam_PF00071.hmm as HMM input real 0m5.032s <== just ~ 5 seconds with 4 threads user 0m14.801s sys 0m0.151s
Below you’ll see a screenshot of the central part of the HMM-guided alignment, the region modeled by the PFAM00071 profile, for a fraction of the 146 sequences, and displayed with \(seaview\).
# visualize the PF00071.hmm-assisted alignment
seaview Human_Ras_Superfam_proteins_Wennerberg_JCS05_PF00071.afa &
The alignment shown above was used to compute the maximum likelihood tree for the Ras superfamily presented below (Vinuesa, unpubl.)
In the HMMER section that follows, we will focus on the Rab family of small GTPases, the largest one within the Ras superfamily.
\(Seaview\) can be used to easily
extract certain alignment regions or columns under ‘Sites > Create
Set > OK’. Then left-click on the white ‘X’ label immediately
adjacent to the first conserved alignment column that you would like to
keep, followed by a right click on the preceding ‘X’. This will exclude
the N-terminal region flanking your region of interest (ROI). Next, move
to C-terminal part of your ROI and repeat the left- and right-clicks to
exclude the non-desired C-terminal region. Then type CTRL-A
to select all sequences and save them under ‘Save > Save selection’.
Note that you can un-select particular sequences before saving them to
disk by left-clicking on their IDs.
The extracted alignment blocks can then be used to compute sequence logos and simple profiles that summarize the residue frequency distribution in such blocks using ad hoc Perl and R scripts, or Web-servers such as WebLogo 3.
Below the logo and motif representations of the first highly conserved (G1) region of Ras GTPases, including non-gaped flanking columns.
A simpler, less informative way to represent these highly conserved signatures is by a simple sequence motif summarizing the pattern, as shown below:
# sequence motif for the highly conserved G1 region of Ras GTPases with max. degeneracy set to 4 xxxxxxxx[GAE]xxxx[GAV][KT][TSGK]xxxxxxxxx
Perform the following tasks, documenting the commands used, and the results obtained, in a fashion similar to that used in previous sections of this tutorial.
Important note: all \(clustalo\)
alignments should be run with four threads using option –threads
4