1 About the tutorials
- 1.1 ASSUMPTIONS
  - 1.1.1 RELATED RESOURCES
- 1.2 AUTHOR
  - 1.2.1 SOURCE REPOSITORY
  - 1.2.2 LICENSE
2 THE BLAST SUITE OF PROGRAMS - OVERVIEW AND PROGRAM SELECTION
3 Karlin-Altschul Statistics: the blast E-value for an HSP
4 PRELIMINARIES - setting up working directories
5 Searching a DNA database with DNA queries (BLASTN)
- 5.1 EXERCISE 1 - classify novel 16S rDNA sequences at the genus level using blastn
6 BLAST+ with protein queries against a protein database (BLASTP)
7 Translating BLAST ($BLASTX$) - translate a DNA sequence to search a protein database
- 7.1 EXERCISE 5: Identification of CDSs in DNA sequences using $BLASTX$
- 7.2 Assign functional annotations to the 3 CDSs encoded in the Am1 class-I integron gene cassettes
  - 7.2.1 HOMEWORK 7
8 Recommended readings

1 About the tutorials

The following tutorials teach how to run NCBI [legacy-blast and] BLAST+ from the command line, along with useful AWK, Bash, Perl and R code and idioms, including 1-liner programs and fully fledged scripts to parse and analyze results efficiently. Particularly helpful is the advanced compute_RBH_clusters.sh script.

1.1 ASSUMPTIONS

This tutorial expects that the user has access to a Unix/Linux environment with the standard Bash shell installed, as well as the [legacy- and] blast+ software suite(s), which are freely available for download from the NCBI FTP server

For further instruction on installing and using the BLAST+ suite, read NCBI’s BLAST Book and NCBI’s The Statistics of Sequence Similarity Scores tutorial.

1.2 AUTHOR

These tutorials were designed and written by Pablo Vinuesa, CCG-UNAM, Cuernavaca, Mexico (X - @pvinmex), for the International Workshops on Bioinformatics, TIB2024, his courses taught at the Bachelor’s Program in Genome Science, LCG-UNAM, and the Workshop on Genome Sciences, Facultad de Ciencias UNAM.

1.2.1 SOURCE REPOSITORY

This material is distributed from the following TIB-filoinfo GitHub repository.

1.2.2 LICENSE

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0

2 THE BLAST SUITE OF PROGRAMS - OVERVIEW AND PROGRAM SELECTION

BLAST finds regions of similarity between a query and subject sequences. The program compares nucleotide or protein query (problem) sequences to sequence databases and calculates the statistical significance of the resulting query-subject pairwise alignments expressed as an Expect (E-value).

To run a BLAST SEARCH we must select the proper blast program based on the nature of our query sequence(s) and that of the database to search for homologs.

query	database	program	note
DNA	DNA	blastn	-
PROT	PROT	blastp	-
DNA	PROT	blastx	The DNA query is translated in the 6 frames and the products searched against protein DB
PROT	DNA	tblastn	The DNA DB (genome) is translated in the 6 frames and queried with protein sequences
DNA	DNA	tblastx	The DNA query and DNA DB are both translated in the 6 frames and searches performed at the protein level

Table 1. Five frequently used BLAST programs.

Figure 1. The four standard BLAST programs most frequently used for database searching.

Query sequences are stored in a FASTA-formatted file. In addition to the programs listed in table 1, the blast+ software suite also provides the makeblastdb and blastdbcmd programs to make custom DNA or protein databases (makeblastdb) and retrieve the hits from the corresponding blast database found in a blast search (blastdbcmd).

The complete set of blast+ binaries in the ncbi-blast-2.14.1+-x64-linux.tar.gz release are:

blastdb_aliastool
blastdbcheck
blastdbcmd
blast_formatter
blast_formatter_vdb
blastn
blastn_vdb
blastp
blast_vdb_cmd
blastx
cleanup-blastdb-volumes.py
convert2blastmask
deltablast
dustmasker
get_species_taxids.sh
legacy_blast.pl
makeblastdb
makembindex
makeprofiledb
psiblast
rpsblast
rpstblastn
segmasker
tblastn
tblastn_vdb
tblastx
update_blastdb.pl
windowmasker

3 Karlin-Altschul Statistics: the blast E-value for an HSP

To assess whether a given alignment constitutes evidence for homology, it helps to know how strong an alignment can be expected from chance alone.

In 1990, Samuel Karlin and Stephen Altschul published a paper describing the statistics underlying local alignment scores based on the hit expectancy value or E-value.

\[ E = kmne^{-{\lambda}S} \] This equation states that the number of alignments expected by chance ($E$) during a blast search for a given high-scoring pair (HSP) with a raw score ($S$), is a function of the size of the search space (effective query sequence length $m$ * effective length of the database $n$ (in number of residues)), the normalized score of the hit (${\lambda}*S$), and a minor constant $k$.

An $E-value$ reports how many alignments with a given score are expected by chance
A $P-value$ tells how often one can expect to see such an alignment by chance

The following formula provides the relationship between both values:

\[P = 1 -e^{-E}\]

Note that for values $<0.001$, the E-value and the P-value for a given HSP are identical, as we will demonstrate graphically in the section “Using base R to explore BLAST tabular output”.

Read the handy and straightforward NCBI’s Statistics of Sequence Similarity Scores tutorial to learn more about KA-statistics.

I have also found these scribes on BLAST and Karlin-Altschul Statistics from Brown University useful.

4 PRELIMINARIES - setting up working directories

The following code will set up our working directories for the tutorials and fetch the required sequences.

# i) Download or copy the sequence data to your working directories for this practice
cd && [ ! -d sesion2_blast ] && mkdir sesion2_blast && cd sesion2_blast

# ii) use wget on the command line to fetch the required files from my TIB-filoinfo GitHub repo
wget -c https://github.com/vinuesa/TIB-filoinfo/raw/master/docs/sesion3_BLAST/data/16S_4blastN.tgz
wget -c https://github.com/vinuesa/TIB-filoinfo/raw/master/docs/sesion3_BLAST/data/gene_discovery_and_annotation_using_blastx.tgz
wget -c https://github.com/vinuesa/TIB-filoinfo/raw/master/docs/sesion3_BLAST/data/UniProt_proteomes.tgz

# iii) Generate two subdirectories to hold the data and analysis results
mkdir BLAST_DB_AA BLAST_DB_NT UniProt_proteomes

# iv) Save the path to our parental working directory in a variable
wkdir=$(pwd) 
echo $wkdir

# v) Move each *tgz file to the proper dir
mv 16S_4blastN.tgz BLAST_DB_NT
mv UniProt_proteomes.tgz UniProt_proteomes
mv gene_discovery_and_annotation_using_blastx.tgz BLAST_DB_AA

We are set to start the hands-on tutorials on running BLAST from the command line.

5 Searching a DNA database with DNA queries (BLASTN)

5.1 EXERCISE 1 - classify novel 16S rDNA sequences at the genus level using blastn

The exercise aims to teach the basics of a BLAST analysis. We will generate an indexed BLAST database of nucleotide sequences and interrogate it using blastn.

We will use the following data for the exercise.

Query sequences: 16S_problema.fna, correspond to partial 16S rDNA sequences generated in our lab from environmental microbes recovered from contaminated soils and rivers in Morelos, Mexico. Part of these sequences are reported in Ocha-Sánchez & Vinuesa, 2017.
Reference sequences to build a blastn-searchable DB: 16S_seqs4_blastDB.fna. This is a tiny portion of the bacterial type strain 16S rRNA sequences fetched from RDPII.

The objective of the exercise is to classify the query sequences in 16S_problema.fna at the genus level based on BLAST statistics.

5.1.1 Exploring the query and reference sequences

Move into the directory holding the nucleotide sequences and explore both files using standard shell filtering tools and modify the FASTA headers in 16S_seqs4_blastDB.fna to make them suitable for formatdb|makeblastdb indexing.

# We need to untar & uncompress (gunzip) the compressed tar file
cd BLAST_DB_NT/ && tar -xvzf 16S_4blastN.tgz

# 1) explore the fasta headers: 
grep '>' 16S_seqs4_blastDB.fna |less

# 1.1) How many genera and species per genus are found in the source file 16S_seqs4_blastDB.fna?
# i. the genera
grep '>' 16S_seqs4_blastDB.fna | cut -d_ -f1 | sort | uniq -c | sort -nrk1

# ii. the species
grep '>' 16S_seqs4_blastDB.fna | cut -d_ -f1,2 | sort | uniq -c | sort -nrk1

5.1.2 Using Perl 1-liners to generate FASTA headers suitable for database indexing with makeblastdb

# 1.2) Are the sequences in a suitable format for makeblastdb indexing?
#      What command would you use to inspect the sequences?

# 1.2.1)  generate a proper FASTA header for db indexing
perl -pe 'if( /^>/ ){ $c++; s/>/>gnl|16S_DB|$c / }' 16S_seqs4_blastDB.fna | grep '^>' 
perl -pe 'if( /^>/ ){ $c++; s/>/>gnl|16S_DB|$c / }' 16S_seqs4_blastDB.fna > 16S_seqs4_blastDB.fnaed

5.1.3 Running [formatdb]|makeblastdb to generate an indexed blast database

Note: The commented command lines in the following blocks provide the code to run legacy blast. The exercises are demonstrated using the blast+ suite of programs.

You can display the compact or detailed help formats of any blast+ program using blastprogramx -h or blastprogramx -help, respectively.

#formatdb -i 16S_seqs4_blastDB.fnaed -p F -o T

# display the compact help format
makeblastdb -h

# display the detailed help format
makeblastdb -help

# run makeblastdb to generate an indexed database of nucleotide sequences
makeblastdb -in 16S_seqs4_blastDB.fnaed -dbtype nucl -parse_seqids

5.1.4 Running a default blastn search

The simplest blastn (or any other standard blast search program) invocation requires only two parameters: blastn -query <FASTA_file> -db <database_file>.

The following blastn call will search the 16S_seqs4_blastDB.fnaed database we generated in the previous section with the 16S_problema.fna query sequences, using all default values of the search parameters, as defined in the blastn -help output. This includes the standard output format familiar to users of the Web implementation of NCBI-BLAST.

The output (truncated to save space) has the following sections:

header
one-line hit descriptions
alignments
footer

blastn -query 16S_problema.fna -db 16S_seqs4_blastDB.fnaed

BLASTN 2.14.1+


Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb
Miller (2000), "A greedy algorithm for aligning DNA sequences", J
Comput Biol 2000; 7(1-2):203-14.



Database: 16S_seqs4_blastDB.fnaed
           604 sequences; 880,434 total letters



Query= 13

Length=1359
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

16S_DB:480 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928        2475    0.0  
16S_DB:573 Mycobacterium_arupense_T|AR30097|DQ157760                  2453    0.0  
16S_DB:501 Mycobacterium_hiberniae_T|ATCC_9874|X67096                 2407    0.0  
16S_DB:592 Mycobacterium_senuense_T|05-832|DQ536408                   2355    0.0  
16S_DB:583 Mycobacterium_kumamotonense_T|CST7274_(=_GTC2729)|AB23...  2350    0.0  
16S_DB:579 Mycobacterium_pallens_T|czh-8|DQ370008                     2252    0.0  
16S_DB:531 Mycobacterium_asiaticum_T|ATCC_25276|AF480595              2244    0.0

... TRUNCATED


>16S_DB:480 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928
Length=1466

 Score = 2475 bits (1340),  Expect = 0.0
 Identities = 1352/1359 (99%), Gaps = 2/1359 (0%)
 Strand=Plus/Plus

Query  1     GTACTCGAGTGGCGAACGGGTGAGTAACACGTGGGTGATCTGCCCTGCATCTTGGGATAA  60
             |||||||||||||||||||||||||||||||||||||||||||||||||  |||||||||
Sbjct  48    GTACTCGAGTGGCGAACGGGTGAGTAACACGTGGGTGATCTGCCCTGCACTTTGGGATAA  107

Query  61    GCCTGGGAAACTGGGTCTAATACCGAATAGGACCGCATGCTGCATGGTGTGTGGTGGAAA  120
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  108   GCCTGGGAAACTGGGTCTAATACCGAATAGGACCGCATGCTGCATGGTGTGTGGTGGAAA  167

Query  121   GCTTTTTGCGGTGTGGGATGGGCCCGCGGCCTATCAGCTTGTTGGTGGGGTAATGGCCTA  180
             || |||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  168   GC-TTTTGCGGTGTGGGATGGGCCCGCGGCCTATCAGCTTGTTGGTGGGGTAATGGCCTA  226

Query  181   CCAAGGCGACGACGGGTAGCCGGCCTGAGAGGGTGTCCGGCCACACTGGGACTGAGATAC  240
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  227   CCAAGGCGACGACGGGTAGCCGGCCTGAGAGGGTGTCCGGCCACACTGGGACTGAGATAC  286


... TRUNCATED

>16S_DB:102 Aeromonas_enteropelogenes_T|DSM_6394|X71121
Length=1490

 Score = 920 bits (498),  Expect = 0.0
 Identities = 970/1204 (81%), Gaps = 23/1204 (2%)
 Strand=Plus/Plus

Query  153   ATCAGCTTGTTGGTGGGGTAATGGCCTACCAAGGCGACGACGGGTAGCCGGCCTGAGAGG  212
             || |||| ||||||| |||||||||  |||||||||||||    |||| || ||||||||
Sbjct  235   ATTAGCTAGTTGGTGAGGTAATGGCTCACCAAGGCGACGATCCCTAGCTGGTCTGAGAGG  294

Query  213   GTGTCCGGCCACACTGGGACTGAGATACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGG  272
              ||  | |||||||||| ||||||| |||| |||||||||||||||||||||||||||||
Sbjct  295   ATGATCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGG  354

Query  273   AATATTGCACAATGGGCGCAAGCCTGATGCAGCGACGCCGCGTGGGGGATGACGGCCTTC  332
             |||||||||||||||| | || ||||||||||| | |||||||| | || || |||||||
Sbjct  355   AATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTC  414

Query  333   GGGTTGTAAACCTCTTTCAGTATCGGCG-AAGCTCCA-AGGT--TTTCT-TTGG-GGTGA  386
             |||||||||| | |||||||    |  | ||| ||   || |  | |||  |||  ||||
Sbjct  415   GGGTTGTAAAGCACTTTCAGCGAGGAGGAAAGGTCAGTAGCTAATATCTGCTGGCTGTGA  474

Query  387   CGGTAGGTACAGAAGAAGCACCGGCCAACTACGTGCCAGCAGCCGCGGTAATACGTAGGG  446
             || ||    |||||||||||||||| |||| |||||||||||||||||||||||| ||||
Sbjct  475   CGTTACTCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGG  534

... TRUNCATED


Lambda      K        H
    1.33    0.621     1.12 

Gapped
Lambda      K        H
    1.28    0.460    0.850 


Effective search space used: 946715250


  Database: 16S_seqs4_blastDB.fnaed
    Posted date:  Aug 6, 2022  3:00 PM
  Number of letters in database: 880,434
  Number of sequences in database:  604


Matrix: blastn matrix 1 -2
Gap Penalties: Existence: 0, Extension: 2.5

Note that the header and footer sections of the standard blast output contain essential information (m, n, k, $\lambda$) required to compute the E-value (see Karlin-Altschul equation $E = kmne^{-{\lambda}S}$**, as detailed in a previous section).

5.1.4.1 AWK one-liner to extract key statistical parameters of a blast[npx] run

The following little AWK one-liner helps extract key statistical parameters from the standard output of a blast[npx] run. The AWK program has the following structure: /pattern/, /pattern/; /pattern/, /pattern/{action}{action}.

blastn -query 16S_problema.fna -db 16S_seqs4_blastDB.fnaed -num_alignments 1 | awk '/Database/, /total letters/; /^Lambda/, /Extension:/{seen++; print}{ if(seen>5) exit}'

Database: 16S_seqs4_blastDB.fnaed
           604 sequences; 880,434 total letters
Lambda      K        H
    1.33    0.621     1.12 

Gapped
Lambda      K        H
    1.28    0.460    0.850

5.1.5 Identify the single best-hit for each query sequence

A standard blastn search output can be lengthy and excessively wordy. For the aim of the exercise, it is more convenient to use a tabular output format reporting only the best hit found in the database, as we want to classify our unknown 16S rDNA sequences based on their best database hit.

This can be easily achieved with the following blastn call:

# 2.1) run blastn (blastall -p blastn), and get only the best hit (-b 1) in tabular format (-m 8).
# blastall -p blastn -i 16S_problema.fna -b 1 -a 6 -d 16S_seqs4_blastDB.fnaed -m 8 > 16S_problema_blastN_b1_m8.tab
blastn -query 16S_problema.fna -db 16S_seqs4_blastDB.fnaed -outfmt 6 -num_alignments 1 > 16S_problema_blastN_maln1_m6.tab

# 2.2) explore the table structure
head 16S_problema_blastN_maln1_m6.tab

13  16S_DB:480  99.485  1359    5   2   1   1359    48  1404    0.0 2475
18  16S_DB:494  99.033  1344    10  3   1   1344    49  1389    0.0 2416
2   16S_DB:549  99.552  1340    5   1   4   1343    81  1419    0.0 2440
27  16S_DB:546  100.000 1342    0   0   1   1342    78  1419    0.0 2479
36  16S_DB:480  99.705  1358    3   1   1   1358    48  1404    0.0 2490
4   16S_DB:600  99.335  1354    4   4   1   1353    79  1428    0.0 2446
48  16S_DB:484  98.375  1354    20  1   2   1355    48  1399    0.0 2390
53  16S_DB:548  99.925  1342    1   0   1   1342    78  1419    0.0 2473
62  16S_DB:496  99.106  1342    8   4   1   1342    85  1422    0.0 2409
7   16S_DB:600  98.736  1345    14  3   1   1343    79  1422    0.0 2386

As can be seen in the output above, the -outfmt 6 tabular format has no header.

5.1.5.1 The twelve fields of a standad tabular blast output (-outfmt 6|[-m 8])

The default -outfmt 6 blast output contains the following 12 fields

# The default m6 blast output fields
#=====================================================================================
# >>> The default -outfmt 6 | -m 8 blast output contains the following 12 fields   <<<     
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
#-------------------------------------------------------------------------------------
# 1.     qseqid      query (e.g., gene) sequence id
# 2.     sseqid      subject (e.g., reference genome) sequence id
# 3.     pident      percentage of identical matches
# 4.     length      alignment length
# 5.     mismatch    number of mismatches
# 6.     gapopen     number of gap openings
# 7.     qstart      start of alignment in query
# 8.     qend        end of alignment in query
# 9.     sstart      start of alignment in subject
#10.     send        end of alignment in subject
#11.     evalue      expect value
#12.     bitscore    bit score
#-------------------------------------------------------------------------------------

5.1.6 Adding a header to the standard tabular output: -outfmt 6 [-m 8 in legacy blast]

# 2.3 Add a header to the standard output; print to file and append to 16S_problema_blastN_maln1_m6.tab
echo -e "qseqid\tsseqid\tpident\tlength\tmismatch\tgapopen\tqstart\tqend\tsstart\tsend\tevalue\tbitscore" > standard_blast_fields.tsv
cat standard_blast_fields.tsv 16S_problema_blastN_maln1_m6.tab > t && mv t 16S_problema_blastN_maln1_m6.tab

head -20 16S_problema_blastN_maln1_m6.tab | column -t

# 2.4) how many unique hits are there?
cut -f 2 16S_problema_blastN_maln1_m6.tab | sed '1d' | sort | uniq -c

5.1.7 Retrieve all the hits from the database using $blastdbcmd$

#   Generate the list of hit IDs for each query
#      NOTE: the hits are in the same order as the query sequences
cut -f2 16S_problema_blastN_maln1_m6.tab | sed '1d' > IDs4blastdbcmd.txt

#blastdbcmd -i IDs4blastdbcmd.txt -d 16S_seqs4_blastDB.fnaed | grep '>' |cut -d\| -f3 | cut -d' ' -f2 > hit_sequences_b1.fna
blastdbcmd -entry_batch IDs4blastdbcmd.txt -db 16S_seqs4_blastDB.fnaed > best-hit_sequences.fna

5.1.8 Generate a table containing the query sequence IDs, the subject headers and basic alignment statistics using Linux tools

# 1) Generate a list of the hits from the FASTA headers
grep '>' best-hit_sequences.fna | cut -d' ' -f2 > hit_sequence_headers.txt

# 2) Classify our query sequences, based on the subject headers of each best hit, using a simple tabular output format
cut -f1 16S_problema_blastN_maln1_m6.tab | sed '1d' > problem_seqs.txt
paste problem_seqs.txt hit_sequence_headers.txt > classified_16S_problema_sequences.tab

head -5 classified_16S_problema_sequences.tab

13  Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928
18  Mycobacterium_smegmatis_T|ATCC_19420|AJ131761
2   Mycobacterium_peregrinum_T|CIP_105382T|AY457069
27  Mycobacterium_fortuitum_T|CIP_104534T|AY457066
36  Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928

5.1.9 Generate a table containing the query sequence IDs, the subject headers, and basic alignment statistics using process substitution to avoid writing intermediate files

A more efficient and idiomatic (“Bashish”) code to achieve the same result, without having to write intermediate files, is the use of “process substitution”, which has this general syntax: <(any comamnd(s) or pipeline here). Process substitution replaces the command with its output, treating it as if it were stored in a file. This output can therefore be used by commands like $cat$, $paste$, or any other that works on files. Let us see it in action:

paste <(cut -f1 16S_problema_blastN_maln1_m6.tab | sed '1d') <(grep '>' best-hit_sequences.fna | cut -d' ' -f2) | head | column -t

13  Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928
18  Mycobacterium_smegmatis_T|ATCC_19420|AJ131761
2   Mycobacterium_peregrinum_T|CIP_105382T|AY457069
27  Mycobacterium_fortuitum_T|CIP_104534T|AY457066
36  Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928
4   Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270
48  Mycobacterium_gordonae_T|ATCC_14470.|X52923
53  Mycobacterium_neworleansense_T|ATCC_49404|AY457068
62  Mycobacterium_chitae_T|ATCC_19627|X55603
7   Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270
... TRUNCATED

5.1.10 Add basic BLAST statistic information to the summary table using Bash’s “process substitution” idiom

The previous result looks great but needs to include critical statistical information about the robustness of the hits. Let us use the “process substitution” trick that we learned in the previous example to add extra fields from the blast output table (pident, length, e-value, bitscore) to our summary table, including a header.

# 3) add also columns 3, 4, 11, 12 (pident, length,e-value & bit score) to the tabular output, including a header
cat <(echo -e "sid\tpid\tlen\tEval\tscore\ttax") <( paste <(cut -f1,3,4,11,12 16S_problema_blastN_maln1_m6.tab | sed '1d') <(grep '>' best-hit_sequences.fna | cut -d' ' -f2) | head) | column -t

sid  pid      len   Eval  score  tax
13   99.485   1359  0.0   2475   Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928
18   99.033   1344  0.0   2416   Mycobacterium_smegmatis_T|ATCC_19420|AJ131761
2    99.552   1340  0.0   2440   Mycobacterium_peregrinum_T|CIP_105382T|AY457069
27   100.000  1342  0.0   2479   Mycobacterium_fortuitum_T|CIP_104534T|AY457066
36   99.705   1358  0.0   2490   Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928
4    99.335   1354  0.0   2446   Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270
48   98.375   1354  0.0   2390   Mycobacterium_gordonae_T|ATCC_14470.|X52923
53   99.925   1342  0.0   2473   Mycobacterium_neworleansense_T|ATCC_49404|AY457068
62   99.106   1342  0.0   2409   Mycobacterium_chitae_T|ATCC_19627|X55603
7    98.736   1345  0.0   2386   Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270
... TRUNCATED

Note: we are using Linux commands to achieve a task that can be performed directly within BLAST, using a customized tabular output, as we will see in the final section on BLASTN, and explained in the blast[n|p] -help.

5.1.11 Filter out high-confidence classifications

It is important to realize that finding a hit in the database does not necessarily mean that it supports a reliable identification or classification of the query sequence. In order to have a reasonably high confidence in the classification of 16S rDNA sequences at the genus level, based on the best $blastn$ hit, it should at least satisfy the following conditions:

alignment length >= 500 bp
pident >= 95%

5.1.11.1 AWK is a very handy and efficient tool for filtering BLAST output tables

Let us first explore the BLAST table to identify hits that do not satisfy the two criteria defined above (pident >= 95% && length >= 500). $AWK$ is ideal for this task.

5.1.11.1.1 identify hits with < 95% identity OR short alignments (length < 500) using $AWK$ patterns

# we could use any of the following lines to print the results, including a header
# cat <(cat standard_blast_fields.tsv) <(awk 'BEGIN{FS=OFS="\t"}$3 < 95 || $4 < 500' 16S_problema_blastN_maln1_m6.tab) | column -t
awk 'BEGIN{FS=OFS="\t"; print "qseqid\tsseqid\tpident\tlength\tmismatch\tgapopen\tqstart\tqend\tsstart\tsend\tevalue\tbitscore"}$3 < 95 || $4 < 500' 16S_problema_blastN_maln1_m6.tab | column -t

qseqid                            sseqid      pident   length  mismatch  gapopen  qstart  qend  sstart  send  evalue    bitscore
1367SAA005_14Cip-FD1.b1.ab1_638   16S_DB:86   93.218   634     39        4        1       632   37      668   0.0       929
1367SAA005_A17Cip-FD1.b1.ab1_171  16S_DB:86   95.930   172     6         1        1       171   35      206   2.51e-76  278
1367SAA005_A6BSm-FD1.b1.ab1_477   16S_DB:86   98.745   478     5         1        1       477   37      514   0.0       848
1367SAA005_S13-FD1.b1.ab1_121     16S_DB:534  88.333   120     11        3        3       121   64      181   2.33e-35  141
1367SAA005_S2CbTm-FD1.b1.ab1_84   16S_DB:269  97.647   85      1         1        1       84    47      131   1.15e-36  145
1367SAA005_S4CbTm-FD1.b1.ab1_172  16S_DB:269  100.000  172     0         0        1       172   52      223   1.49e-88  318
1367SAA005_S6Sm-FD1.b1.ab1_347    16S_DB:184  99.143   350     0         3        1       347   40      389   0.0       627
1367SAA005_S9b-FD1.b1.ab1_142     16S_DB:3    97.887   142     3         0        1       142   40      181   1.60e-67  248
1367SAA005_Tm203-FD1.b1.ab1_491   16S_DB:128  97.942   486     9         1        3       487   52      537   0.0       841
1367SAA005_VNP1-FD1.b1.ab1_578    16S_DB:435  94.965   576     28        1        3       578   31      605   0.0       902

5.1.11.1.2 generate the final (filtered) results table with highly confident classifications

In order to get a confident blastn-based classification of our problem sequences at the genus level, we should use only hits with pident >= 95% and a minimal alignment length >= 500 bp.

paste <(awk 'BEGIN{FS=OFS="\t"}NR > 1{print $1,$3,$4,$11,$12}' 16S_problema_blastN_maln1_m6.tab) <(grep '>' best-hit_sequences.fna | cut -d' ' -f2) | awk 'BEGIN{FS=OFS="\t"; print "sid\tpid\tlen\tEval\tscore\tbest_hit"}$2 >= 95 && $3 >= 500' > 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv

Display the table’s head and tail 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv

cat <(head 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv) <(tail 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv) | column -t

sid                                   pid      len   Eval  score  best_hit
13                                    99.485   1359  0.0   2475   Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928
18                                    99.033   1344  0.0   2416   Mycobacterium_smegmatis_T|ATCC_19420|AJ131761
2                                     99.552   1340  0.0   2440   Mycobacterium_peregrinum_T|CIP_105382T|AY457069
27                                    100.000  1342  0.0   2479   Mycobacterium_fortuitum_T|CIP_104534T|AY457066
36                                    99.705   1358  0.0   2490   Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928
4                                     99.335   1354  0.0   2446   Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270
48                                    98.375   1354  0.0   2390   Mycobacterium_gordonae_T|ATCC_14470.|X52923
53                                    99.925   1342  0.0   2473   Mycobacterium_neworleansense_T|ATCC_49404|AY457068
62                                    99.106   1342  0.0   2409   Mycobacterium_chitae_T|ATCC_19627|X55603
1367SAA005_S11Sm-FD1.b1.ab1_1049      98.763   1051  0.0   1866   Pseudomonas_monteilii_T|CIP_104883|AF064458
1367SAA005_S12-FD1.b1.ab1_899         98.210   894   0.0   1559   Microbacterium_testaceum_T|DSM_20166|X77445
1367SAA005_S15-FD1.b1.ab1_1111        98.115   1114  0.0   1936   Microbacterium_testaceum_T|DSM_20166|X77445
1367SAA005_S1CbTm-FD1.b1.ab1_683      99.227   647   0.0   1166   Pseudomonas_monteilii_T|CIP_104883|AF064458
1367SAA005_S1-FD1.b1.ab1_542          99.815   542   0.0   996    Pseudomonas_plecoglossicida_T|FPC951|AB009457
1367SAA005_S7CbTm-FD1.b1.ab1_961      98.954   956   0.0   1709   Citrobacter_freundii_T|DSM_30039|AJ233408
1367SAA005_S9-FD1.b1.ab1_1091         99.451   1092  0.0   1984   Cellulosimicrobium_funkei_T|W6122|AY501364
1367SAA005_VNC15-FD1.b1.ab1_1062      99.052   1055  0.0   1890   Rhizobium_alamii_T|type_strain:GBV016|AM931436
1367SAA005_VRP15_bis-FD1.b1.ab1_1124  99.822   1125  0.0   2065   Ensifer_terangae_T|type_strain:_LMG_7834|X68388
1367SAA005_VRP15-FD1.b1.ab1_1112      99.012   1113  0.0   1993   Ensifer_terangae_T|type_strain:_LMG_7834|X68388

The previous examples illustrated how to efficiently combine $Shell$ and $Linux$ tools to process BLAST output tables. With a bit of practice, you will soon master these power tools.

Note: We are using Linux commands to achieve a task that can be performed directly within BLAST, using a customized tabular output, as we will see in the final section on BLASTN “making-use-of-a-customized-tabular-output-with-selected-fields-for-a-slick-and-neat-blastn-based-classification-strategy-of-16s-rdna-sequences”, and explained in the blast[n|p] -help.

5.1.11.1.3 Generate summaries of the results

Finally, let us generate some simple summary statistics of the classified sequences.

# 6 How many sequences could be classified with high confidence?
sed '1d' 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv | wc -l # 42

# 7 How many sequences per genus could be classified with high confidence?
sed '1d' 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv | cut -f6 | sed 's/_.*$//' | sort | uniq -c | sort -nrk1

     14 Mycobacterium
      5 Rhizobium
      5 Ensifer
      4 Pseudomonas
      2 Microbacterium
      2 Escherichia/Shigella
      2 Beijerinckia
      1 Sphingomonas
      1 Sinorhizobium
      1 Shinella
      1 Isoptericola
      1 Citrobacter
      1 Cellulosimicrobium
      1 Brevundimonas
      1 Acinetobacter

5.1.12 Making use of a customized tabular output with selected fields for a slick and neat blastn-based classification strategy of 16S rDNA sequences

The strategy shown above works perfectly well but could be more succinct. However, it helped demonstrate interesting Bash idioms, like the process substitution trick we learned to avoid writing intermediate files to disk. This strategy, or similar, was required to add subject title information to the tabular output generated by the legacy blast programs run with -m 8 (tabular output). The current BLAST+ suite is much more powerful and versatile in multiple ways, including the impressive amount of information from HSPs that can be added to tabular outputs (-outfmt 6|7|10), as can be learned from the blast[npx] -help output.

We will use this feature to significantly simplify the 16S rDNA sequence classification strategy based on best blastn hits satisfying some minimal alignment criteria. In particular, we will use the stitle (subject title) column in the output, which contains the taxonomic information from the FASTA header. In summary, we will run a new blastn search including selected output fields to avoid retrieving the subject hit FASTA files with fastadbcmd, and grep out their header lines to paste them to the table.

To print a tabular output with custom fields use the following syntax: blastn -outfmt ‘6 keyword1 keyword2 …’.

5.1.12.1 Running blastn with a customized tabular outut

The following blastn call with customized tabular output, containing the desired ‘qseqid stitle pident length evalue score’ fields will do the trick. We will also add the query length ‘qlen’ and query coverage ‘qcovs’ fields as the final two columns. We filter the blastn output with AWK to print only HSPs with pident >= 95 && length > 500 and finally format the output with column -t for a tidy columnar display on the terminal. That’s it, slick and neat.

blastn -query 16S_problema.fna -db 16S_seqs4_blastDB.fnaed -outfmt '6 qseqid stitle pident length evalue score qlen qcovs' -num_alignments 1 | awk 'BEGIN{FS=OFS="\t"; print "seqid\tstitle\tpident\tlength\tevalue\tscore\tqlen\tqcovs"}$3 >= 95 && $4 >= 500' | column -t

seqid                                  stitle                                               pident   length  evalue  score  qlen  qcovs
13                                     Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928  99.485   1359    0.0     1340   1359  100
18                                     Mycobacterium_smegmatis_T|ATCC_19420|AJ131761        99.033   1344    0.0     1308   1347  99
2                                      Mycobacterium_peregrinum_T|CIP_105382T|AY457069      99.552   1340    0.0     1321   1343  99
27                                     Mycobacterium_fortuitum_T|CIP_104534T|AY457066       100.000  1342    0.0     1342   1342  100
36                                     Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928  99.705   1358    0.0     1348   1358  100
4                                      Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270     99.335   1354    0.0     1324   1354  99
48                                     Mycobacterium_gordonae_T|ATCC_14470.|X52923          98.375   1354    0.0     1294   1356  99
53                                     Mycobacterium_neworleansense_T|ATCC_49404|AY457068   99.925   1342    0.0     1339   1342  100
62                                     Mycobacterium_chitae_T|ATCC_19627|X55603             99.106   1342    0.0     1304   1342  100
7                                      Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270     98.736   1345    0.0     1292   1350  99
8                                      Mycobacterium_mantenii_T|NLA000401474|FJ042897       99.926   1351    0.0     1347   1350  100
85                                     Mycobacterium_avium_T|ATCC_49884|EF521891            98.968   1357    0.0     1313   1356  100
95                                     Mycobacterium_goodii_T|M069|Y12872                   99.702   1344    0.0     1334   1344  100
96                                     Mycobacterium_arupense_T|AR30097|DQ157760            99.705   1358    0.0     1345   1358  100
1367SAA005_2NXC111R-FD1.b1.ab1_579     Rhizobium_loessense_T|CCBAU_7190B|AF364069           97.396   576     0.0     530    579   99
1367SAA005_2NXC11R-FD1.b1.ab1_1153     Shinella_yambaruensis_T|MS4|AB285481                 96.766   1144    0.0     1031   1153  99
1367SAA005_2NXC14R-FD1.b1.ab1_578      Rhizobium_loessense_T|CCBAU_7190B|AF364069           97.751   578     0.0     538    578   100
1367SAA005_A17-FD1.b1.ab1_1072         Sphingomonas_aromaticivorans_T|DSM_12444|CP000248    97.577   1073    0.0     994    1072  100
1367SAA005_A1b-FD1.b1.ab1_1010         Pseudomonas_plecoglossicida_T|FPC951|AB009457        98.811   1009    0.0     972    1010  99
1367SAA005_A2CbTm-FD1.b1.ab1_908       Escherichia/Shigella_flexneri_T|X96963               99.008   907     0.0     880    908   99
1367SAA005_A5-FD1.b1.ab1_597           Isoptericola_variabilis_T|MX5|AJ298873               97.138   594     0.0     541    597   99
1367SAA005_A7-FD1.b1.ab1_763           Brevundimonas_nasdae_T|GTC1043|AB071954              98.949   761     0.0     737    763   99
1367SAA005_A7Sm-FD1.b1.ab1_896         Acinetobacter_baumannii_T|DSM_30007|X81660           96.763   865     0.0     780    896   97
1367SAA005_A9Cip-FD1.b1.ab1_579        Escherichia/Shigella_flexneri_T|X96963               99.481   578     0.0     569    579   99
1367SAA005_ACOP7-FD1.b1.ab1_1110       Rhizobium_hainanense_T|I66|U71078                    97.212   1112    0.0     1015   1110  100
1367SAA005_AYOT11_bis-FD1.b1.ab1_1118  Beijerinckia_fluminensis_T|UQM_1685|EU401907         98.749   1119    0.0     1076   1118  100
1367SAA005_AYOT11-FD1.b1.ab1_1069      Beijerinckia_fluminensis_T|UQM_1685|EU401907         99.147   1055    0.0     1027   1069  99
1367SAA005_CP6-FD1.b1.ab1_1117         Ensifer_saheli_T|LMG_7837|X68390                     98.743   1114    0.0     1070   1117  99
1367SAA005_CP7_bis-FD1.b1.ab1_819      Ensifer_mexicanus_T|ITTG-R7|DQ411930                 98.862   791     0.0     763    819   96
1367SAA005_CP7-FD1.b1.ab1_1035         Sinorhizobium_americanum_T|CFNEI_156|AF506513        98.551   1035    0.0     988    1035  99
1367SAA005_NXC126R-FD1.b1.ab1_1017     Ensifer_adhaerens_T|LMG_20216|AM181733               98.527   1018    0.0     972    1017  99
1367SAA005_PROO4-FD1.b1.ab1_539        Rhizobium_loessense_T|CCBAU_7190B|AF364069           97.407   540     0.0     493    539   100
1367SAA005_S11Sm-FD1.b1.ab1_1049       Pseudomonas_monteilii_T|CIP_104883|AF064458          98.763   1051    0.0     1010   1049  100
1367SAA005_S12-FD1.b1.ab1_899          Microbacterium_testaceum_T|DSM_20166|X77445          98.210   894     0.0     844    899   99
1367SAA005_S15-FD1.b1.ab1_1111         Microbacterium_testaceum_T|DSM_20166|X77445          98.115   1114    0.0     1048   1111  100
1367SAA005_S1CbTm-FD1.b1.ab1_683       Pseudomonas_monteilii_T|CIP_104883|AF064458          99.227   647     0.0     631    683   95
1367SAA005_S1-FD1.b1.ab1_542           Pseudomonas_plecoglossicida_T|FPC951|AB009457        99.815   542     0.0     539    542   100
1367SAA005_S7CbTm-FD1.b1.ab1_961       Citrobacter_freundii_T|DSM_30039|AJ233408            98.954   956     0.0     925    961   99
1367SAA005_S9-FD1.b1.ab1_1091          Cellulosimicrobium_funkei_T|W6122|AY501364           99.451   1092    0.0     1074   1091  99
1367SAA005_VNC15-FD1.b1.ab1_1062       Rhizobium_alamii_T|type_strain:GBV016|AM931436       99.052   1055    0.0     1023   1062  99
1367SAA005_VRP15_bis-FD1.b1.ab1_1124   Ensifer_terangae_T|type_strain:_LMG_7834|X68388      99.822   1125    0.0     1118   1124  100
1367SAA005_VRP15-FD1.b1.ab1_1112       Ensifer_terangae_T|type_strain:_LMG_7834|X68388      99.012   1113    0.0     1079   1112  100

This final blastn example demonstrates the importance of knowing our tools well, meaning that we should always take the time to carefully read the software’s documentation to understand and exploit all its key features.

5.1.13 HOMEWORK 1

Repeat the blastn-based classification exercise based on 16S rDNA sequences using 10 hits per query. Impose the same filters applied in the previous exercise.

Are the results of both blastn runs (single best-hit vs. 10 hits) consistent?
How many sequences from the problem dataset remain unclassified, and what is the reason they were not classified?
Are there sequences that remain unclassified that have a pident >= 90% && < 95%?
- If so, how many?
- Based on the evidence gathered from the 10 hits/query, how would you classify these sequences?

5.1.14 HOMEWORK 2

Repeat the blastn-based classification exercise based on 16S rDNA sequences using the NCBI nucletotide BLAST portal. Read the BLAST-help documentation if you need assistance.

Use the new core_nt default database, and retrieve only 10 hits per query.
Repeat the search using the refseq_select RNA database
Generate two tables with the classification results of our problem sequences based on the best hits found in the core_nt and refseq_select databases. Compare them with each other and the results obtained with our small local database of reference 16S sequences.

6 BLAST+ with protein queries against a protein database (BLASTP)

BLASTP is a fundamental program in genomic and phylogenomic research, as it is one of the main tools used to:

search for homologs (orthologs and paralogs) of query proteins across or within genomes for downstream evolutionary/phylogenetic/comparative analyses
annotate the proteomes of newly-sequenced genomes, typically using a reference database of well-annotated proteins, such as SwissProt, to transfer the annotations of the best hit to the query

6.1 Prepare working directory

extract the sequence data from the compressed tar file

# Move into the UniProt_proteomes dir and extract the sequences from the provided tgz file
cd $wkdir && cd UniProt_proteomes 
tar -xvzf UniProt_proteomes.tgz

6.2 EXERCISE 2: identification of paralogs in large multigene families of eukaryotes - searching for small Rab GTPases in the proteome of Acanthamoeba castellanii strain Neff

The family of small Rab GTPases belongs to the Ras GTPase superfamily, which comprise over 150 human members, with evolutionary conserved orthologs found throughout Eukaryota. The superfamily is divided into five major branches on the basis of sequence and functional similarities: Ras, Rho, Rab, Ran and Arf.

Small GTPases share a common biochemical mechanism and act as binary molecular switches, being active when GTP is bound and inactive when GDP is bound to the protein. Ras family proteins function as monomeric G proteins. Variations in structure and post-translational modifications that dictate specific subcellular locations and the proteins that serve as their regulators and effectors allow these small GTPases to function as sophisticated modulators of a remarkably complex and diverse range of cellular processes.

Rab GTPases are critical in membrane trafficking and insert into specific membrane compartments via their C-terminal geranylgeranyl lipidic tails, serving as organelle identity posts. In their GTP-bound (active) form, Rabs recruit multiple and specific effector proteins, which interact with membrane tethering (e.g., HOPS) and fusion factors (SNAREs), thereby serving a critical role in controlling the specific docking and fusion of membrane-bound compartments along the endocytic, recycling, exocytic, phagocytic and autophagic pathways.

In our research group at CCG-UNAM we use evolutionary, comparative, and functional genomics approaches to study Acanthamoeba castellanii and Stenotrophomonas spp. to identify essential loci and the molecular mechanisms that underlie the origins of complex phenotypes such as virulence, focusing on the cellular microbiology of Acanthamoeba-Stenotrophomonas interactions. We have recently found that Stenotrophomonas maltophilia, a globally emerging multidrug-resistant opportunistic pathogen, can use Acanthamoeba trophozoites as an alternative replication niche.

Using a combination of bioinformatics, molecular, and cell-biological approaches, we discovered that S. maltophilia is resistant to digestion by A. castellanii, a free-living amoeba that we use in our lab as a professional phagocyte model (Rivera et al. (2024). More specifically, we found that:

Many Rab proteins in the A. castellanii reference (RefSeq) proteome are mis-annotated, including errors in their predicted gene structures, as shown in the following figure for the highly conserved Rab7 protein:

Figure from: Rivera et al. (2024). Phylogenomic, structural, and cell-biological analyses reveal that Stenotrophomonas maltophilia replicates in acidified Rab7A-positive vacuoles of Acanthamoeba castellanii. Microbiology Spectrum, 2024 Mar 5;12(3):e0298823.. doi: 10.1128/spectrum.02988-23

Stenotrophomonas maltophilia bacteria replicate withing acidified, Rab7A-positive vacuoles (phagosomes) of Acanthamoeba castellanii trophozoites, as shown below by live-cell imaging of amoebal trophozoites with phagocytosed fluorescent dextran, a standard fluid-phase endocytic tracer, and living S. maltophilia cells stained with the pH-sensitive pHrodo red dye:

In the following exercise, we will use $blastp$ to study the large family of Rab GTPases encoded by the Acanthamoeba castellanii genome, using the well-annotated and canonical RAB7A_HUMAN protein as the query.

6.2.1 Format the Acanthamoeba castellanii reference proteome fetched from UNIPROT

UniProt is the world’s leading high-quality, comprehensive, and freely accessible protein sequence and functional information resource. Explore this extraordinary resource and read the following paper from the UniProt Consortium - UniProt: the Universal Protein Knowledgebase in 2023.

Let’s explore the source FASTA files starting with the A. castellanii Neff proteome (ACACA.faa), which will be used to generate the BLASTDB

$ grep '>' ACACA.faa | head

>tr|L8GC88|L8GC88_ACACA Protein kinase domain/MORN repeatcontaining protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_179960 PE=4 SV=1
>tr|L8GCQ0|L8GCQ0_ACACA Uncharacterized protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_180790 PE=4 SV=1
>tr|L8GD08|L8GD08_ACACA Uncharacterized protein (Fragment) OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_338290 PE=4 SV=1
>tr|L8GE04|L8GE04_ACACA Uncharacterized protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_102890 PE=4 SV=1
>tr|L8GE21|L8GE21_ACACA CUE domain containing protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_191030 PE=4 SV=1
>tr|L8GEJ3|L8GEJ3_ACACA Uncharacterized protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_351910 PE=4 SV=1
>tr|L8GEQ1|L8GEQ1_ACACA Transcription initiation factor iid, 31kd subunit protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_389090 PE=3 SV=1
>tr|L8GFB4|L8GFB4_ACACA Uncharacterized protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_121250 PE=4 SV=1
>tr|L8GG36|L8GG36_ACACA RapGAP/RanGAP domain containing protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_387100 PE=4 SV=1
>tr|L8GGJ9|L8GGJ9_ACACA Ubiquitin-specific protease, putative OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_190280 PE=4 SV=1

Note that the FASTA headers have a canonical structure and do not require any further editing before formatting with $makeblastdb$.

makeblastdb -in ACACA.faa -dbtype prot -parse_seqids

ls -1

ACACA.faa
ACACA.faa.pdb
ACACA.faa.phr
ACACA.faa.pin
ACACA.faa.pog
ACACA.faa.pos
ACACA.faa.pot
ACACA.faa.psq
ACACA.faa.ptf
ACACA.faa.pto
DICDI.faa

6.2.2 Default blastp run, and anatomy of its output

blastp requires only two arguments to run: the -query and the -db , running under default parameters.

blastp -query RAB7A_HUMAN.faa -db ACACA.faa | less

which produces the following (abbreviated/truncated) output, with the following sections:

header
one-line hit descriptions
alignments
footer

BLASTP 2.14.1+

Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.
Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.
Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs", Nucleic Acids Res. 25:3389-3402.


Reference for composition-based statistics: Alejandro A. Schaffer,
L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri
I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001),
"Improving the accuracy of PSI-BLAST protein database searches with
composition-based statistics and other refinements", Nucleic Acids
Res. 29:2994-3005.


Database: ACACA.faa
           14,938 sequences; 6,560,263 total letters


Query= sp|P51149|RAB7A_HUMAN Ras-related protein Rab-7a OS=Homo sapiens
OX=9606 GN=RAB7A PE=1 SV=1

Length=207
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

L8HHF3 Ras-related protein Rab-7A OS=Acanthamoeba castellanii str...  293     2e-102
L8GPH2 Rab7/RabGfamily small GTPase OS=Acanthamoeba castellanii s...  249     4e-85 
L8GXM9 GTPbinding protein YPTC5, putative OS=Acanthamoeba castell...  190     7e-62 
L8GRY8 Rab7/RabGfamily small GTPase OS=Acanthamoeba castellanii s...  180     4e-58 
L8GLL4 RAB family member (Rab7), putative OS=Acanthamoeba castell...  166     2e-52 
L8HGU8 Ras family protein OS=Acanthamoeba castellanii str. Neff O...  154     6e-48 
L8HEH5 Rab7, putative OS=Acanthamoeba castellanii str. Neff OX=12...  149     8e-46 
L8H231 GTPbinding protein OS=Acanthamoeba castellanii str. Neff O...  142     2e-43 
L8HDY4 Ras subfamily protein OS=Acanthamoeba castellanii str. Nef...  137     4e-41 
L8GSP6 Ras-related protein Rab-2A, putative OS=Acanthamoeba caste...  134     7e-40 
L8GKG8 Rasrelated protein Rab5, putative OS=Acanthamoeba castella...  133     2e-39 
L8H896 RAB1B, member RAS oncogene family OS=Acanthamoeba castella...  132     6e-39
... TRUNCATED
L8HKA3 HEAT repeat domain containing protein OS=Acanthamoeba cast...  26.6    7.0   
L8GYP1 Ras subfamily protein OS=Acanthamoeba castellanii str. Nef...  26.2    7.3   
L8GLL9 WD domain, G-beta repeat-containing protein OS=Acanthamoeb...  26.6    7.5   
L8H6G5 Uncharacterized protein OS=Acanthamoeba castellanii str. N...  26.2    8.7   
L8HK33 GTPase OS=Acanthamoeba castellanii str. Neff OX=1257118 GN...  26.2    9.5   
L8H1F4 Myosin VIIa, putative OS=Acanthamoeba castellanii str. Nef...  26.2    9.6   
L8GRW1 Rab-like protein 6 OS=Acanthamoeba castellanii str. Neff O...  26.2    9.7   
L8HDD1 Cobalamin synthesis protein/P47K OS=Acanthamoeba castellan...  26.2    9.9   
L8H2D9 ABC transporter, ATPbinding domain containing protein OS=A...  26.2    9.9   


>L8HHF3 Ras-related protein Rab-7A OS=Acanthamoeba castellanii str. Neff 
OX=1257118 GN=ACA1_074580 PE=4 SV=1
Length=186

 Score = 293 bits (749),  Expect = 2e-102, Method: Compositional matrix adjust.
 Identities = 149/208 (72%), Positives = 166/208 (80%), Gaps = 23/208 (11%)

Query  1    MTSRKKVLLKVIILGDSGVGKTSLMNQYVNKKFSNQYKATIGADFLTKEVMVDDRLVTMQ  60
            M++RKKVLLK+IILGDSGVGKTSLMNQYVNKKFSNQYKATIGADFLTKEVMVDD+LVT+Q
Sbjct  1    MSTRKKVLLKIIILGDSGVGKTSLMNQYVNKKFSNQYKATIGADFLTKEVMVDDKLVTLQ  60

Query  61   IWDTAGQERFQSLGVAFYRGADCCVLVFDVTAPNTFKTLDSWRDEFLIQASPRDPENFPF  120
            IWDTAGQERFQSLGVAFYRGAD CVLVFDV                   A PRDPE FPF
Sbjct  61   IWDTAGQERFQSLGVAFYRGADACVLVFDVN------------------AGPRDPETFPF  102

Query  121  VVLGNKIDLE-NRQVATKRAQAWCYSKNNIPYFETSAKEAINVEQAFQTIARNALKQETE  179
            +VLGNKIDLE +R V+ KRAQ WC SK NIPYFETSAKEAINVEQAFQTIA+NA+K+E +
Sbjct  103  IVLGNKIDLESSRVVSQKRAQTWCQSKGNIPYFETSAKEAINVEQAFQTIAKNAMKEEED  162

Query  180  VELYNEFPEPIKLDKNDRAKASAESCSC  207
            V+L     + I+LDK++    S   CSC
Sbjct  163  VDLAG-ITDAIQLDKSE---PSQSGCSC  186

... TRUNCATED

>L8GVA9 Fbox domain containing protein OS=Acanthamoeba castellanii str. 
Neff OX=1257118 GN=ACA1_042330 PE=4 SV=1
Length=226

 Score = 27.7 bits (60),  Expect = 2.3, Method: Compositional matrix adjust.
 Identities = 23/100 (23%), Positives = 40/100 (40%), Gaps = 15/100 (15%)

Query  11   VIILGDSGVGKTSL-------MNQYVNKKFSNQYKAT----IGADFLTKEVMVDDRLVTM  59
            +++LGD GVGKT+          +   KK   ++        G   LT         V +
Sbjct  114  ILVLGDVGVGKTAYAERCLAGAQESTTKKKKGRFDMVALRGTGKHHLTLPTNYGP--VNI  171

Query  60   QIWDTAGQERFQSLGV--AFYRGADCCVLVFDVTAPNTFK  97
             +WD     R  S GV      G    +++FD+ +  +++
Sbjct  172  TLWDQPATSREISGGVRDELLAGGQAAIIMFDIASLQSYR  211



Lambda      K        H        a         alpha
   0.318    0.132    0.386    0.792     4.96 

Gapped
Lambda      K        H        a         alpha    sigma
   0.267   0.0410    0.140     1.90     42.6     43.6 

Effective search space used: 603304980


  Database: ACACA.faa
    Posted date:  Oct 7, 2023  9:42 AM
  Number of letters in database: 6,560,263
  Number of sequences in database:  14,938


Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Neighboring words threshold: 11
Window for multiple hits: 40

Note: The header and footer sections contain essential information (m, n, k, $\lambda$) required to compute the E-value (see Karlin-Altschul equation $E = kmne^{-{\lambda}S}$, as detailed in a previous section).

Remember, this critical information can be easily extracted from the standard BLAST output using, for example, an AWK one-liner, such as the following one:

blastp -query RAB7A_HUMAN.faa -db ACACA.faa | awk '/Database/, /total letters/; /^Lambda/, /Extension:/{seen++; print}{ if(seen>5) exit}'

Database: ACACA.faa
           14,938 sequences; 6,560,263 total letters
Lambda      K        H        a         alpha
   0.318    0.132    0.386    0.792     4.96 

Gapped
Lambda      K        H        a         alpha    sigma
   0.267   0.0410    0.140     1.90     42.6     43.6

6.2.2.1 Optional exercise 1

Compute the P-value associated with the last alignment shown on the output above using R or a calculator.

6.2.3 Deep $BLASTP$ search to find A. castellanii Ras-superfamily members, including Rabs, using RAB7A_HUMAN as the query protein and a customized oputput table with selected fields

The aim of this exercise is to identify A. castellanii Ras-superfamily members, using RAB7A_HUMAN as the query. Given the huge evolutionary distance separating humans from Amoebozoa (probably ~ 1.2 billion yrs), we are going to use the BLOSUM45 matrix to increase sensitivity.

Fig. 1 from Brocks et. al 2023. Nature 2023 Jun;618(7966):767-773. doi: 10.1038/s41586-023-06170-w.

Furthermore, since the Ras superfamily is expected to contain many members, possibly several hundred, we will request 300 hits using custom output fields in tabular format.

blastp -matrix BLOSUM45 -query RAB7A_HUMAN.faa -db ACACA.faa -outfmt '6 qseqid qlen sseqid slen stitle length pident gaps evalue bitscore qcovs' -num_alignments 300 > RAB7A_HUMAN_vs_ACACA_300hits.out

lets find out how many hits we got

wc -l RAB7A_HUMAN_vs_ACACA_300hits.out

6.2.3.1 Explore the distribution of key alignment statistics using the col_sumStats.sh script

There are many (244) rows to explore. It is convenient to get simple summary statistics for key alignment parameters, such as: - E-value (column 9) - bit_score (column 10) - qcovs (column 11) - slen (column 4) - pident (column 7)

Copy the following code to a file named col_sumStats.sh and move it to your $HOME/bin directory

#!/usr/bin/env bash
# AUTHOR: Pablo Vinuesa. CCG-UNAM. @pvinmex
# little awk script to compute basic summary statistics
#  receives the output of cut for a numeric column in a table 
sort -n | awk '
BEGIN{ max = min = Mode = 'NaN' }
{     
     sum+=$1
     a[x++]=$1
     b[$1]++
     if(b[$1]>hf){hf=b[$1]}
}
NR == 1 { min=$1; max=$1 }
$1 < min { min=  $1 }
$1 > max { max=  $1 }

END{ n = asort(a);idx=int((x+1)/2)
    print "Min: " min
    print "Mean: " sum/x 
    print "Median: " ((idx==(x+1)/2) ? a[idx] : (a[idx]+a[idx+1])/2) 
    for (i in b){if(b[i]==hf){(k=="") ? (k=i):(k=k FS i)}{FS=","}}
    print "Mode: " k
    print "Max: " max
}
'

Make the file executable and move it to your ~/$HOME/bin directory to be able to use it as a system-wide command/program.

chmod 755 col_sumStats.sh && [ -d ~/$HOME/bin ] && mv col_sumStats.sh ~/$HOME/bin

Alternatively, you may fetch it with the following command from my GitHub repo directly into your $HOME/bin directory

# make sure that we have a bin directory in our home and cd into it
[ ! -d ~/bin ] && mkdir ~/bin
[ -d ~/bin ] && cd ~/bin

wget -c https://raw.githubusercontent.com/vinuesa/TIB-filoinfo/master/col_sumStats.sh
chmod 755 col_sumStats.sh

# takes you back again to your working directory
cd -

Now run the following lines of code:

echo "# E-value"
cut -f 9 RAB7A_HUMAN_vs_ACACA_300hits.out | col_sumStats.sh
echo
echo "# bit score"
cut -f 10 RAB7A_HUMAN_vs_ACACA_300hits.out | col_sumStats.sh
echo
echo "# qcovs"
cut -f 11 RAB7A_HUMAN_vs_ACACA_300hits.out | col_sumStats.sh
echo
echo "# pident"
cut -f 7 RAB7A_HUMAN_vs_ACACA_300hits.out | col_sumStats.sh
echo
echo "# slen"
cut -f 4 RAB7A_HUMAN_vs_ACACA_300hits.out | col_sumStats.sh

# E-value
Min: 6.25e-103
Mean: 0.250727
Median: 9.805e-15
Mode: 0.001
Max: 6.9

# bit score
Min: 26.3
Mean: 72.9398
Median: 68.85
Mode: 27.5
Max: 292

# qcovs
Min: 9
Mean: 68.6885
Median: 77
Mode: 78
Max: 100

# pident
Min: 20.000
Mean: 31.7386
Median: 30.5675
Mode: 31.452
Max: 71.635

# slen
Min: 94
Mean: 440.131
Median: 217.5
Mode: 193,195
Max: 2812

From the output above, we can conclude that there are many hits with poor E-values (>= 0.001) and scores (< 30). There is also great variation in the subject sequence length (slen) and query coverage (qcovs).

NOTE: if you did not place the col_sumStats.sh script it to your $HOME/bin directory, or any other one in the $PATH, you will have to execute the commands as follows:

echo "# E-value"
cut -f 9 RAB7A_HUMAN_vs_ACACA_300hits.out | ./col_sumStats.sh
...

6.2.3.2 Explore the alignment length distribution using a simple tabular plus graphical representation on the terminal and in $R$

The following AWK one-liner will print the the sequence length frequencies for those values found >=2 times, and provide a raw barplot for visual inspection.

awk -F"\t" '{l=$4; a[l]++; if (a[l]>max) max=a[l]} END {printf("Length\tFrequency"); for (i in a) if(a[i] >= 2){printf("\n%s\t%s\t",i,a[i]); for (j=0;j<(int(a[i]*(100/max)));j++) printf("*")} print ""}' RAB7A_HUMAN_vs_ACACA_300hits.out

Length  Frequency
178 2   *********************************
179 3   **************************************************
180 3   **************************************************
181 4   ******************************************************************
184 5   ***********************************************************************************
186 5   ***********************************************************************************
188 3   **************************************************
190 4   ******************************************************************
193 6   ****************************************************************************************************
194 2   *********************************
195 6   ****************************************************************************************************
196 3   **************************************************
197 3   **************************************************
198 3   **************************************************
199 4   ******************************************************************
200 3   **************************************************
202 2   *********************************
203 3   **************************************************
204 3   **************************************************
205 2   *********************************
206 5   ***********************************************************************************
207 2   *********************************
208 3   **************************************************
209 2   *********************************
210 3   **************************************************
211 3   **************************************************
213 3   **************************************************
216 5   ***********************************************************************************
217 2   *********************************
218 2   *********************************
220 2   *********************************
221 3   **************************************************
223 2   *********************************
228 4   ******************************************************************
244 3   **************************************************
264 2   *********************************
266 2   *********************************
273 2   *********************************
421 2   *********************************
514 2   *********************************
558 2   *********************************
578 2   *********************************
988 2   *********************************

From the output above, it seems that most homologs fall in the length range between 178 and 250 residues.

6.2.4 Using base $R$ to explore BLAST tabular output

R is ‘GNU S’, a freely available language and environment for statistical computing and graphics that provides various statistical and graphical techniques: linear and nonlinear modeling, statistical tests, time series analysis, classification, clustering, etc.

CRAN is a network of FTP and web servers around the world that store identical, up-to-date versions of code and documentation for R. Currently, over 20000 contributed packages can be freely downloaded to boost your R coding and analysis capacity to unlimited heights.

R is currently the mainstream language and environment for statistical computing and graphics used by statisticians, data scientists, and researchers. Here we will be using only straightforward base R to make it easy for you to run the simple code examples.

6.2.4.1 A better histogram, including a density plot, can be easily generated with base $R$ using the following code:

# call R
R

# read data
dfr <- read.csv(file = "RAB7A_HUMAN_vs_ACACA_300hits.out", sep = "\t", header=FALSE)

# explore the structure of the object dfr
str(dfr)

# plot ACACA blastp subject length density
hist(dfr$V4, breaks = "Freedman-Diaconis", xlim = c(0, 3000), probability=TRUE, xlab="length", main="ACACA blastp subject length density")
abline(v = median(dfr$V4), col = "blue", lwd = 3)
abline(v = mean(dfr$V4), col = "red", lwd = 3)
lines(density(dfr$V4), col = 'green', lwd = 3)

dev.off()

which generates the following graph:

Figure 4. Histogram and density plot of RAB7A_HUMAN blastp hit lengths in the A. castellanii proteome. The median and mean lengths are highlighted as blue and red vertical bars, respectively.

6.2.4.2 Distribution of E-values as a function of bitscores (`eval ~ score`)

It is interesting to analyze the distribution of E-values as a function of bit scores (eval ~ score) to find the minimal alignment scores required for a specific E-value cutoff, like 0.001. This can be easily achieved calling base R’s plot() function.

# Another interesting plot is that of E-values ~ bit-scores
plot(dfr$V9 ~ dfr$V10, main ="E-values ~ bitscores", ylab="E-values", xlab="bitscores")

dev.off()

# let's zoom-in by sub-setting the dfr for E-values <= 0.2 and scores <= 50, making use of the subset() function
dfr2 <- subset(dfr, V9 <= 0.2 & V10 < 50)
plot(dfr2$V9 ~ dfr2$V10, main ="E-values ~ bitscores", ylab="E-values", xlab="bitscores")
abline(h=0.001, col="red")

dev.off()

which generates the following graphs:

Figure 5. E-values as a function of scores

Figure 6. E-value ~ bitscores (zoom-in)

The latter figure shows that E-values remain below 0.001 for scores > ~38, which may be likely Ras superfamily homologs.

Exercises:
- Write a simple AWK one-liner to verify this.
- Write R code to identify the minimal score associated with an E-value <= 0.001.

6.2.4.3 Distribution of E-values as a function of P-values (eval ~ p-value) - the `e2p()` function

Using R and the “RAB7A_HUMAN_vs_ACACA_300hits.out” blastp output generated in this section, we will add a column to the data frame with the computed P-values for each E-value using the user-defined function e2p(). Then, we will generate a simple scatter plot of E-values vs. P-values (Eval ~ Pval).

# 3. >>> Analyze the relation of E-values ~ P-values <<<
# 3.1  define the e2p function to compute the P-values from provided E-values in the x vector
# P-val=1-exp(-E)
e2p <- function(x) {   
  # compute the P-values from provided E-values in the x vector
  # using the formula: P-val=1-exp(-E)
  1-exp(-x)
}

# add a new variable V12 (P-val) to the dataframe
# which will hold the P-values computed by e2p 
# from the E-values stored in V9
dfr$V12<- e2p(dfr$V9)  

# zoom-in by sub-setting the dfr for E-values <= 0.001 and P-values <= 0.001 
# making use of the subset() function.
# The following analysis 
dfr2 <- subset(dfr, V9 <= 0.001 & V12 < 0.001)
c <- round(cor(dfr2$V9, dfr2$V12),3)
header <- paste0("E-values ~ P-values; ", "corr: ", c)
plot(dfr2$V9 ~ dfr2$V12, main ="E-values ~ P-values", ylab="E-values", xlab="P-values")
fit <- lm(dfr2$V9 ~ dfr2$V12)
abline(fit)

dev.off()

summary(fit)

# plot E-values ~ P-values
plot(dfr$V9 ~ dfr$V12, main ="E-values ~ P-values", ylab="E-values", xlab="P-values")

dev.off()

# plot two plots in one row
par(mfrow=c(1,2))
plot(dfr2$V9 ~ dfr2$V12, main ="E-values ~ P-values zoom-in; corr = 1", ylab="E-values", xlab="P-values")
abline(fit)
plot(dfr$V9 ~ dfr$V12, main ="E-values ~ P-values", ylab="E-values", xlab="P-values")
abline(fit)

dev.off()

par(opar)

# save to file
png(filename = "Eval_vs_Pval_with_zoomin_1x2.png", width = 960, height = 480)
par(mfrow=c(1,2))
plot(dfr2$V9 ~ dfr2$V12, main ="E-values ~ P-values zoom-in", ylab="E-values", xlab="P-values")
abline(fit)
plot(dfr$V9 ~ dfr$V12, main ="E-values ~ P-values", ylab="E-values", xlab="P-values")
abline(fit)
dev.off()
par(opar)

# quit
q(save="no")

The following is the output of the summary() function for the fitted linear model summary(fit) for the subsetted dataset, corroborating our previous statement that at E-values < 0.01 the P-values are essentially equal to E-values.

Call:
lm(formula = dfr2$V9 ~ dfr2$V12)

Residuals:
       Min         1Q     Median         3Q        Max 
-9.492e-08  2.572e-09  2.572e-09  2.572e-09  5.861e-08 

Coefficients:
              Estimate Std. Error    t value Pr(>|t|)    
(Intercept) -2.572e-09  1.235e-09     -2.082   0.0386 *  
dfr2$V12     1.000e+00  9.285e-06 107747.141   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.735e-08 on 203 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 1.161e+10 on 1 and 203 DF,  p-value: < 2.2e-16

The following graphs show the scatter plots of E-value ~ P-value of the subsetted data frame (Eval < 0.01 && P-val < 0.01) and the full data frame values, with the fitted linear model on the subsetted data added to both graphs.

6.2.5 Filter likely canonical Rab GTPase hits based on key alignment parameters

From the previous exploratory data analyses, we can conclude that the hits for likely canonical small Rab-GTPase family homologs should meet the following (relaxed) attributes:

- slen >= 178 && slen < 270
- E-value < 1e-10
- bit-score > 70
- qcovs > 72
- pident > 28

Filtering the table according to those criteria yields 80 hits:

awk 'BEGIN{FS=OFS="\t"; print "sseqid\tslen\tpident\tgaps\tevalue\tbitscore\tqcovs"} ($4 > 178 && $4 < 270) && $7 >= 28 && $9 < 1e-3 && $10 > 70 && $11 > 72 {print $3,$4,$7,$8,$9,$10,$11}' RAB7A_HUMAN_vs_ACACA_300hits.out | column -t -s $'\t'

sseqid                  slen  pident  gaps  evalue     bitscore  qcovs
tr|L8HHF3|L8HHF3_ACACA  186   71.635  23    6.25e-103  292       100
tr|L8GPH2|L8GPH2_ACACA  205   58.537  2     1.20e-86   251       99
tr|L8GXM9|L8GXM9_ACACA  197   60.736  2     5.09e-64   193       78
tr|L8GRY8|L8GRY8_ACACA  186   47.343  28    1.02e-58   179       98
tr|L8GLL4|L8GLL4_ACACA  206   50.000  4     4.58e-54   168       78
... TRUNCATED
tr|L8GLE5|L8GLE5_ACACA  193   28.302  3     9.76e-18   74.2      77
tr|L8H8M5|L8H8M5_ACACA  228   28.986  27    1.72e-17   74.2      90
tr|L8GIH3|L8GIH3_ACACA  198   29.944  11    3.36e-17   73.0      86
tr|L8GIM0|L8GIM0_ACACA  203   34.177  7     9.07e-17   71.9      75
tr|L8HCN9|L8HCN9_ACACA  194   31.447  32    1.04e-16   71.6      77

To further validate the selected hits as canonical small Rab-GTPases, these should be aligned and subjected to phylogenetic analysis along with reference sequences. We will perform this analysis in later sections of the course.

6.3 EXERCISE 3: Identify HUMAN vs. Acanthamoeba reciprocal best hits (RBHs) in the Rab-GTPase multigene family

The previous exercise identified a set of Rab-GTPase homologs from the A. castellanii proteome. Our next aim is to identify reciprocal best-hits or RBHs between a set of human reference Rabs retrieved from UNIPROT and saved in file HUMAN_63Rabs_Ran.faa and the A. castellanii proteome. The resulting RBHs are likely orthologs; therefore, we could transfer the annotation of the HUMAN Rabs to their Acanthamoeba RBHs.

6.3.1 Overview of the bioinformatic strategy to find RBHs between the A. castellanii proteome and the HUMAN_63Rabs_Ran.faa reference proteins

The bioinformatics strategy to find RBHs:

run $blastp$ against the A. castellanii proteome database using HUMAN_63Rabs_Ran.faa as query with the BLOSUM45 matrix
retrieve the A. castellanii proteome database hits using $blastdbcmd$
make a blastdb for HUMAN_63Rabs_Ran.faa and query it with the retrieved A. castellanii proteome database hits
retrieve a non-redundant list of the ACACA IDs from the second blastp run and use it to grep out the matching lines from the first blastp run, sorting it by increasing E-values
filter out unique HUMAN vs. ACACA RBHs using AWK hashes from the sorted blast output table

6.3.1.1 Prepare a working directory for the RBH searches

Use symlinks to avoid unnecessary file copying to preserve precious disk space whenever possible.

mkdir RBHs
cd RBHs/
  
# make symlinks whenever possible, to avoid unnecessary file copying to avoid filling the server's disks
ln -s ../ACACA.faa .
ln -s ../ACACA.faa.* .
ln -s ../HUMAN_63Rabs_Ran.faa .

6.3.1.2 Run $blastp$ against the A. castellanii proteome database using HUMAN_63Rabs_Ran.faa as the query with the BLOSUM45 matrix

Given the huge evolutionary distance between human and amoebae (> 1 billion yrs), we may use the BLOSUM45 matrix to maximize sensitivity.
We will use a customized tabular output with the following fields: ‘qseqid sseqid pident gaps length qlen slen qcovs evalue score’, saved in a variable

# set BLAST and filtering variables
mat=BLOSUM45
cols='6 qseqid sseqid pident gaps length qlen slen qcovs evalue score'
min_qcov=80

blastp -query HUMAN_63Rabs_Ran.faa -db ACACA.faa -matrix "$mat" -outfmt "$cols" -num_alignments 1 > HUMAN_63Rabs_vs_ACACA_besthits_m6.out

6.3.1.3 Retrieve the A. castellanii proteome database hits using $blastdbcmd$ only if qcov > 80.

Note the following syntactic details:

use of Bash’s process substitution idiom <(your code here)
awk -F”\t” -v qcov=$min_qcov ‘$8 >= qcov{print $2}’ HUMAN_63Rabs_vs_ACACA_besthits_m6.out to filter out Acanthamoeba hits with qcov >=$min_qcov.

blastdbcmd -entry_batch <(awk -F"\t" -v qcov=$min_qcov '$8 > qcov{print $2}' HUMAN_63Rabs_vs_ACACA_besthits_m6.out) -db ACACA.faa > ACACA_vs_HUMAN_best_1hits.faa

6.3.1.4 Make a blastdb for HUMAN_63Rabs_Ran.faa, and query it with the retrieved A. castellanii proteome database hits

generate the HUMAN_63Rabs_Ran.faa BLASTdb
run blastp with the filtered Acanthamoeba hits from the first BLAST as queries against the HUMAN_63Rabs_Ran.faa BLASTdb, retrieving only the best hit

makeblastdb -in HUMAN_63Rabs_Ran.faa -dbtype prot -parse_seqids
blastp -query ACACA_vs_HUMAN_best_1hits.faa -db HUMAN_63Rabs_Ran.faa -matrix "$mat" -outfmt "$cols" -num_alignments 1 > ACACA_vs_HUMAN_best_1hit_m6.out

6.3.1.5 Retrieve a non-redundant list of the ACACA IDs from the second blastp run and use it to grep out the matching lines from the first blastp run, sorting it by increasing E-values

# The following lines extract a sorted list (n=54) of unique Acanthamoeba identifiers 
#   found in the ACA_vs_HUMAN and HUMAN_vs_ACA searches
cut -f1 ACACA_vs_HUMAN_best_1hit_m6.out | sort | uniq -c
cut -f1 ACACA_vs_HUMAN_best_1hit_m6.out | sort -u

# This code greps out the sorted and unique ACACA hits from the ACACA_vs_HUMAN_best_1hit_m6.out table
while read ACAid; do grep -w "$ACAid" HUMAN_63Rabs_vs_ACACA_besthits_m6.out; done < <(cut -f1 ACACA_vs_HUMAN_best_1hit_m6.out | sort -u)

# Same as previous line, but sorting the BLAST output table by increasing E-values (in column 9; sort -gk9,9)
#  output displayed below
while read ACAid; do grep -w "$ACAid" HUMAN_63Rabs_vs_ACACA_besthits_m6.out; done < <(cut -f1 ACACA_vs_HUMAN_best_1hit_m6.out | sort -u) | sort -gk9,9 | column -t
while read ACAid; do grep -w "$ACAid" HUMAN_63Rabs_vs_ACACA_besthits_m6.out; done < <(cut -f1 ACACA_vs_HUMAN_best_1hit_m6.out | sort -u) | sort -gk9,9 > HUMAN_63Rabs_vs_ACACA_besthits_m6.out.sorted

Sorted output table generated by last line of code, by increasing E-values. Note the presence of repeated Acanthamoeba identifiers (2cnd col)

sp|P62826|RAN_HUMAN    tr|L8GF16|L8GF16_ACACA  82.524  0   206  216  213  95   7.64e-129  1259
sp|P61106|RAB14_HUMAN  tr|L8H946|L8H946_ACACA  73.953  2   215  215  213  100  2.88e-117  1154
sp|P62491|RB11A_HUMAN  tr|L8HJ43|L8HJ43_ACACA  75.829  0   211  216  213  98   9.90e-117  1150
sp|Q15907|RB11B_HUMAN  tr|L8HJ43|L8HJ43_ACACA  75.463  8   216  218  213  98   2.83e-113  1119
sp|Q8WUD1|RAB2B_HUMAN  tr|L8H2N2|L8H2N2_ACACA  70.046  2   217  216  216  100  4.48e-110  1090
sp|P61019|RAB2A_HUMAN  tr|L8H2N2|L8H2N2_ACACA  69.585  6   217  212  216  100  2.42e-109  1083
sp|P20340|RAB6A_HUMAN  tr|L8GN15|L8GN15_ACACA  81.287  0   171  208  195  82   1.25e-104  1037
sp|Q9NRW1|RAB6B_HUMAN  tr|L8GN15|L8GN15_ACACA  75.843  0   178  208  195  86   1.36e-103  1028
sp|P51149|RAB7A_HUMAN  tr|L8HHF3|L8HHF3_ACACA  71.635  23  208  207  186  100  6.25e-103  1021
sp|P61018|RAB4B_HUMAN  tr|L8HCK2|L8HCK2_ACACA  65.888  11  214  213  204  100  2.29e-102  1018
sp|P20338|RAB4A_HUMAN  tr|L8HCK2|L8HCK2_ACACA  65.421  11  214  218  204  98   9.55e-99   986
sp|Q13637|RAB32_HUMAN  tr|L8HAC7|L8HAC7_ACACA  68.681  2   182  225  220  80   9.57e-94   943
sp|P57729|RAB38_HUMAN  tr|L8HAC7|L8HAC7_ACACA  69.886  1   176  211  220  83   1.31e-93   940
sp|P57735|RAB25_HUMAN  tr|L8HJ43|L8HJ43_ACACA  59.330  4   209  213  213  96   2.16e-91   920
sp|P62820|RAB1A_HUMAN  tr|L8HD31|L8HD31_ACACA  69.412  0   170  205  210  83   4.80e-91   915
sp|Q9H0U4|RAB1B_HUMAN  tr|L8HD31|L8HD31_ACACA  69.412  0   170  201  210  85   6.47e-91   914
sp|Q92928|RAB1C_HUMAN  tr|L8HD31|L8HD31_ACACA  66.864  0   169  201  210  84   5.35e-85   860
... TRUNCATED

6.3.1.6 Filter out unique HUMAN vs ACA RBHs using AWK hashes from the sorted blast output table to identify orthologs as RBHs

Note: if you need help with AWK and Linux for bioinformatics, visit my extensive AWK tutorial for bioinformatics here

awk -v qcov="$min_qcov" 'BEGIN{FS=OFS="\t"}{HUMANid[$1]++; ACAid[$2]++; if(HUMANid[$1] == 1 && ACAid[$2] == 1 && $8 >= qcov) print }' HUMAN_63Rabs_vs_ACACA_besthits_m6.out.sorted | column -t
awk -v qcov="$min_qcov" 'BEGIN{FS=OFS="\t"}{HUMANid[$1]++; ACAid[$2]++; if(HUMANid[$1] == 1 && ACAid[$2] == 1 && $8 >= qcov) print }' HUMAN_63Rabs_vs_ACACA_besthits_m6.out.sorted > HUMAN_63Rabs_vs_ACACA_RBHs.tsv

The final table of fifteen HUMAN vs. ACACA Rab-GTPase RBHs (plus the Ran RBH), which are likely orthologs, filtered by

qcov >=
$min_qcov

sp|P62826|RAN_HUMAN    tr|L8GF16|L8GF16_ACACA  82.524  0   206  216  213  95   7.64e-129  1259
sp|P61106|RAB14_HUMAN  tr|L8H946|L8H946_ACACA  73.953  2   215  215  213  100  2.88e-117  1154
sp|P62491|RB11A_HUMAN  tr|L8HJ43|L8HJ43_ACACA  75.829  0   211  216  213  98   9.90e-117  1150
sp|Q8WUD1|RAB2B_HUMAN  tr|L8H2N2|L8H2N2_ACACA  70.046  2   217  216  216  100  4.48e-110  1090
sp|P20340|RAB6A_HUMAN  tr|L8GN15|L8GN15_ACACA  81.287  0   171  208  195  82   1.25e-104  1037
sp|P51149|RAB7A_HUMAN  tr|L8HHF3|L8HHF3_ACACA  71.635  23  208  207  186  100  6.25e-103  1021
sp|P61018|RAB4B_HUMAN  tr|L8HCK2|L8HCK2_ACACA  65.888  11  214  213  204  100  2.29e-102  1018
sp|Q13637|RAB32_HUMAN  tr|L8HAC7|L8HAC7_ACACA  68.681  2   182  225  220  80   9.57e-94   943
sp|P62820|RAB1A_HUMAN  tr|L8HD31|L8HD31_ACACA  69.412  0   170  205  210  83   4.80e-91   915
sp|Q9NP72|RAB18_HUMAN  tr|L8GYY7|L8GYY7_ACACA  58.163  2   196  206  209  95   4.10e-80   816
sp|P51151|RAB9A_HUMAN  tr|L8GPH2|L8GPH2_ACACA  51.961  3   204  201  205  100  1.18e-76   784
sp|P51153|RAB13_HUMAN  tr|L8HGZ3|L8HGZ3_ACACA  58.282  6   163  203  208  80   1.16e-71   739
sp|Q14964|RB39A_HUMAN  tr|L8GSP6|L8GSP6_ACACA  51.124  6   178  217  200  82   8.44e-61   641
sp|Q13636|RAB31_HUMAN  tr|L8GKG8|L8GKG8_ACACA  46.734  9   199  194  218  98   2.73e-58   618
sp|Q969Q5|RAB24_HUMAN  tr|L8H2Q9|L8H2Q9_ACACA  50.549  18  182  203  211  83   6.20e-56   597

6.3.1.7 Filter human-Acanthamoeba blastp RBHs with a single one-liner

This final code runs all the filtering steps presented above for the two blastp output tables using a single neat one-liner

while read ACAid; do grep -w "$ACAid" HUMAN_63Rabs_vs_ACACA_besthits_m6.out; done < <(cut -f1 ACACA_vs_HUMAN_best_1hit_m6.out | sort -u) | sort -gk9,9 | awk -v qcov="$min_qcov" 'BEGIN{FS=OFS="\t"}{HUMANid[$1]++; ACAid[$2]++; if(HUMANid[$1] == 1 && ACAid[$2] == 1 && $8 >= qcov) print}' | column -t

6.3.2 HOMEWORK 4

Use R to run summary()) on the ACACA_vs_HUMAN_RBHs_m6.out output file to explore summary statistics of Acanthamoeba vs. human RBHs. What is the observed range of alignment length (length), pident, E-value, and score for these Acanthamoeba Rabs?
Based on the insights gained from the previous analysis, narrow down the set of Acanthamoeba paralogs stored in HUMAN_63Rabs_vs_ACACA_10hits_m6.out by filtering more stringently the set with the empirically determined cutoff values. After applying the new filter, how many unique Acanthamoeba homologs are found in the set?
Repeat the RBH exercise to find the Acanthamoeba orthologs of human ARF and ARL reference sequences provided in the HUMAN_25_ARF_and_ARL.faa file. For extra credit, add the qcovs (query-coverage) output to the table and use it to filter hits with >= 80 qcovs.
How many ARL8 Human paralogs does the HUMAN_25_ARF_and_ARL.faa file contain, and how many co-orthologs are found in the ACACA proteome? Use simple Linux filtering tools on the HUMAN_ARF_ARL_vs_ACACA_10hits_m6.out to find the answer. Confirm by blasting the ACACA ARL8 co-orthologs against the ACACA proteome.

6.4 Computing clusters of orthologous protein families with the aid of the compute_RBH_clusters.sh script

In the previous exercise we learned how to identify potential orthologs among pairs of proteomes by implementing a standard reciprocal best hits (RBHs) approach. In this section, we will use the compute_RBH_clusters.sh Bash script to compute orthologous protein clusters automatically, saving them to disk as FASTA files.

The compute_RBH_clusters.sh script extends the approach demonstrated in previous sections, computing RBHs between a set of proteomes (protein FASTA files) and a single reference proteome, which is automatically selected as the smallest one, or defined by the user. The computed RBH clusters (FASTA FILES) are provided as core_clusters and nonCore_clusters.

The latest version of the script is freely available from my GitHub repository.

Fetch the compute_RBH_clusters.sh script from your ~/bin directory with the following commands:

# cd into ~/bin
if [ -d ~/bin ]; then cd ~/bin; else mkdir ~/bin && cd ~/bin; fi

# fetch script and make it executable
wget -c https://raw.githubusercontent.com/vinuesa/TIB-filoinfo/master/compute_RBH_clusters.sh
chmod 755 compute_RBH_clusters.sh

# return to our working directory
cd -

Display the help menu of compute_RBH_clusters.sh

compute_RBH_clusters.sh

As seen in the output shown above, the user can customize the script’s parameters, but the defaults have been carefully chosen to be broadly useful. Depending on the available computing resources, you may want to increase the number of threads to use for the blastp processes with the -t option.

6.4.1 Computing clusters of orthologous Rab GTPases between human, Dictyostelium discoideum and Acanthamoeba castellanii with compute_RBH_clusters.sh

To finish our study of Rab GTPase orthologs, we will use compute_RBH_clusters.sh to RBH clusters as FASTA files with the following codes, assuming that we are in the UniProt_proteomes directory.

6.4.1.1 Running the compute_RBH_clusters.sh script

We will run under the default parameterization displayed in the previous section.

# assuming that we are in the UniProt_proteomes directory, make the RBHs subdirectory and cd into it
mkdir RAB_RBHs && cd RAB_RBHs

# generate symlinks to the following source FASTA files
ln -s ../HUMAN_63Rabs_Ran.faa .
ln -s ../ACACA.faa .
ln -s ../DICDI.faa .

# run compute_RBH_clusters.sh under default parameters
compute_RBH_clusters.sh -R 1 -d .

6.4.1.2 Structure and function of the compute_RBH_clusters.sh script

The image above displays the messages that the script prints to STDOUT. A brief description of each of the numbered sections (1-7) is provided below. along with relevant code snippets from the script:

Reports the parameterization of the run. Of particular relevance here are the parameters used for the blastp run.
The script automatically selected the smallest proteome (FASTA file), in this case the file holding the 63 human reference Rab and Ran proteins.

...TRUNCATED

function select_ref { 
     # Automatically selects the smallest genome as the reference.
     local ext
     ext=$1
     
     # find the genome with the smallest number of gene|protein sequences
     for f in *."${ext}"
     do 
          echo -ne "$f\t"
      grep -c '>' "$f"
     done | sort -k2g | awk 'NR == 1{print $1}'
}
...TRUNCATED

print_start_time && echo '# Selecting the smallest genome as the reference'
# automatically select the smallest reference, if not provided as ARG
if [[ -z "$ref" ]]; then
    ref=$(select_ref "$ext")
    echo "  - Selected $ref as the reference genome"
fi

After selecting the reference proteome, the script generates the proteome-specific blast databases with makeblastdb. In addition, the blastdb_aliastool is used to generate an aliased database called allDBs of all proteome-specific dbs.

# -------------------------------------------------------------------
# 2. run makeblastdb on all input proteomes with edited FASTA headers
# -------------------------------------------------------------------
# Each proteome is used to create a blastp database using makeblastdb.
print_start_time && echo "# Generating indexed blastp databases"

for f in *."${ext}"
do 
    makeblastdb -in "$f" -dbtype prot -parse_seqids &> /dev/null
done

print_start_time && echo "# Generating the aliased blastp database allDBs ..."
blastdb_aliastool -dblist_file <(printf '%s\n' "${infiles[@]}") -dbtype prot -out allDBs -title allDBs

printf '%s\n' '-----------------------------------------------------------------------------------------------------'

Section 4 provides information on the progress of the pairwise blastp searches between the reference proteome (query) and each of the non-reference proteomes used as subject dbs. Blastp is run from the run_blastp function shown below, as it contains advanced customizations for optimal calculation of RBHs.

function run_blastp {
   local task q db outfile
   task=$1 # default: blastp-fast
   q=$2
   db=$3
   outfile=$4
   
   # parameterizations in part following suggestions by Julie E. Hernandez-Salmeron�and Gabriel Moreno-Hagelsieb 
   # in BMC Genomics volume 21, Article number: 741 (2020); https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-07132-6
   blastp -task "$task" -query "$q" -db "$db" -matrix "$mat" -outfmt "$cols" -num_alignments $num_aln -num_threads $threads -qcov_hsp_perc "$qcov" \
    -seg "$seg" -soft_masking "$mask" -best_hit_overhang "$best_hit_overhang" -best_hit_score_edge "$best_hit_score_edge" -use_sw_tback \
    -evalue "$Eval" -out "$outfile" 2> /dev/null
    
   check_file "$outfile"
}

The final step in this section is to sort each output table by increasing E-value and unique REF vs nonREF hits using AWK hashes, which will be the RBHs, as we have seen in detail in section 6.3.1.

#-----------------------------------
# 3. Run and process pairwise blastp 
#-----------------------------------
# For each non-reference proteome, blastp is run against the reference proteome 
#   (REFvsGENO) and vice versa (GENOvsREF).

for f in "${non_ref[@]}"; do
    ref_vs_geno_blastout=${ref%.*}vs${f%.*}_best_hits.tmp
    geno_vs_ref_blastout=${f%.*}vs${ref%.*}_best_hits.tmp
    
    print_start_time && echo "# Running: run_blastp $task ${ref}ed ${f}ed $ref_vs_geno_blastout" 
    run_blastp "$task" "${ref}"ed "${f}"ed "$ref_vs_geno_blastout"

   # Retrieve the best nonREF proteome database hits using blastdbcmd, onlfy if qcov > \$qcov
   print_start_time && echo "# Retrieving the best hits from $ref_vs_geno_blastout with blastdbcmd ... "
   blastdbcmd -entry_batch <(awk -F"\t" -v qcov="$qcov" '$8 > qcov{print $2}' "$ref_vs_geno_blastout" | sort -u) -db "${f}" > "${ref%.*}vs${f%.*}"_besthits.faa
   check_file "${ref%.*}vs${f%.*}"_besthits.faa
   
   num_hits=$(grep -c '^>' "${ref%.*}vs${f%.*}"_besthits.faa)
   
   if ((num_hits == 0))
   then
        echo "WARNING: no hits in ${ref%.*}vs${f%.*}_besthits.faa"
          rm "${ref%.*}vs${f%.*}"_besthits.faa
        continue
   fi
   
    print_start_time && printf '%s\n' "# Running: run_blastp $task ${ref%.*}vs${f%.*}_besthits.faa ${ref}ed $geno_vs_ref_blastout ..."    
    run_blastp "$task" "${ref%.*}vs${f%.*}"_besthits.faa "${ref}"ed "$geno_vs_ref_blastout"

    # Sort the blastp output table from the preceding search by increasing E-values (in column 9) and decreasing scores (col 10)
    #    & filter out unique REF vs nonREF RBHs using AWK hashes from the sorted blast output table with qcov > $qcov
    print_start_time && echo "# Filtering out unique REF vs nonREF RBHs from the sorted blast output table with qcov > $qcov"
    for GENOid in $(cut -f1 "$geno_vs_ref_blastout" | sort -u)
    do 
       grep "$GENOid" "$ref_vs_geno_blastout"
    done | sort -gk9,9 -gk10,10 | \
           awk -v qcov="$qcov" 'BEGIN{FS=OFS="\t"}{REFid[$1]++; GENOid[$2]++; if(REFid[$1] == 1 && GENOid[$2] == 1 && $8 > qvov) print }' > \
             "${ref_vs_geno_blastout%.*}"_RBHs_qcov_gt"${qcov}".tsv
    
    check_file "${ref_vs_geno_blastout%.*}"_RBHs_qcov_gt"${qcov}".tsv

done

printf '%s\n' '-----------------------------------------------------------------------------------------------------'

This block computes the intersection of REF proteins in all tsv files holding the pairwise REF vs. nonREF blastp results. This intersection will be used in the following code section to compute the core clusters, that is, those shared by all analyzed proteomes.

#-----------------------------------------------------------------------
# 4. Identify REF proteins shared by all tsv files holding pairwise RBHs
#-----------------------------------------------------------------------
# Find the intersections of REFs in all tsv files
print_start_time && echo "# Computing the intersections of REF proteins in all tsv files holding pairwise RBHs ... "
awk '{r[$1]++; if(r[$1] == ARGC-1) print $1}' ./*.tsv > REF_RBH_IDs.list
[[ ! -s REF_RBH_IDs.list ]] && error "could not write REF_RBH_IDs.list"

intersection_size=$(wc -l REF_RBH_IDs.list | awk '{print $1}')
((intersection_size > 0)) && echo -e "${LBLUE}  Found $intersection_size core RBHs shared by ${#non_ref[*]} nonREF proteomes with the $ref reference proteome${NC}"
((intersection_size == 0)) && error "# ERROR: found $intersection_size core orhtologous genes among ${#infiles[*]} input proteomes ..."

printf '%s\n' '-----------------------------------------------------------------------------------------------------'

Section 6 computes the RBH_matrix, a table integrating the RBHs of the pairwise REF vs. nonREF blastp searches making use of the all_clusters hash. The RBH_matrix.tsv table is parsed to generate the core_genome_clusters.tsv nonCore_genome_clusters.tsv tables

#-------------------------------------------------------------------------------------------------------------------------
# 5. Loop over tsv files and generate RBHs_matrix.tsv core_genome_clusters.tsv and nonCore_genome_clusters.tsv tables
#-------------------------------------------------------------------------------------------------------------------------
# Cluster Computation

print_start_time && echo "# Computing clusters of homologous sequences ..."

# The core_ref hash counts the instances of the REFERNCE_IDs in the RBH tables
declare -A core_ref
core_ref=()

# The all_clusters hash is indexed by REFERNCE_IDs 
#   and as its value holds the RBHs as a tab-separated
#   string of IDs from nonREF proteomes    
declare -A all_clusters
all_clusters=()

# Construct the all_clusters hash, indexed by reference proteome, 
#  containing a value a string of tab-separated nonREF proteome RBH IDs.
print_start_time && echo "# Populating the all_clusters hash ..."
for t in *RBHs_*.tsv; do
    [ ! -s "$t" ] && error "file: $t does not exist or is empty"
    while read -r REF QUERY rest
    do
        # count the instances of REFERNCE_IDs in each RBHs_*.tsv tables
    (( core_ref["$REF"]++ ))
    if (( ${core_ref["$REF"]} == 1 ))
    then
        all_clusters["$REF"]="$QUERY" 
    else
        all_clusters["$REF"]="${all_clusters[$REF]}\t$QUERY"
    fi
    done < "$t" || { echo "Failed to process file: $t"; exit 1; } # required test for set -e compliance
done


# 5.1 print the RBHs_matrix.tsv
print_start_time && echo "# Printing the pangenome_matrix, core_genome_clusters, and nonCore_genome_clusters files ..."
for key in "${!all_clusters[@]}"
do
      echo -e "${key}\t${all_clusters[$key]}"
done > RBHs_matrix.tsv
check_file RBHs_matrix.tsv

# 5.2 print the core_genome_clusters.tsv
awk -v ninfiles="${#infiles[@]}" 'NF == ninfiles' RBHs_matrix.tsv > core_genome_clusters.tsv
check_file core_genome_clusters.tsv

# 5.3 print the nonCore_genome_clusters.tsv
awk -v ninfiles="${#infiles[@]}" 'NF != ninfiles' RBHs_matrix.tsv > nonCore_genome_clusters.tsv
check_file nonCore_genome_clusters.tsv warn

printf '%s\n' '-----------------------------------------------------------------------------------------------------'

The final code section reads the RBH_matrix.tsv file and extracts the RBH cluster proteins line by line, writing each line of protein IDs to a temporary (idstmp) file holding the cluster’s protein IDs. Once written to file, the idstmp file names are piped to xargs to pass them to blastdbcmd for parallel execution on all available cores. This is parallelization step is critically speeds up the process of writing cluster (FASTA) files to disk. Diverse benchmark analyses revealed that this (extracting and writing cluster files to disk) is the rate-liminting step. Finally, the core and nonCore clusters are moved to the equally-named directories, and intermediate files, including the blastdbs, are removed to save disk space and tidying the working directory.

#-----------------------------
# 6. Write cluster FASTA files
#-----------------------------

# 6.1 uses blastdbcmd but calling it in parallel with xargs 
# write the each line of the RBHs_matrix.tsv IDs to a tmpfile
#  to pass the list of tmpfiles to a parallel call of blastdbcmd
print_start_time && echo "# Writing each line of the RBHs_matrix.tsv IDs to an idstmp file por parallelization ..."

#initialize cluster counter
c=0
while read -r -a ids
do 
    ((c++)) 
    # write each line of the RBHs_matrix.tsv to a temporal file
    printf '%s\n' "${ids[@]}" > cluster_"${c}".idstmp
done < RBHs_matrix.tsv || { echo "Failed to process file: RBHs_matrix.tsv"; exit 1; } # required test for set -e compliance

# 6.2 Pass the list of tmpfiles to a parallel call of blastdbcmd, or, if not available, call it from xargs
if command -v parallel &> /dev/null 
then 
    print_start_time && echo "  - Running blastdbcmd through parallel with -j \$(nproc) ..."
    find . -name '*.idstmp' | parallel --gnu -j $(nproc) 'blastdbcmd -db allDBs -dbtype prot -entry_batch {} -out {.}.fas'  
else
    # use the more portable find | xargs idiom to parallelize the blastdbcmd call of parallel is not found on host
    print_start_time && echo "  - Running blastdbcmd in parallel with xargs using all available cores \$(nproc) ..."
    find . -name '*.idstmp' -print0 | xargs -0 -P $(nproc) -I % blastdbcmd -db allDBs -dbtype prot -entry_batch % -out %.fas

    # rename *.idstmp.fas cluster files with rename, if available
    if command -v rename &> /dev/null; 
    then 
        rename 's/\.idstmp//' *.fas
    fi
fi

# 6.3 filter out core and nonCore clusters
print_start_time && echo "# Moving core and nonCore clusters to their respective directories ..."

mkdir core_clusters || error "could not mkdir core_clusters"
mkdir nonCore_clusters || error "could not mkdir nonCore_clusters"
for f in cluster_*.fas
do 
    if [[ $(grep -c '^>' "$f") -eq "${#infiles[@]}" ]]
    then 
         mv "$f" core_clusters/core_"${f}" || error "could not mv $f to core_clusters/core_${f}"
    else
         mv "$f" nonCore_clusters/nonCore_"${f}" || error "could not mv $f to nonCore_clusters/nonCore_${f}"
    fi
done

printf '%s\n' '-----------------------------------------------------------------------------------------------------'

6.4.1.3 Output files generated by the compute_RBH_clusters.sh script

Let us now explore the files generated by the script, starting by a listing of the directory’s contents:

# list the directory contentes
ls -ltr

lrwxrwxrwx 1 vinuesa vinuesa     23 oct  8 19:38 HUMAN_63Rabs_Ran.faa -> ../HUMAN_63Rabs_Ran.faa
lrwxrwxrwx 1 vinuesa vinuesa     12 oct  8 19:38 ACACA.faa -> ../ACACA.faa
lrwxrwxrwx 1 vinuesa vinuesa     12 oct 22 18:04 DICDI.faa -> ../DICDI.faa
-rw-rw-r-- 1 vinuesa vinuesa    699 oct 27 18:58 HUMAN_63Rabs_RanvsACACA_best_hits_RBHs_qcov_gt60.tsv.gz
-rw-rw-r-- 1 vinuesa vinuesa    767 oct 27 18:58 HUMAN_63Rabs_RanvsDICDI_best_hits_RBHs_qcov_gt60.tsv.gz
-rw-rw-r-- 1 vinuesa vinuesa   1464 oct 27 18:58 RBHs_matrix.tsv
-rw-rw-r-- 1 vinuesa vinuesa    800 oct 27 18:58 nonCore_genome_clusters.tsv
-rw-rw-r-- 1 vinuesa vinuesa    664 oct 27 18:58 core_genome_clusters.tsv
drwxrwxr-x 2 vinuesa vinuesa   4096 oct 27 18:58 core_clusters
drwxrwxr-x 2 vinuesa vinuesa   4096 oct 27 18:58 nonCore_clusters

A key file is RBHs_matrix.tsv, a table integrating the RBHs of the pairwise REF vs. nonREF blastp searches making use of the all_clusters hash.

column -t RBHs_matrix.tsv

sp|Q9NP72|RAB18_HUMAN  tr|L8GYY7|L8GYY7_ACACA  sp|Q54GY8|RAB18_DICDI
sp|Q15286|RAB35_HUMAN  sp|P34140|RAB1B_DICDI   
sp|P61019|RAB2A_HUMAN  sp|P36409|RAB2A_DICDI   
sp|Q9ULC3|RAB23_HUMAN  sp|Q559X6|RAB2B_DICDI   
sp|P61006|RAB8A_HUMAN  sp|P20790|RAB8A_DICDI   
sp|Q15907|RB11B_HUMAN  sp|P36412|RB11A_DICDI   
sp|Q9H082|RB33B_HUMAN  sp|O76173|RAB1C_DICDI   
sp|P62826|RAN_HUMAN    tr|L8GF16|L8GF16_ACACA  sp|P33519|RAN_DICDI
sp|Q9NX57|RAB20_HUMAN  sp|Q54RX9|RAB5B_DICDI   
sp|P57729|RAB38_HUMAN  tr|L8HAC7|L8HAC7_ACACA  sp|Q54QR3|RB32A_DICDI
sp|Q9UL25|RAB21_HUMAN  sp|P34142|RAB21_DICDI   
sp|P61106|RAB14_HUMAN  tr|L8H946|L8H946_ACACA  sp|P36410|RAB14_DICDI
sp|P62491|RB11A_HUMAN  tr|L8HJ43|L8HJ43_ACACA  
sp|P20339|RAB5A_HUMAN  sp|Q86JP3|RAB5A_DICDI   
sp|P62820|RAB1A_HUMAN  tr|L8HD31|L8HD31_ACACA  sp|P34139|RAB1A_DICDI
sp|P51151|RAB9A_HUMAN  tr|L8GPH2|L8GPH2_ACACA  
sp|Q8WUD1|RAB2B_HUMAN  tr|L8H2N2|L8H2N2_ACACA  
sp|Q14964|RB39A_HUMAN  tr|L8GSP6|L8GSP6_ACACA  
sp|Q96AX2|RAB37_HUMAN  sp|P20791|RAB8B_DICDI   
sp|Q969Q5|RAB24_HUMAN  tr|L8H2Q9|L8H2Q9_ACACA  sp|Q55FU9|RAB24_DICDI
sp|O95755|RAB36_HUMAN  tr|L8H339|L8H339_ACACA  
sp|Q13636|RAB31_HUMAN  tr|L8GKG8|L8GKG8_ACACA  
sp|Q6IQ22|RAB12_HUMAN  tr|L8GIC5|L8GIC5_ACACA  sp|Q54NU2|RAB1D_DICDI
sp|P61018|RAB4B_HUMAN  tr|L8HCK2|L8HCK2_ACACA  sp|Q54DA7|RAB4_DICDI
sp|P51153|RAB13_HUMAN  tr|L8HGZ3|L8HGZ3_ACACA  
sp|P20340|RAB6A_HUMAN  tr|L8GN15|L8GN15_ACACA  sp|Q55FK2|RAB6_DICDI
sp|Q92930|RAB8B_HUMAN  tr|L8HGB4|L8HGB4_ACACA  
sp|P51149|RAB7A_HUMAN  tr|L8HHF3|L8HHF3_ACACA  sp|P36411|RAB7A_DICDI

The RBH_matrix.tsv table is parsed based on the number of fields in each record to generate the core_genome_clusters.tsv (RBHs shared by the three proteomes) and nonCore_genome_clusters.tsv (RBHs shared only between the REF and a nonREF proteome) tables.

Display contents of the core_genome_clusters.tsv table

column -t core_genome_clusters.tsv

sp|Q9NP72|RAB18_HUMAN  tr|L8GYY7|L8GYY7_ACACA  sp|Q54GY8|RAB18_DICDI
sp|P62826|RAN_HUMAN    tr|L8GF16|L8GF16_ACACA  sp|P33519|RAN_DICDI
sp|P57729|RAB38_HUMAN  tr|L8HAC7|L8HAC7_ACACA  sp|Q54QR3|RB32A_DICDI
sp|P61106|RAB14_HUMAN  tr|L8H946|L8H946_ACACA  sp|P36410|RAB14_DICDI
sp|P62820|RAB1A_HUMAN  tr|L8HD31|L8HD31_ACACA  sp|P34139|RAB1A_DICDI
sp|Q969Q5|RAB24_HUMAN  tr|L8H2Q9|L8H2Q9_ACACA  sp|Q55FU9|RAB24_DICDI
sp|Q6IQ22|RAB12_HUMAN  tr|L8GIC5|L8GIC5_ACACA  sp|Q54NU2|RAB1D_DICDI
sp|P61018|RAB4B_HUMAN  tr|L8HCK2|L8HCK2_ACACA  sp|Q54DA7|RAB4_DICDI
sp|P20340|RAB6A_HUMAN  tr|L8GN15|L8GN15_ACACA  sp|Q55FK2|RAB6_DICDI
sp|P51149|RAB7A_HUMAN  tr|L8HHF3|L8HHF3_ACACA  sp|P36411|RAB7A_DICDI

Display contents of the nonCore_genome_clusters.tsv table

column -t nonCore_genome_clusters.tsv

sp|Q15286|RAB35_HUMAN  sp|P34140|RAB1B_DICDI
sp|P61019|RAB2A_HUMAN  sp|P36409|RAB2A_DICDI
sp|Q9ULC3|RAB23_HUMAN  sp|Q559X6|RAB2B_DICDI
sp|P61006|RAB8A_HUMAN  sp|P20790|RAB8A_DICDI
sp|Q15907|RB11B_HUMAN  sp|P36412|RB11A_DICDI
sp|Q9H082|RB33B_HUMAN  sp|O76173|RAB1C_DICDI
sp|Q9NX57|RAB20_HUMAN  sp|Q54RX9|RAB5B_DICDI
sp|Q9UL25|RAB21_HUMAN  sp|P34142|RAB21_DICDI
sp|P62491|RB11A_HUMAN  tr|L8HJ43|L8HJ43_ACACA
sp|P20339|RAB5A_HUMAN  sp|Q86JP3|RAB5A_DICDI
sp|P51151|RAB9A_HUMAN  tr|L8GPH2|L8GPH2_ACACA
sp|Q8WUD1|RAB2B_HUMAN  tr|L8H2N2|L8H2N2_ACACA
sp|Q14964|RB39A_HUMAN  tr|L8GSP6|L8GSP6_ACACA
sp|Q96AX2|RAB37_HUMAN  sp|P20791|RAB8B_DICDI
sp|O95755|RAB36_HUMAN  tr|L8H339|L8H339_ACACA
sp|Q13636|RAB31_HUMAN  tr|L8GKG8|L8GKG8_ACACA
sp|P51153|RAB13_HUMAN  tr|L8HGZ3|L8HGZ3_ACACA
sp|Q92930|RAB8B_HUMAN  tr|L8HGB4|L8HGB4_ACACA

Finally, let us explore the FASTA files in the core_clusters directory holding the RBH clusters shared across the three proteomes analyzed:

list the directory contents:

cd core_clusters && ls

core_cluster_10.fas  core_cluster_15.fas  core_cluster_20.fas  core_cluster_24.fas  core_cluster_28.fas
core_cluster_12.fas  core_cluster_1.fas   core_cluster_23.fas  core_cluster_26.fas  core_cluster_8.fas

Visualize the contents of the 10 core_cluster*.fas files

for f in *fas; do  echo "# $f" && cat $f; echo '------------------------------------------------------------------'; done

# core_cluster_10.fas
>P57729 Ras-related protein Rab-38 OS=Homo sapiens OX=9606 GN=RAB38 PE=1 SV=1
MQAPHKEHLYKLLVIGDLGVGKTSIIKRYVHQNFSSHYRATIGVDFALKVLHWDPETVVRLQLWDIAGQERFGNMTRVYY
REAMGAFIVFDVTRPATFEAVAKWKNDLDSKLSLPNGKPVSVVLLANKCDQGKDVLMNNGLKMDQFCKEHGFVGWFETSA
KENINIDEASRCLVKHILANECDLMESIEPDVVKPHLTSTKVASCSGCAKS
>L8HAC7 Ras-related protein Rab OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_126680 PE=3 SV=1
MAAEQEDGVTEYLYKVLVVGDIGTGKTSIIKRYVHNIFSMHYKSTIGVDFALKVINWDPKTIVRLQLWDIAGQERFGNMT
RVYYKEAVGAFVVFDVTRVNTFEAVQKWKNDIDNKVTLPPDDRPIPVVLLANKCDLAKEGFARNAAQMDKYCEEHGFAGW
EETSAKENINIDKAVKFLVSKILENDINRQNQLQKKDENIVKVGAGGPEPKKAGAGGCCG
>Q54QR3 Ras-related protein Rab-32A OS=Dictyostelium discoideum OX=44689 GN=rab32A PE=1 SV=1
MSNNPADDEGYNEYLYKILVVGDIGTGKTSIIKRFVHNIFSMHYKSTIGVDFALKVINWDPKTEVRLQLWDIAGQERFGS
MTRVYYKEAVGAMITFDVTRMSTFEAVAKWKADIDSKVTYGADEKPIPVVLLANKCDLGKDAFIKTANDMDKYCKDNGFI
GWFETSAKENMNIEKAARFLVDHILKNDVRRNQPIEGTIQPGDLNKQPQPTSTGPSCCK
------------------------------------------------------------------
# core_cluster_12.fas
>P61106 Ras-related protein Rab-14 OS=Homo sapiens OX=9606 GN=RAB14 PE=1 SV=4
MATAPYNYSYIFKYIIIGDMGVGKSCLLHQFTEKKFMADCPHTIGVEFGTRIIEVSGQKIKLQIWDTAGQERFRAVTRSY
YRGAAGALMVYDITRRSTYNHLSSWLTDARNLTNPNTVIILIGNKADLEAQRDVTYEEAKQFAEENGLLFLEASAKTGEN
VEDAFLEAAKKIYQNIQDGSLDLNAAESGVQHKPSAPQGGRLTSEPQPQREGCGC
>L8H946 Ras-related protein Rab-14 OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_279530 PE=3 SV=1
MASFPYEYIFKYIIIGDMGVGKSCLLHQFTEKKFIPDSPHTIGVEFGTRIIEVMGKKIKLQIWDTAGQERFRAVTRSYYR
GAAGALLVYDITRRPTYNHLTSWLTDARNLTNPNTVIMMIGNKKDLEDQRDVTYEEASNFAKENGLIFLEASAKTGENVE
EAFLKTAKLIFQSVQDGSVDVGSDIGVTKKTPLVPPSGGSGGGAGDGAGGCSC
>P36410 Ras-related protein Rab-14 OS=Dictyostelium discoideum OX=44689 GN=rab14 PE=1 SV=2
MSFPYEYIFKYIIIGDMGVGKSCLLHQFTENKFVPDSPHTIGVEFGTRIVDVNNKKIKLQIWDTAGQERFRAVTRSYYRG
AAGALLVYDITRRITYNHLTTWLTDARNLTNPNTVIMLIGNKKDLEGQRDVTYEEASAFAKQNGLIFVESSAKTGENVEE
AFLRTAKLIFQSVQEGNVDLIPDGGITKNPPQTITDKPQDASKCSC
------------------------------------------------------------------
# core_cluster_15.fas
>P62820 Ras-related protein Rab-1A OS=Homo sapiens OX=9606 GN=RAB1A PE=1 SV=3
MSSMNPEYDYLFKLLLIGDSGVGKSCLLLRFADDTYTESYISTIGVDFKIRTIELDGKTIKLQIWDTAGQERFRTITSSY
YRGAHGIIVVYDVTDQESFNNVKQWLQEIDRYASENVNKLLVGNKCDLTTKKVVDYTTAKEFADSLGIPFLETSAKNATN
VEQSFMTMAAEIKKRMGPGATAGGAEKSNVKIQSTPVKQSGGGCC
>L8HD31 Rasrelated protein ORAB-1, putative OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_070630 PE=4 SV=1
MSDYDHLFKILMVGDSGVGKSSLLLRFTDDTFTDNFISTIGVDFKIRTVNLDGKVIKMQIWDTAGQERFRTITSSYYRGA
HGVILVYDVTDQVSFNNARQWLTEIERYACGNVVKLLVGNKSDLVSKRVVSTATGKEFADQFHLPFIEASAKDGSNVKQA
FMTLVKEVYEKVVGDSASSGGGSLGQSDNNKVNVAAAKPEKKKGGLKCIL
>P34139 Ras-related protein Rab-1A OS=Dictyostelium discoideum OX=44689 GN=rab1A PE=2 SV=2
MNPDYHYLFKLLLIGDSGVGKSCLLLRFADDTYSESFISTIGVDFKIRTIELNGKTIKLQIWDTAGQERFRTITSSYYRG
AHGIIVVYDVTDKLTFENVRQWLQEIDRFACENVNKLLVGNKSDLVAKKVVDFNTAKAFADSLQIPFLETSAKQSTNVEQ
AFMTMATEIKNRLTASQPTQTVDKNKVVPGSSAPISPKSGCC
------------------------------------------------------------------
# core_cluster_1.fas
>Q9NP72 Ras-related protein Rab-18 OS=Homo sapiens OX=9606 GN=RAB18 PE=1 SV=1
MDEDVLTTLKILIIGESGVGKSSLLLRFTDDTFDPELAATIGVDFKVKTISVDGNKAKLAIWDTAGQERFRTLTPSYYRG
AQGVILVYDVTRRDTFVKLDNWLNELETYCTRNDIVNMLVGNKIDKENREVDRNEGLKFARKHSMLFIEASAKTCDGVQC
AFEELVEKIIQTPGLWESENQNKGVKLSHREEGQGGGACGGYCSVL
>L8GYY7 Rasrelated protein Rab-18-like, putative OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_369270 PE=4 SV=1
MEGTTGSASHFDHLFKLLIIGDSSVGKSSILLRFTEDEFDDEHPVTIGVDFKVKTIQLGAKRINLTIWDTAGQEKFRSLT
SSYYRGTQGIILVYDVSSRESFQHLSVWLNEIEMYCNNSDVVKLLVANKIDLGDRQVSREEGLAFAKSKAMVFIECSAKT
KLGIQQAFEELVTKILETPSLYEKETKATGTVSMNQDSEEQQEGVCSYC
>Q54GY8 Ras-related protein Rab-18 OS=Dictyostelium discoideum OX=44689 GN=rab18 PE=3 SV=1
MEEDKQYKVLLIGDSDVGKTSIVKRFSDDTFDEDLLCTIGVEFKMKEVKVDGKKVDLCIWDTAGQEKFRALISSYYRGAH
GIILTYDVTKRESFDNLNYWLNEVENFANRSNLVKLLVGNKIDKENREVTREEGAEFAKKKAMLFIECSAKSKIGIQQAF
EELAQKIIEIPQNTSSSQPKQRNTGSVKVEDEPDHNQGVCSC
------------------------------------------------------------------
# core_cluster_20.fas
>Q969Q5 Ras-related protein Rab-24 OS=Homo sapiens OX=9606 GN=RAB24 PE=1 SV=1
MSGQRVDVKVVMLGKEYVGKTSLVERYVHDRFLVGPYQNTIGAAFVAKVMSVGDRTVTLGIWDTAGSERYEAMSRIYYRG
AKAAIVCYDLTDSSSFERAKFWVKELRSLEEGCQIYLCGTKSDLLEEDRRRRRVDFHDVQDYADNIKAQLFETSSKTGQS
VDELFQKVAEDYVSVAAFQVMTEDKGVDLGQKPNPYFYSCCHH
>L8H2Q9 Ras family protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_269260 PE=4 SV=1
MASVDLKVVLLGQSDVGKTCLVDRYLNGKYQDTISPTVGAAFGAKKLYVGDRPLTVGVWDTAGAERFEAMSRLYYRAARA
ACVCYDLTRAETWPKVKFWINELLSFEKECALYIVGTKADLLEEGVPRGVEASEVADYARQINAQVFDALLIPCPHVTFE
TSSKKGYSVEDLFDTIVYDFVKRGAGGEVPLGGGGGGDVITPSEAPPACSC
>Q55FU9 Ras-related protein Rab-24 OS=Dictyostelium discoideum OX=44689 GN=rab24 PE=3 SV=1
MTKTKIDLKVVLLGYASVGKTCIVTRYTSGQFGDTHTTIGGAFSSKRVVVGETEVLLGIWDTAGTERYQAVNVSYYRRAN
AAIVCYDLTNRESWEKVTFWAEELTQNEPEIEIYIVGTKLDLIQQGDIKAVPEEEVKQTARRYKAHIFETSSRTGENVSL
LFQTIAEDFCKRTNNGTNPVNSNPSNVVNVNTQTQKKKGGCC
------------------------------------------------------------------
# core_cluster_23.fas
>Q6IQ22 Ras-related protein Rab-12 OS=Homo sapiens OX=9606 GN=RAB12 PE=1 SV=3
MDPGAALQRRAGGGGGLGAGSPALSGGQGRRRKQPPRPADFKLQVIIIGSRGVGKTSLMERFTDDTFCEACKSTVGVDFK
IKTVELRGKKIRLQIWDTAGQERFNSITSAYYRSAKGIILVYDITKKETFDDLPKWMKMIDKYASEDAELLLVGNKLDCE
TDREITRQQGEKFAQQITGMRFCEASAKDNFNVDEIFLKLVDDILKKMPLDILRNELSNSILSLQPEPEIPPELPPPRPH
VRCC
>L8GIC5 Ras family protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_399850 PE=4 SV=1
MNSDASDTPDYLLKIVIVGDSGVGKSNIMSRFTNDTFEESSKTTIGVAFATKNMAVETSNDSTKDCKKQVKLQVWDTAGQ
ERYRALSSAYYRGAQGALVVYDITSRESLDSIPKWLEEIDKYCTQDVVVTLVGNKLDLSENRCVSVEEGKKVAARENMFF
IETSAKDATNIEKAFTHLIKEIIQSNLSQANNDGKIAGKAPSDTGITLKPPEEPESDPNAPPQPGCSC
>Q54NU2 Ras-related protein Rab-1D OS=Dictyostelium discoideum OX=44689 GN=rab1D PE=1 SV=1
MSMAPEHDFFFKILLIGDSGVGKSCLLLRFADDSWTDTHISTIGVDFKIKTLNLDGKTIKLQIWDTAGQERFRTITSSYY
RGAQGIILVYDCTDQDSFTNVKQWMGEIDRYACENVNKLLVGNKTDLVNEKVVDSNQAKSFAESYGIPFIETSAKNATNV
EECFISMARDIKNRLADIQETPKPDEVDIKSKNKTKSGGKKSFC
------------------------------------------------------------------
# core_cluster_24.fas
>P61018 Ras-related protein Rab-4B OS=Homo sapiens OX=9606 GN=RAB4B PE=1 SV=1
MAETYDFLFKFLVIGSAGTGKSCLLHQFIENKFKQDSNHTIGVEFGSRVVNVGGKTVKLQIWDTAGQERFRSVTRSYYRG
AAGALLVYDITSRETYNSLAAWLTDARTLASPNIVVILCGNKKDLDPEREVTFLEASRFAQENELMFLETSALTGENVEE
AFLKCARTILNKIDSGELDPERMGSGIQYGDASLRQLRQPRSAQAVAPQPCGC
>L8HCK2 RAB4, putative OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_036030 PE=4 SV=1
MAERYNYLFKFIIIGDAATGKSCLLHRFIDNKFMRDSTHTIGVEFGSKIINIQGNNIKLQIWDTAGQVTRSYYRGAAGAL
LVYDITSRESYNHLSTWLEDARTLASAGIVIILVGNKKDLEDEREVTFIEASRFAQENNLMFLETSALTGDNVEEVFLKC
ARSIFSKVEAKAVDPKDSGSGIVENNMARPGSAVATGGSDGCAC
>Q54DA7 Ras-related protein Rab-4 OS=Dictyostelium discoideum OX=44689 GN=rab4 PE=3 SV=1
MGKPLYLVKFIIIGSQSVGKSCLLFRFIDNKFKSQSTHTIGVDFSSRVVDIQGKNVKLQIWDTAGQERFRSVVISYYRGS
AGVALVYDVTNRESYNHITNWLSDVKSLASPDVTIILVGNKADLTEQREVTFLEASRIAQENGLLFMETSALTGEGVEEM
FLKCTRTIMTKIDSGEVNLEHLGVQISSDSGNVTKKPGDSSNCSC
------------------------------------------------------------------
# core_cluster_26.fas
>P20340 Ras-related protein Rab-6A OS=Homo sapiens OX=9606 GN=RAB6A PE=1 SV=3
MSTGGDFGNPLRKFKLVFLGEQSVGKTSLITRFMYDSFDNTYQATIGIDFLSKTMYLEDRTVRLQLWDTAGQERFRSLIP
SYIRDSTVAVVVYDITNVNSFQQTTKWIDDVRTERGSDVIIMLVGNKTDLADKRQVSIEEGERKAKELNVMFIETSAKAG
YNVKQLFRRVAAALPGMESTQDRSREDMIDIKLEKPQEQPVSEGGCSC
>L8GN15 GTPbinding protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_139260 PE=4 SV=1
MAGSNPLAKYKLVFLGDQSVGKTSIITRFMYDTFDNTYQATIGIDFLSKTMYLEDRTIRLQLWDTAGQERFRSLIPSYIR
DSSVAIVVYDITNRTTFLDTARWVEDIRAERGNDVILMLVGNKTDLADKRCVSAEEGESKAKEFGAMFIETSAKAGYNVK
PLFRQVAAALPGIDPIIQKQNTRKWKPAPARCERL
>Q55FK2 Ras-related protein Rab-6 OS=Dictyostelium discoideum OX=44689 GN=rab6 PE=3 SV=1
MESLSKYKLVFLGDQSVGKTSIITRFMYDTFDITYQATIGIDFLSKTMYLEDRTVRLQLWDTAGQERFRSLIPSYIRDSS
VAIVVYDITNKVSFTNTIKWIEDVRNERGNNVVIMLVGNKTDLADKRQVSIEEGEAKAKEYDIMFTETSAKAGFNIKALF
RKVASALPGIDSNSMTKNQHEIISEVDITDGGKPALKEGDGFCSSSRC
------------------------------------------------------------------
# core_cluster_28.fas
>P51149 Ras-related protein Rab-7a OS=Homo sapiens OX=9606 GN=RAB7A PE=1 SV=1
MTSRKKVLLKVIILGDSGVGKTSLMNQYVNKKFSNQYKATIGADFLTKEVMVDDRLVTMQIWDTAGQERFQSLGVAFYRG
ADCCVLVFDVTAPNTFKTLDSWRDEFLIQASPRDPENFPFVVLGNKIDLENRQVATKRAQAWCYSKNNIPYFETSAKEAI
NVEQAFQTIARNALKQETEVELYNEFPEPIKLDKNDRAKASAESCSC
>L8HHF3 Ras-related protein Rab-7A OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_074580 PE=4 SV=1
MSTRKKVLLKIIILGDSGVGKTSLMNQYVNKKFSNQYKATIGADFLTKEVMVDDKLVTLQIWDTAGQERFQSLGVAFYRG
ADACVLVFDVNAGPRDPETFPFIVLGNKIDLESSRVVSQKRAQTWCQSKGNIPYFETSAKEAINVEQAFQTIAKNAMKEE
EDVDLAGITDAIQLDKSEPSQSGCSC
>P36411 Ras-related protein Rab-7a OS=Dictyostelium discoideum OX=44689 GN=rab7A PE=1 SV=1
MATKKKVLLKVIILGDSGVGKTSLMNQYVNKKFSNQYKATIGADFLTKELMVDDRVVTMQIWDTAGQERFQSLGVAFYRG
ADCCVLCYDVNVAKTFENLDSWRDEFLIQAGPRDPDNFPFVVLGNKIDLENQRVVSQKRAASWCQSKGNIPYFETSAKEA
INVEQAFQTIARNAIKLEDGLVFPIPTNIQVIPEPQPAKSGCC
------------------------------------------------------------------
# core_cluster_8.fas
>P62826 GTP-binding nuclear protein Ran OS=Homo sapiens OX=9606 GN=RAN PE=1 SV=3
MAAQGEPQVQFKLVLVGDGGTGKTTFVKRHLTGEFEKKYVATLGVEVHPLVFHTNRGPIKFNVWDTAGQEKFGGLRDGYY
IQAQCAIIMFDVTSRVTYKNVPNWHRDLVRVCENIPIVLCGNKVDIKDRKVKAKSIVFHRKKNLQYYDISAKSNYNFEKP
FLWLARKLIGDPNLEFVAMPALAPPEVVMDPALAAQYEHDLEVAQTTALPDEDDDL
>L8GF16 GTP-binding nuclear protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_260190 PE=3 SV=1
MQGQVPTFKLILVGDGGVGKTTFVKRHLTGEFEKKYVATLGVEVHPLSFHTNFGPICFNVWDTAGQEKFGGLRDGYYIQG
QCAIIMFDVTSRITYKNVPNWHRDLVRVCENIPIVLVGNKVDVKDRKVKAKAITFHRKKNLQYYDISAKSNYNFEKPFLF
LARKLTNKADLALVESPALAPPEVALDMNQIKQYEQELSLAAQTPLPDEDEDL
>P33519 GTP-binding nuclear protein Ran OS=Dictyostelium discoideum OX=44689 GN=ranA PE=1 SV=1
MAEKEQIKLVLVGDGGVGKTTFVQRHLTGEFEPRYIPTLGVSVHPLIFYTNFGKIHFNVWDTAGQEKFGGLRDGYYIQGN
CAIIMFDVTSRISYKNVPNWHSDLTRVCENIPIVLCGNKVDVKDRKVKPSQIVFHRRYNLSYYDVSAKSNYNFEKPFVWL
TSKLLGNKAVTLVQQPTLKLPETVLDSNLMSLYEKEVADAAALPLPEDNDDL

In summary, the compute_RBH_clusters.sh script is a helpful and efficient tool to compute clusters of core and non-core orthologs shared between a reference proteome an a collection of additional ones, including the generation of the corresponding FASTA files.

6.4.2 HOMEWORK 5

6.4.2.1 Exercise 1

Use the compute_RBH_clusters.sh script to compute the RBHs shared between the HUMAN_25_ARF_and_ARL.faa proteins, and the ACACA.faa, and DICDI.faa proteomes.

cd back into the UniProt_proteomes directory
Generate the ARL_ARF_RBHS dir, cd into it, and make symlinks for the following source files

mkdir ARL_ARF_RBHS && cd ARL_ARF_RBHS
ln -s ../HUMAN_25_ARF_and_ARL.faa .
ln -s ../ACACA.faa .
ln -s ../DICDI.faa .

6.5 EXERCISE 5: PROTEIN IDENTIFICATION/ANNOTATION using blastp

AIMS & data:

run $blastp$ to annotate a protein with its most likely function
identify CDSs (genes) on anonymous DNA strands in the query file: 3cass_amplicons.fna by means of $blastx$. To do so, we need to format a blastx-searchable DB using the protein sequences provided in the file integron_cassettes4blastdb.faa

Background info: The 3cass_amplicons.fna sequences were generated in our laboratory from diverse multidrug-resistant Enterobacteriaceae isolates recovered from polluted rivers in Morelos, Mexico. We used PCR to screen our collection of isolates for the presence of the intI integrase gene with the conserved HS458/HS459 primers, which is a hallmark of class-I integrons. Positive isolates were then subjected to PCR with the conserved CS-F & CS-R primers that amplify gene cassette arrays captured by integrons.

Figure 2. The structure of a class-I integron.

If you like, have a look at this nice video-protocol on Screening Foodstuffs for Class 1 Integrons and Gene Cassettes by Liette S. Waldron and Michael R. Gillings from the Department of Biological Sciences, Macquarie University. June 19, 2015 in J Vis Exp. 2015; (100): 52889. doi: 10.3791/52889.

6.5.1 Prepare working directories

extract the sequence data from the compressed tar file

# Move into the BLAST_DB_AA dir and extract the sequences in the tgz file
cd $wkdir && cd BLAST_DB_AA 

tar -xvzf gene_discovery_and_annotation_using_blastx.tgz

6.5.2 Exploring and fixing FASTA headers for database indexing

6.5.2.1 Explore the FASTA headers

# count sequences
grep -c integron_cassettes4blastdb.faa  # 2297 sequences

# explore headers
grep '^>' integron_cassettes4blastdb.faa | head -20

>gi|17026024 |[Achromobacter xylosoxidans]|AXI2|||||||Japan:Maebashi||clinical strain; synonym:Alcaligenes xylosoxidans||blaIMP-10|metallo-beta-lactamase IMP-10|741|AB074435|Iyobe_S.|Antimicrob.AgentsChemother.46_2014-2016_2002|12019129
>gi|17026026 |[Acinetobacter baumannii]|ABI1|||||||Japan:Maebashi||clinical strain||blaIMP-11|metallo-beta-lactamase IMP-11|738|AB074436|unpubl
>gi|34482004 |[Vibrio cholerae O1]||||||||||||aadA1|aminoglycoside adenyltransferase|792|AB107663|unpubl
>gi|38524487 |[Vibrio cholerae non-O1/non-O139]|||||Non O1/Non O139|||||||dfrA15|dihydrofolate reductase|474|AB113114|unpubl
>gi|51699488 |[Escherichia coli]|D49||feces|||||China: Guangzhou||patient||aadA1|aminoglycoside adenylyltransferase|792|AB189263|Su_J.|FEMSMicrobiol.Lett.254_75-80_2006|16451182
...

6.5.2.2 Fix the headers

# 1. change >gi|17026026 |[Acinetobacter baumannii]|ABI1... ==> 
#             >gnl|casseteDB|17026026 |[Acinetobacter baumannii]|ABI1...
# and 
# 2. simplify the header a bit, to avoid the many empty fields

grep '>' integron_cassettes4blastdb.faa | head -1 | sed 's#|#\n#g' | cat -n

     1  >gi
     2  17026024 
     3  [Achromobacter xylosoxidans]
     4  AXI2
     5  
     6  
     7  
     8  
     9  
    10  
    11  Japan:Maebashi
    12  
    13  clinical strain; synonym:Alcaligenes xylosoxidans
    14  
    15  blaIMP-10
    16  metallo-beta-lactamase IMP-10
    17  741
    18  AB074435
    19  Iyobe_S.
    20  Antimicrob.AgentsChemother.46_2014-2016_2002
    21  12019129

# use an AWK one-liner to do the complete job. Use the list of |-delimited fields listed above to select some of the most relevant ones
# NOTE that we add an extra field in the following transformation: >gi|17026026 => >gnl|casseteDB|17026026; 
#  therefore we need to add one to the field numbers listed above in order to get the desired ones
awk 'BEGIN{FS=OFS="|"}/>/{sub(/^>gi/, ">gnl|casseteDB"); print $1,$2,$3,$4,$16,$17,$18,$19,$22}!/>/{print $1}' integron_cassettes4blastdb.faa | grep '^>' | head -5

>gnl|casseteDB|17026024 |[Achromobacter xylosoxidans]|blaIMP-10|metallo-beta-lactamase IMP-10|741|AB074435|12019129
>gnl|casseteDB|17026026 |[Acinetobacter baumannii]|blaIMP-11|metallo-beta-lactamase IMP-11|738|AB074436|
>gnl|casseteDB|34482004 |[Vibrio cholerae O1]|aadA1|aminoglycoside adenyltransferase|792|AB107663|
>gnl|casseteDB|38524487 |[Vibrio cholerae non-O1/non-O139]|dfrA15|dihydrofolate reductase|474|AB113114|
>gnl|casseteDB|51699488 |[Escherichia coli]|aadA1|aminoglycoside adenylyltransferase|792|AB189263|16451182

# save the edited faa file as integron_cassettes4blastdb.faaED
awk 'BEGIN{FS=OFS="|"}/>/{sub(/^>gi/, ">gnl|casseteDB"); print $1,$2,$3,$4,$16,$17,$18,$19,$22}!/>/{print $1}' integron_cassettes4blastdb.faa > integron_cassettes4blastdb.faaED

# lets check the result a final time
grep '^>' integron_cassettes4blastdb.faaED | head -20

6.5.3 Formatting a protein database with makeblastdb

# 2) format DB with makeblastdb
# formatdb -i integron_cassettes4blastdb.faaED -p T -o T -n integron_cassetteDB # legacy blast
makeblastdb -in integron_cassettes4blastdb.faaED -dbtype prot -parse_seqids

Building a new DB, current time: 10/18/2022 19:04:31
New DB name:   /home/vinuesa/cursos/intro_biocomp/sesion2_blast/BLAST_DB_AA/integron_cassettes4blastdb.faaED
New DB title:  integron_cassettes4blastdb.faaED
Sequence type: Protein
Deleted existing Protein BLAST database named /home/vinuesa/cursos/intro_biocomp/sesion2_blast/BLAST_DB_AA/integron_cassettes4blastdb.faaED
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 2297 sequences in 0.490659 seconds.

# List the output generated by makeblastdb, sorted by modification time in reverse manner
ls -ltr

-rw-r--r--. 1 vinuesa faculty   4505 ago 31  2011 3cass_amplicons.fna
-rw-r--r--. 1 vinuesa faculty 924971 ago 31  2011 integron_cassettes4blastdb.faa
-rw-r--r--. 1 vinuesa faculty   1129 jul 26  2019 unknown_prot.faa
-rw-r--r--. 1 vinuesa faculty 751713 oct 10 20:15 gene_discovery_and_annotation_using_blastx.tgz
-rw-r--r--. 1 vinuesa faculty 803081 oct 18 19:03 integron_cassettes4blastdb.faaED
-rw-r--r--. 1 vinuesa faculty 566907 oct 18 19:04 integron_cassettes4blastdb.faaED.psq
-rw-r--r--. 1 vinuesa faculty   9220 oct 18 19:04 integron_cassettes4blastdb.faaED.pog
-rw-r--r--. 1 vinuesa faculty  18520 oct 18 19:04 integron_cassettes4blastdb.faaED.pin
-rw-r--r--. 1 vinuesa faculty 329297 oct 18 19:04 integron_cassettes4blastdb.faaED.phr
-rw-r--r--. 1 vinuesa faculty   9192 oct 18 19:04 integron_cassettes4blastdb.faaED.pto
-rw-r--r--. 1 vinuesa faculty  27572 oct 18 19:04 integron_cassettes4blastdb.faaED.pot
-rw-r--r--. 1 vinuesa faculty  16384 oct 18 19:04 integron_cassettes4blastdb.faaED.ptf
-rw-r--r--. 1 vinuesa faculty  63027 oct 18 19:04 integron_cassettes4blastdb.faaED.pos
-rw-r--r--. 1 vinuesa faculty 188416 oct 18 19:04 integron_cassettes4blastdb.faaED.pdb

6.5.4 Running $blastp$ to annotate proteins

The aim of the exercise is to assign functions or annotations to the problem proteins found in file unknown_prot.faa

6.5.4.1 Explore the unknown_prot.faa file with the query sequences

# How many proteins sequences are stored in unknown_prot.faa
grep -c '^>' unknown_prot.faa

cat unknown_prot.faa

# How do we run blast? => explore the program's help output
blastp -h     # short format
blastp -help  # long  format

6.5.4.2 Runing a standard blastp search with tabular output (-outfmt 6)

# launch a standard blastp search, asking the program to provide a tabular format for the top 10 hits
blastp -query unknown_prot.faa -db integron_cassettes4blastdb.faaED -outfmt 6 -num_alignments 10

6.5.4.3 Improving the tabular output with custom fields

#>>> Lets add some interesting information to the table, missing in the standard -outfmt 6 output
# run blastp using outfmt 6 and passing a user-specified formatting string
blastp -query unknown_prot.faa -db integron_cassettes4blastdb.faaED -outfmt '6 qseqid sseqid qlen slen qstart qend sstart send length pident positive mismatch gaps evalue bitscore qcovs' -num_alignments 10 > blastp_10hits.out

# prepare a header file with the field names for easier interpretation of the blastp ouptput table
echo -e 'qseqid\tsseqid\tqlen\tslen\tqstart\tqend\tsstart\tsend\tlength\tpident\tpositive\tmismatch\tgaps\tevalue\tbitscore\tqcovs' > header1.tab
cat header1.tab

# add the header to the output table
cat header1.tab blastp_10hits.out > blastp_10hits.tsv
column -t blastp_10hits.tsv | head -20

qseqid                                sseqid                   qlen  slen  qstart  qend  sstart  send  length  pident   positive  mismatch  gaps  evalue     bitscore  qcovs
gi|11037699|gb|AAG27703.1|AF300454_1  gnl|casseteDB|87128428   266   267   1       266   1       266   266     100.000  266       0         0     0.0        537       100
gi|11037699|gb|AAG27703.1|AF300454_1  gnl|casseteDB|145321080  266   267   1       266   1       266   266     99.624   266       1         0     0.0        536       100
gi|11037699|gb|AAG27703.1|AF300454_1  gnl|casseteDB|44844209   266   267   1       266   1       266   266     99.624   266       1         0     0.0        536       100
gi|11037699|gb|AAG27703.1|AF300454_1  gnl|casseteDB|102231612  266   267   1       266   1       266   266     99.248   266       2         0     0.0        535       100
gi|11037699|gb|AAG27703.1|AF300454_1  gnl|casseteDB|38492196   266   267   1       266   1       266   266     90.226   256       26        0     0.0        497       100
gi|11037699|gb|AAG27703.1|AF300454_1  gnl|casseteDB|118402723  266   267   1       266   1       266   266     89.850   255       27        0     0.0        494       100
gi|11037699|gb|AAG27703.1|AF300454_1  gnl|casseteDB|66277418   266   267   1       266   1       266   266     89.098   252       29        0     3.06e-179  489       100
gi|11037699|gb|AAG27703.1|AF300454_1  gnl|casseteDB|148455769  266   267   1       266   1       266   266     87.594   248       33        0     1.83e-176  483       100
gi|11037699|gb|AAG27703.1|AF300454_1  gnl|casseteDB|30014072   266   266   1       266   1       265   266     73.684   223       69        1     6.02e-146  405       100
gi|11037699|gb|AAG27703.1|AF300454_1  gnl|casseteDB|17026024   266   247   43      252   24      227   211     33.175   120       133       8     1.82e-37   128       79
gi|9294830|gb|AAF86698.1|AF180959_1   gnl|casseteDB|56691997   390   387   5       390   1       386   386     99.482   384       2         0     0.0        789       99
gi|9294830|gb|AAF86698.1|AF180959_1   gnl|casseteDB|146151072  390   383   5       388   1       382   388     45.361   251       202       10    2.45e-114  335       98
gi|9294830|gb|AAF86698.1|AF180959_1   gnl|casseteDB|40287460   390   382   13      387   5       381   379     39.314   232       224       6     7.18e-96   287       96
gi|9294830|gb|AAF86698.1|AF180959_1   gnl|casseteDB|133905118  390   410   13      387   33      409   379     39.578   231       223       6     5.97e-95   286       96
gi|9294830|gb|AAF86698.1|AF180959_1   gnl|casseteDB|76057210   390   380   1       387   1       379   396     38.131   232       219       26    4.63e-88   267       99
gi|9294830|gb|AAF86698.1|AF180959_1   gnl|casseteDB|146151112  390   276   316     380   99      165   67      22.388   34        50        2     2.4        25.8      17
gi|9294830|gb|AAF86698.1|AF180959_1   gnl|casseteDB|133905089  390   126   4       37    2       35    34      38.235   22        21        0     5.8        23.9      9
gi|9294830|gb|AAF86698.1|AF180959_1   gnl|casseteDB|161087277  390   80    133     201   15      78    79      25.316   34        34        25    6.9        23.1      18
gi|9294830|gb|AAF86698.1|AF180959_1   gnl|casseteDB|133905132  390   101   255     287   71      98    33      36.364   18        16        5     7.4        23.1      8
gi|9294830|gb|AAF86698.1|AF180959_1   gnl|casseteDB|133905029  390   337   100     154   247     287   55      21.818   24        29        14    7.8        24.3      14

6.5.5 Parsing the output table to select only high confidence annotations

As in the previous exercise on 16S rDNA sequence classification, for a confident annotation of a protein it is important to consider some basic alignment statistics.

In our particular case, since the query proteins correspond to gene cassettes encoding for antibiotic resistance, we are interested in identifying proteins down to the variant level. Therefore, we will impose very stringent filtering criteria to keep only hits with high pident and qcov.

6.5.5.1 Generate a numbered list of header fields for convenient referencing

# Print numbered list of fasta header fields to filter hits with pident >= 95% and qcovs >= 97%
sed 's/\t/\n/g' header1.tab | cat -n

     1  qseqid
     2  sseqid
     3  qlen
     4  slen
     5  qstart
     6  qend
     7  sstart
     8  send
     9  length
    10  pident
    11  positive
    12  mismatch
    13  gaps
    15  bitscore
    16  qcovs

6.5.5.2 Filter the table using AWK

# Filter out hits with >= 95% pident && qcovs >= 97%
awk 'BEGIN{FS=OFS="\t"} $10 >= 95 && $16 >= 97' blastp_10hits.tsv | column -t

qseqid                                sseqid                   qlen  slen  qstart  qend  sstart  send  length  pident   positive  mismatch  gaps  evalue  bitscore  qcovs
gi|11037699|gb|AAG27703.1|AF300454_1  gnl|casseteDB|87128428   266   267   1       266   1       266   266     100.000  266       0         0     0.0     537       100
gi|11037699|gb|AAG27703.1|AF300454_1  gnl|casseteDB|145321080  266   267   1       266   1       266   266     99.624   266       1         0     0.0     536       100
gi|11037699|gb|AAG27703.1|AF300454_1  gnl|casseteDB|44844209   266   267   1       266   1       266   266     99.624   266       1         0     0.0     536       100
gi|11037699|gb|AAG27703.1|AF300454_1  gnl|casseteDB|102231612  266   267   1       266   1       266   266     99.248   266       2         0     0.0     535       100
gi|9294830|gb|AAF86698.1|AF180959_1   gnl|casseteDB|56691997   390   387   5       390   1       386   386     99.482   384       2         0     0.0     789       99

6.5.5.3 Add a header line to name the fields and save a filtered summary table to file

# add a subject title (stitle), to get the description line of the proteins we are finding as top hits
echo -e 'qseqid\tsseqid\tqlen\tslen\tstitle\tlength\tpident\tpositive\tmismatch\tgaps\tevalue\tbitscore\tqcovs' > header1.tab

blastp -query unknown_prot.faa -db integron_cassettes4blastdb.faaED -outfmt '6 qseqid sseqid qlen slen stitle length pident positive mismatch gaps evalue bitscore qcovs' -num_alignments 10 > blastp_10hits.out

cat header1.tab blastp_10hits.out > blastp_10hits.tsv
cut -f1,5,7,13 blastp_10hits.tsv | head -20 | column -ts $'\t'

qseqid                                stitle                                                                                                                      pident   qcovs
gi|11037699|gb|AAG27703.1|AF300454_1  |[Pseudomonas aeruginosa]|blaVIM-3|VIM-3 metallo-beta-lactamase|801|DQ342344|                                               100.000  100
gi|11037699|gb|AAG27703.1|AF300454_1  |[Pseudomonas putida]|blaVIM-6|metallo-beta-lactamase VIM-6|801|EF522838|18757180                                           99.624   100
gi|11037699|gb|AAG27703.1|AF300454_1  |[Pseudomonas aeruginosa]|blaVIM|class B beta-lactamase|801|AJ628983|                                                       99.624   100
gi|11037699|gb|AAG27703.1|AF300454_1  |[Pseudomonas aeruginosa]|blaVIM-2|metallo-beta-lactamase VIM-2|801|DQ522233|                                               99.248   100
gi|11037699|gb|AAG27703.1|AF300454_1  |[Pseudomonas aeruginosa]|blaVIM-4|metallo-beta-lactamase VIM-4|801|AY460181|                                               90.226   100
gi|11037699|gb|AAG27703.1|AF300454_1  |[Pseudomonas aeruginosa]|blaVIM-1|VIM-1 metallo-beta-lactamase|801|AJ969237|                                               89.850   100
gi|11037699|gb|AAG27703.1|AF300454_1  |[Enterobacter cloacae]|blaVIM-5|metallo-beta-lactamase VIM-5|801|DQ023222|16189133                                         89.098   100
gi|11037699|gb|AAG27703.1|AF300454_1  |[Pseudomonas aeruginosa]|blaVIM-13|class B carbapenemase VIM-13|801|EF577407|18644957                                      87.594   100
gi|11037699|gb|AAG27703.1|AF300454_1  |[Pseudomonas aeruginosa]|blaVIM-7|metallo-b-lactamase|798|AJ536835|14693560                                                73.684   100
gi|11037699|gb|AAG27703.1|AF300454_1  |[Achromobacter xylosoxidans]|blaIMP-10|metallo-beta-lactamase IMP-10|741|AB074435|12019129                                 33.175   79
gi|9294830|gb|AAF86698.1|AF180959_1   |[Klebsiella pneumoniae]|acc-1|beta-lactamase|1161|AJ870923|16982793                                                        99.482   99
gi|9294830|gb|AAF86698.1|AF180959_1   |[Klebsiella pneumoniae]||cmy-8|1149|EF382672|17526756                                                                      45.361   98
gi|9294830|gb|AAF86698.1|AF180959_1   |[Escherichia coli]|cmy-13|CMY-13|1146|AY339625|16048979                                                                    39.314   96
gi|9294830|gb|AAF86698.1|AF180959_1   |[Salmonella enterica subsp. enterica serovar Newport str. SL254]|blaCMY-2-1|CMY-2 beta-lactamase 1|1230|CP000604|17375195  39.578   96
gi|9294830|gb|AAF86698.1|AF180959_1   |[Klebsiella pneumoniae]|ampC|beta-lactamase class C DHA-1|1140|AJ971343|16436717                                           38.131   99
gi|9294830|gb|AAF86698.1|AF180959_1   |[Klebsiella pneumoniae]||sfpB|828|EF382672|17526756                                                                        22.388   17
gi|9294830|gb|AAF86698.1|AF180959_1   |[Salmonella enterica subsp. enterica serovar Newport str. SL254]||hypothetical protein|378|CP000604|17375195               38.235   9
gi|9294830|gb|AAF86698.1|AF180959_1   |[Salmonella enterica subsp. enterica serovar Choleraesuis]|orf20|Orf20|240|EU219534|                                       25.316   18
gi|9294830|gb|AAF86698.1|AF180959_1   |[Salmonella enterica subsp. enterica serovar Newport str. SL254]||conserved hypothetical protein|303|CP000604|17375195     36.364   8

What can we learn from the output above? Examine for example the first two hits.

6.5.5.4 Annotate the query proteins using the single best hit and generate a simplified overview table of the top-scoring results

#>>> retrieve the best hit found for each input gene
blastp -query unknown_prot.faa -db integron_cassettes4blastdb.faaED -outfmt '6 qseqid sseqid qlen slen stitle length pident positive mismatch gaps evalue bitscore qcovs' -num_alignments 1 > blastp_tophits.out

cat header1.tab blastp_tophits.out > blastp_tophits.tsv
cut -f1,5,7,13 blastp_tophits.tsv | head | column -ts $'\t'

qseqid                                stitle                                                                                        pident   qcovs
gi|11037699|gb|AAG27703.1|AF300454_1  |[Pseudomonas aeruginosa]|blaVIM-3|VIM-3 metallo-beta-lactamase|801|DQ342344|                 100.000  100
gi|9294830|gb|AAF86698.1|AF180959_1   |[Klebsiella pneumoniae]|acc-1|beta-lactamase|1161|AJ870923|16982793                          99.482   99
gi|15705413|gb|AAL05630.1|AF395881_1  |[Klebsiella pneumoniae]|bla|extended-spectrum beta-lactamase CTX-M-19|876|AF458080|12936998  52.158   95

6.5.6 Retrieve the top-scoring hits from the database using blastdbcmd and a list of database identifiers (qseqid)

We can easily retrieve the top-scoring hits from the database using blastdbcmd using a list of qseqids, as shown below.

# get the sseqids only for the top hits with % sequence identity >= 95 and query coverage >= 97%
awk 'BEGIN{FS=OFS="\t"}NR > 1 && $7 >= 95 && $13 >= 97{print $2}' blastp_tophits.tsv > sseqids_for_top_hits_with_ge95pident_ge97qcovs.list

# finally retrieve the sequences for the sseqids_for_top_hits_with_ge95pident_ge97qcovs.list from the integron_cassettes4blastdb
blastdbcmd -db integron_cassettes4blastdb.faaED -entry_batch sseqids_for_top_hits_with_ge95pident_ge97qcovs.list -dbtype prot -out top_hits_with_ge95pident_ge97qcovs.faa

cat top_hits_with_ge95pident_ge97qcovs.faa


>casseteDB:87128428 |[Pseudomonas aeruginosa]|blaVIM-3|VIM-3 metallo-beta-lactamase|801|DQ342344|
MFKLLSKLLVYLTASIMAIASPLAFSVDSSGEYPTVSEIPVGEVRLYQIADGVWSHIATKSFDGAVYPSNGLIVRDGDEL
LLIDTAWGAKNTAALLAEIEKQIGLPVTRAVSTHFHDDRVGGVDVLRAAGVATYASPSTRRLAEVEGSEIPTHSLEGLSS
SGDAVRFGPVELFYPGAAHSTDNLVVYVPSASVLYGGCAIYELSRTSAGNVADADLAEWPTSIERIQQHYPEAQFVIPGH
GLPGGLDLLKHTTNVVKAHTNRSVVE*
>casseteDB:56691997 |[Klebsiella pneumoniae]|acc-1|beta-lactamase|1161|AJ870923|16982793
MQNTLKLLSVITCLAATVQGALAANIDESKIKDTVDDLIQPLMQKNNIPGMSVAVTVNGKNYIYNYGLAAKQPQQPVTEN
TLFEVGSLSKTFAATLASYAQVSGKLSLDQSVSHYVPELRGSSFDHVSVLNVGTHTSGLQLFMPEDIKNTTQLMAYLKAW
KPADAAGTHRVYSNIGTGLLGMIAAKSLGVSYEDAIEKTLLPQLGMHHSYLKVPADQMENYAWGYNKKDEPVHVNMEILG
NEAYGIKTTSSDLLRYVQANMGQLKLDANAKMQQALTATHTGYFKSGEITQDLMWEQLPYPVSLPNLLTGNDMAMTKSVA
TPIVPPLPPQENVWINKTGSTNGFGAYIAFVPAKKMGIVMLANKNYSIDQRVTVAYKILSSLEGNK*

6.5.7 A simple strategy to conveniently generate a custom BLAST table header based on the tabular output with comments (-outfmt 7), making use of the Shell’s process substitution syntax

If you read the detailed documentation of $blastp$ blastp -help, you will find under the output section that -outfmt 7 is a tabular with comments.

Here a few lines of code to generate a standard tabular output with a customized header, parsing the -outfmt 7 output. Note the use of cmd <(command pipeline), a syntax called process substitution, which allows to use the ouptut of $command\ pipeline$ as if it were a file for input into $cmd$. This is convenient, as it avoids the need of writing an intermediate file.

# 1. Run blast[p] with -outfmt 7 and the desired output fields
blastp -query unknown_prot.faa -db integron_cassettes4blastdb.faaED -outfmt '7 qseqid sseqid qlen slen stitle length pident positive mismatch gaps evalue bitscore qcovs' -num_alignments 10 > blast_m7.tmp

# 2. use process substitution to parse the -outfmt 7 output file blast_m7.tmp, and generate the standard tabular optput with labeled columns
cat <(awk '/^# Fields:/ {sub(/# Fields: /, ""); gsub(/, /, "\t"); gsub(/ /, "_"); print $0; exit}' blast_m7.tmp) <(awk '!/^# /{print $0 }' blast_m7.tmp) | less

6.5.8 HOMEWORK 6

Based on the discussion on the blastp output in the previous section (7.2.4 onwards), what can you conclude about the classification of the third query sequence?
Run a blastp search on the NCBI portal against the nr database, restricting the number of hits to 10.
- Based on these results, how would you classify protein AAL05630.1?
- Explore the different output formats and data download options offered by the server
- Upload the “Hit Table (text)” returned by the NCBI blast server to moodle
- Convert the “Descriptions Table (CSV)” to TSV, skipping the last field (with the url) and upload both files to moodle, including the code line you used for the table cleanup operation
Based on the tabular output field formatting options we’ve learned about in this section, simplify the blastn call from exercise 6.1.4 to generate a table with the following fields: qseqid stitle qlen slen length pident
Repeat the blastp analysis using protein AAL05630.1 as query to interrogate the swissprot and refseq_protein databases (with 10 hits each), which you can access from buluc or tepeu at /export/space2/data/blast/db
- upload the corresponding results tables with the ‘qseqid stitle qlen slen length pident’ fields
- based on the evidence of these blastp results, what would be the best annotation for AAL05630.1?

7 Translating BLAST ($BLASTX$) - translate a DNA sequence to search a protein database

The aim here is to identify the possible gene-cassettes encoded in the PCR amplicons collected in 3cass_amplicons.fna, as explained in the previous background info section.

7.1 EXERCISE 5: Identification of CDSs in DNA sequences using $BLASTX$

BLASTX is generally used to find CDSs (protein-coding sequences) in genomic DNA or transcripts. As we will see, BLASTX is a powerful gene-finding tool, but we will need to examine many alignments to ensure that all CDSs encoded on a given DNA segment are found. For large DNA strings (>5-10 kb), I would recommend to segment the string into shorter, overlapping pieces, before running BLASTX

Remember that BLASTX uses a DNA query, which is translated in its 6 frames using by default the universal code, and the resulting products are used to interrogate a protein database.

7.1.1 Running blastx with additional fields in the output table

# run blastx asking for the top 3 hits in tabular output format and an evalue cutoff set at 1e-100
#  adding relevant information to the table, missing in the standard -m 6 output, like qlen, selen, qframe & qcovs
echo -e 'qseqid\tsseqid\tqlen\tslen\tqstart\tqend\tqframe\tsstart\tsend\tlength\tpident\tpositive\tmismatch\tgaps\tevalue\tbitscore\tqcovs' > header1.tab
blastx -query 3cass_amplicons.fna -query_gencode 11 -db integron_cassettes4blastdb.faaED -evalue 1e-100 -outfmt '6 qseqid sseqid qlen slen qstart qend qframe sstart send length pident positive mismatch gaps evalue bitscore qcovs' -num_alignments 3 -out 3blastX_hits_m6_header1.tab

# if 3blastX_hits_m6_header1.tab exists and has a non-zero size, then execute the following commands ...
[ -s 3blastX_hits_m6_header1.tab ] && cat header1.tab 3blastX_hits_m6_header1.tab > ed && mv ed 3blastX_hits_m6_header1.tab

head 3blastX_hits_m6_header1.tab | column -t

qseqid  sseqid                   qlen  slen  qstart  qend  qframe  sstart  send  length  pident  positive  mismatch  gaps  evalue     bitscore  qcovs
Am1     gnl|casseteDB|146151084  1706  273   887     1699  2       2       273   272     98.897  270       2         1     0.0        536       48
Am1     gnl|casseteDB|157674384  1706  270   899     1699  2       4       270   267     99.625  266       1         0     0.0        533       47
Am1     gnl|casseteDB|161087337  1706  268   899     1699  2       2       268   267     99.625  266       1         0     0.0        533       47
Ap18    gnl|casseteDB|90265366   1762  277   39      836   3       11      276   266     95.865  258       11        0     0.0        508       45
Ap18    gnl|casseteDB|84095097   1762  285   39      836   3       19      284   266     95.489  258       12        0     3.97e-180  506       45
Ap18    gnl|casseteDB|156628012  1762  285   39      836   3       19      284   266     95.113  257       13        0     2.43e-179  504       45
Ap26    gnl|casseteDB|156628012  945   285   16      870   1       1       285   285     99.649  285       1         0     0.0        573       90
Ap26    gnl|casseteDB|133905080  945   336   1       870   1       47      336   290     97.241  285       8         0     0.0        572       92
Ap26    gnl|casseteDB|84095097   945   285   16      870   1       1       285   285     99.649  284       1         0     0.0        572       90

Explore in detail the results in the qlen, slen, qstart, qend, length and qcovs columns
What do you notice regarding the hits of Am1 and Ap18?
In what important aspect do the former two sequence hits differ from those retrieved for Ap26?

7.1.2 Many hits are required to get HSPs aligning over the entire DNA querey sequence and allow the identification of smaller CDSs

As we saw in the output from the previous $blastx$ run, the hits for each query sequence align only to a certain portion of the latter. We will need to substantially increase the number of hits to have a chance to get lower-scoring HSPs that align to other regions (smaller genes) of the DNA sequence used as query.

7.1.2.1 Increasing the number of alignments to 100 and adding the stitle field to the output table

#>>> Lets try to improve the analysis by increasing the no. of hits for each sequence to 100
#    and adding the stitle column to the output table
echo -e 'qseqid\tsseqid\tstitle\tqlen\tslen\tqstart\tqend\tsstart\tsend\tlength\tpident\tpositive\tmismatch\tgaps\tevalue\tbitscore\tqcovs\tqframe' > header2.tab

#>>> print a numvered list of the positional indexes of each column
sed 's/\t/\n/g' header2.tab | nl

     1  qseqid
     2  sseqid
     3  stitle
     4  qlen
     5  slen
     6  qstart
     7  qend
     8  sstart
     9  send
    10  length
    11  pident
    12  positive
    13  mismatch
    14  gaps
    15  evalue
    16  bitscore
    17  qcovs
    18  qframe

blastx -query 3cass_amplicons.fna -query_gencode 11 -db integron_cassettes4blastdb.faaED -evalue 1e-10 -outfmt '6 qseqid sseqid stitle qlen slen qstart qend sstart send length pident positive mismatch gaps evalue bitscore qcovs qframe' -num_alignments 100 -out 100blastX_hits_m6_header2.tab

[ -s 100blastX_hits_m6_header2.tab ] && cat header2.tab 100blastX_hits_m6_header2.tab > ed && mv ed 100blastX_hits_m6_header2.tab

7.1.2.2 Print a table containing only selected fields

# print relevant table fields
cut -f1,2,6,7,11,17,18 100blastX_hits_m6_header2.tab | column -t

qseqid  sseqid                   qstart  qend  pident   qcovs  qframe
Am1     gnl|casseteDB|146151084  887     1699  98.897   48     2
Am1     gnl|casseteDB|157674384  899     1699  99.625   47     2
Am1     gnl|casseteDB|161087337  899     1699  99.625   47     2
Am1     gnl|casseteDB|10444107   908     1699  100.000  46     2
Am1     gnl|casseteDB|21898659   908     1699  99.621   46     2
...
Am1     gnl|casseteDB|38491398   908     1693  72.137   46     2
Am1     gnl|casseteDB|45502091   908     1516  87.685   36     2
Am1     gnl|casseteDB|10444105   3       500   100.000  29     3
Am1     gnl|casseteDB|115605632  3       500   99.398   29     3
Am1     gnl|casseteDB|48526059   3       500   99.398   29     3
Am1     gnl|casseteDB|56541576   3       500   98.795   29     3
Am1     gnl|casseteDB|110084170  899     1678  60.769   46     2
Am1     gnl|casseteDB|161087327  3       488   100.000  28     3
Am1     gnl|casseteDB|110588863  908     1678  60.311   45     2
...
Am1     gnl|casseteDB|10444106   612     902   98.969   17     3
Am1     gnl|casseteDB|6003516    612     902   97.938   17     3
Am1     gnl|casseteDB|56541577   612     902   97.938   17     3
Am1     gnl|casseteDB|21898658   612     902   94.845   17     3
Am1     gnl|casseteDB|14091025   588     902   78.095   18     3
Am1     gnl|casseteDB|110588887  612     902   78.351   17     3
Am1     gnl|casseteDB|38492198   588     887   77.000   18     3
Am1     gnl|casseteDB|134035531  612     887   76.087   16     3
Am1     gnl|casseteDB|82698267   612     875   76.136   15     3
Am1     gnl|casseteDB|82698261   612     902   76.289   17     3
Am1     gnl|casseteDB|111661582  18      488   36.709   28     3
Am1     gnl|casseteDB|164449571  18      488   38.608   28     3
...

What can we conclude from the selected output lines shown above for Am1 hits?

7.1.2.3 Generate a list of non-redundant qstart & qend positions for each query sequence

#>>> This 1liner generates a non-redundant list of the qstart qend positions for each of the query sequences. 
#    These are extracted from the full table with $(cut -f1 100blastX_hits_m6_header2.tab | grep -v qseqid | sort -u)
#    Then, for each query, grep it out of the table with grep $query 100blastX_hits_m6_header2.tab and keep only the fields 6,7: qstart, qend
#    We separte and label the results corresponding to each query sequence using the echo "#>>> $query"; as header 
#    End the corresponding rerport with echo '------------------------';
for query in $(cut -f1 100blastX_hits_m6_header2.tab | sed '1d' | sort -u)
do 
     echo "#>>> $query"
     grep $query 100blastX_hits_m6_header2.tab | cut -f6,7 | sort -u
     echo '------------------------'
done

##### OUTPUT of the for loop from the previous block
#>>> Am1
1361    1699
15  488
18  488
33  488
3   488
3   500
588 887
588 902
612 875
612 887
612 902
887 1699
899 1678
899 1699
908 1303
908 1516
908 1678
908 1684
908 1693
908 1696
908 1699
920 1699
------------------------
#>>> Ap18
1046    1762
1052    1762
1079    1528
1079    1702
1085    1738
1088    1738
1118    1738
1130    1738
1139    1738
36  836
39  812
39  836
48  173
48  443
48  656
48  797
48  812
48  830
48  836
48  860
501 836
60  836
------------------------
#>>> Ap26
1   192
16  867
16  870
1   870
532 870
58  870
61  870
70  870
79  204
79  474
79  687
79  861
79  864
79  867
79  870
91  870
------------------------

Focus on the Am1 query sequence: How many genes may it encode for?

# Identify the most likely gene start|end positions
grep Am1 100blastX_hits_m6_header2.tab | cut -f6,7 | sort | uniq -c | sort -nk2

      1 3   488
      4 3   500
      2 15  488
     13 18  488
      2 33  488
      1 588 887
      1 588 902
      1 612 875
      1 612 887
      6 612 902
      1 887 1699
      1 899 1678
      6 899 1699
      1 908 1303
      1 908 1516
      2 908 1696
     35 908 1699
      3 908 1693
      4 908 1684
      6 908 1678
      7 920 1699
      1 1361    1699

Ahaaaa! What can you learn from this output?
looks like if there are several genes encoded in each query sequence, particularly in the larger ones like Am1 (and Ap18, if you repeat the exercise for it)

7.1.2.4 Detailed analysis of the candidate CDSs/genes identified

#>>> Lets explore this in more detail. We'll start with the Am1 query
#    One class of hits seems to start at position 3, ending at around position 500
#    What are the hits? Lets grep them out of the table; 
#    note the use of a regular expression '3[[:space:]]500' which denotes a 3 followed by a space or tab, 
#    followed by a 500 (i.e. our qstart qend coordinates)
grep Am1 100blastX_hits_m6_header2.tab | grep '3[[:space:]]500' | cut -f1,3,6,7,11,17,18 | column -ts $'\t'

# Or somewhat more flexibly, using AWK, which will allow us to include the header in the output, as shown below
awk 'BEGIN{FS=OFS="\t"}NR==1 || (/^Am1/ && $6 == 3 && $7 == 500){print $1,$3,$6,$7,$17,$18}' Am1 100blastX_hits_m6_header2.tab | column -ts $'\t'

qseqid  stitle                                                                                       qstart  qend  qcovs  qframe
Am1     |[Serratia marcescens]|dhfrXII|dihydrofolate reductase type XII|498|AF284063|                3       500   29     3
Am1     |[Acinetobacter baumannii]|dhrfXII|dihydrofolate reductase|498|DQ995286|19365688             3       500   29     3
Am1     |[Salmonella enterica subsp. enterica serovar Choleraesuis]|dhfr|Dhfr|498|AY551331|15145503  3       500   29     3
Am1     |[Enterococcus faecalis]|dhfrXII|dihydrofolate reductase|498|AB196348|16451182               3       500   29     3

7.1.2.5 Retrieve the best hit found using blastdbcmd


#>>> Retrieve the best hit gnl|casseteDB|10444105 found by the search in the previous step
blastdbcmd -db integron_cassettes4blastdb.faaED -entry 'gnl|casseteDB|10444105' -dbtype prot -out S_marcescens_dhrXII_casseteDB-10444105.faa

#>>> Retrieve all hits found by the search in the previous step for the Am1 gene starting at pos 899 and ending at pos 1699
grep Am1 100blastX_hits_m6_header2.tab | grep '899[[:space:]]1699'
grep Am1 100blastX_hits_m6_header2.tab | grep '899[[:space:]]1699' | cut -f2 > Am1_qs899_qe1699_hit_IDs.list
blastdbcmd -db integron_cassettes4blastdb.faaED -entry_batch Am1_qs899_qe1699_hit_IDs.list -dbtype prot -out Am1_qs899_qe1699_AadA_best_hits.faa

7.1.3 Graphical exploration of standard -outfmt 6 output using the with blast-imager.pl script

As we’ve learned in the example above, we will need to explore many alignments from a BLASTX search in order to make sure that we identify all CDSs on a DNA string. The blast-imager.pl script is a very convenient tool to visualize results, when we have large lists of hits matching on different parts of our query sequence, exactly as in our case.

To run the script, we will first split the multifasta of query sequences into individual sequences, to generate graphs the the hit table of individual queries.

7.1.3.1 Copy or fetch the scripts and place them in your $HOME/bin directory


# 1. If workin on tepeu or buluc, copy required scripts to your $HOME/bin directory
[ ! -d $HOME/bin ] && mkdir $HOME/bin
cp /home/vinuesa/cursos/intro_biocomp/scripts/*.* $HOME/bin


# 2. if you don't have access to the servers @ LCG, fetch the scripts directly from the command line
wget -c https://raw.githubusercontent.com/vinuesa/TIB-filoinfo/master/blast-imager.pl
wget -c https://raw.githubusercontent.com/vinuesa/TIB-filoinfo/master/fasta_toolkit.awk
wget -c https://raw.githubusercontent.com/vinuesa/TIB-filoinfo/master/split_fasta.pl

chmod 755 blast-imager.pl fasta_toolkit.awk split_fasta.pl

7.1.3.2 Split multifasta into the individual source sequence files

# Split the multifasta into the individual source sequence files, 
# that is, generate single fasta files from the multifasta source file
split_fasta.pl 3cass_amplicons.fna
ls -ltr | tail -3

-rw-r--r--. 1 vinuesa faculty    967 oct 18 21:54 Ap26.fa
-rw-r--r--. 1 vinuesa faculty   1798 oct 18 21:54 Ap18.fa
-rw-r--r--. 1 vinuesa faculty   1740 oct 18 21:54 Am1.fa

7.1.3.3 Run $blastx$ within a for loop piping its output to blast-imager.pl

# Run the blast-imager.pl script
# i) first writing the tables to file
#for file in *fa
#do 
#   blastx -query $file -query_gencode 11 -db integron_cassettes4blastdb.faaED -evalue 1e-20 -outfmt 6 -num_alignments 100 -out ${file%.fa}_blastxout.tab; 
#   ./blast-imager.pl ${file%.fa}_blastxout.tab > ${file%.*}.png
#   rm ${file%.fa}_blastxout.tab
#done

# ii) or directly to the graph, by piping the blastx oputput directly into the blast-imager script like so:
#for file in *fa; do blastx -query $file -query_gencode 11 -db integron_cassettes4blastdb.faaED -evalue 1e-20 -outfmt 6 -num_alignments 100 | ../../blast-imager.pl > ${file%.*}.png; done

# Run again, imposing some filtering to remove the poorest hits: pident > 98%; mismatches < 4; gaps < 3 
# which will render a better figure
for file in *fa; do blastx -query $file -query_gencode 11 -db integron_cassettes4blastdb.faaED -evalue 1e-20 -outfmt 6 -num_alignments 100 | awk '$3 > 98 && $5 < 4 && $6 < 3' | blast-imager.pl > ${file%.*}.png; done

# Have a look at the results for the three query sequences using the "eye of GNOME - eog" 
eog Am1.png &
eog Ap18.png &
eog Ap26.png &

# or all at once:
eog *.png &

Figure 3. The graphical output generated by blast-imager.pl for the 100 Am1 blastx hits

7.1.3.4 Write a table with the likely gene coordinates for Am1

The graphical display shown above in Fig. 3 and the numeric coordinates extracted previously from the hit table, strongly suggest that there are 3 gene cassettes encoded in the amplicon generated for Am1.

      1 3   488
      4 3   500
      2 15  488
     13 18  488
      2 33  488
      1 588 887
      1 588 902
      1 612 875
      1 612 887
      6 612 902
      1 887 1699
      1 899 1678
      6 899 1699
      1 908 1303
      1 908 1516
      2 908 1696
     35 908 1699
      3 908 1693
      4 908 1684
      6 908 1678
      7 920 1699
      1 1361    1699

Looking closely at the coordinates and figure 3, we can write the following table with the likely gene coordinates

Table 1. Coordinates for gene cassettes encoded in the Am1 amplicon DNA sequence.

start	end	strand	note
3	500	+	-
612	902	+	The 887 and 899 starts would overlap with the first gene; this is not typical for gene cassettes
908	1699	+	-

7.1.3.5 Extract the individual cassette genes with $fasta\_toolkit.awk$ and check for possible start and end condons

We will use the fasta_toolkit.awk script to extract the three gene cassettes from the corresponding DNA string using the their coordinates, as defined in Table 1.

1st gene

fasta_toolkit.awk -R 3 -s 3 -e 500 Am1.fa

>Am1
ATGAACTCGGAATCAGTACGCATTTATCTCGTTGCTGCGATGGGAGCCAATCGGGTTATTGGCAATGGTCCTAATATCCCCTGGAAAATTCCGGGTGAGCAGAAGATTTTTCGCAGACTCACTGAGGGAAAAGTCGTTGTCATGGGGCGAAAGACCTTTGAGTCTATCGGCAAGCCTCTACCGAACCGTCACACATTGGTAATCTCACGCCAAGCTAACTACCGCGCCACTGGCTGCGTAGTTGTTTCAACGCTGTCGCACGCTATCGCTTTGGCATCCGAACTCGGCAATGAACTCTACGTCGCGGGCGGAGCTGAGATATACACTCTGGCACTACCTCACGCCCACGGCGTGTTTCTATCTGAGGTACATCAAACCTTCGAGGGTGACGCCTTCTTCCCAATGCTCAACGAAACAGAATTCGAGCTTGTCTCAACCGAAACCATTCAAGCTGTAATTCCGTACACCCACTCCGTTTATGCGCGTCGAAACGGCTAA

Lets save the three CDSs to file

# write first CDS to Am1_cassette_CDSs.fna, and append with 2cnd and 3rd to it using >>
fasta_toolkit.awk -R 3 -s 3 -e 500 Am1.fa > Am1_cassette_CDSs.fna
fasta_toolkit.awk -R 3 -s 612 -e 902 Am1.fa >> Am1_cassette_CDSs.fna
fasta_toolkit.awk -R 3 -s 908 -e 1699 Am1.fa >> Am1_cassette_CDSs.fna

We need to add consecutive ORF/gene names to distinguish them from one another

grep '^>' Am1_cassette_CDSs.fna

>Am1
>Am1
>Am1

perl -pe 'if(/^>(\S+)/){ $c++; $h=$1 . "_ORF" . $c; s/$1/$h/ }' Am1_cassette_CDSs.fna
perl -pe 'if(/^>(\S+)/){ $c++; $h=$1 . "_ORF" . $c; s/$1/$h/ }' Am1_cassette_CDSs.fna > ed && mv ed Am1_cassette_CDSs.fna

cat Am1_cassette_CDSs.fna

>Am1_ORF1
ATGAACTCGGAATCAGTACGCATTTATCTCGTTGCTGCGATGGGAGCCAATCGGGTTATTGGCAATGGTCCTAATATCCCCTGGAAAATTCCGGGTGAGCAGAAGATTTTTCGCAGACTCACTGAGGGAAAAGTCGTTGTCATGGGGCGAAAGACCTTTGAGTCTATCGGCAAGCCTCTACCGAACCGTCACACATTGGTAATCTCACGCCAAGCTAACTACCGCGCCACTGGCTGCGTAGTTGTTTCAACGCTGTCGCACGCTATCGCTTTGGCATCCGAACTCGGCAATGAACTCTACGTCGCGGGCGGAGCTGAGATATACACTCTGGCACTACCTCACGCCCACGGCGTGTTTCTATCTGAGGTACATCAAACCTTCGAGGGTGACGCCTTCTTCCCAATGCTCAACGAAACAGAATTCGAGCTTGTCTCAACCGAAACCATTCAAGCTGTAATTCCGTACACCCACTCCGTTTATGCGCGTCGAAACGGCTAA
>Am1_ORF2
ATGTTTATTCAAACGGCATTTAGCTTTTCAGGCGTTATTCAGTGCCTGTTTTGCCTTTTTTCCGGGCTTCGCCTGCATGGGCTGCGCAGGTTTTCAGTCTTTTTGGCCTCTAGCCCTTGCGTAGCAAGCGCAAGCAGCTATCGTTTTTGCAGTGCTGTGCCGCCTCGGTGGCGCAGCGTTTTTTCACGGTTAGCGCCCGTCGCCAAATTCAAGTTATCCGTTTTGGCTTCTGGGTCTAACATTTCGGTCAAGCCGACCGCCATTCTGCGGTCGGCTTACCTCGCCCGTTAG
>Am1_ORF3
ATGAGGGAAGCGGTGACCATCGAAATTTCGAACCAACTATCAGAGGTGCTAAGCGTCATTGAGCGCCATCTGGAATCAACGTTGCTGGCCGTGCATTTGTACGGCTCCGCAGTGGATGGCGGCCTGAAGCCATACAGCGATATTGATTTGTTGGTTACTGTGGCCGTAAAGCTTGATGAAACGACGCGGCGAGCATTGCTCAATGACCTTATGGAGGCTTCGGCTTTCCCTGGCGAGAGCGAGACGCTCCGCGCTATAGAAGTCACCCTTGTCGTGCATGACGACATCATCCCGTGGCGTTATCCGGCTAAGCGCGAGCTGCAATTTGGAGAATGGCAGCGCAATGACATTCTTGCGGGTATCTTCGAGCCAGCCATGATCGACATTGATCTAGCTATCCTGCTTACAAAAGCAAGAGAACATAGCGTTGCCTTGGTAGGTCCGGCAGCGGAGGAATTCTTTGACCCGGTTCCTGAACAGGATCTATTCGAGGCGCTGAGGGAAACCTTGAAGCTATGGAACTCGCAGCCCGACTGGGCCGGCGATGAGCGAAATGTAGTGCTTACGTTGTCCCGCATTTGGTACAGCGCAATAACCGGCAAAATCGCGCCGAAGGATGTCGCTGCCGACTGGGCAATAAAACGCCTACCTGCCCAGTATCAGCCCGTCTTACTTGAAGCTAAGCAAGCTTATCTGGGACAAAAAGAAGATCACTTGGCCTCACGCGCAGATCACTTGGAAGAATTTATTCGCTTTGTGAAAGGCGAGATCATCAAGTCAGTTGGTAAATGA

7.1.3.6 Translate the CDSs to their conceptual translation products using the universal genetic code

This operation can be easily performed with the help of fasta_toolkit.awk

fasta_toolkit.awk -R 4 Am1_cassette_CDSs.fna > Am1_cassette_CDSs_translated.faa

Visualize the resulting file

cat Am1_cassette_CDSs_translated.faa

>Am1_ORF1
MNSESVRIYLVAAMGANRVIGNGPNIPWKIPGEQKIFRRLTEGKVVVMGRKTFESIGKPLPNRHTLVISRQANYRATGCVVVSTLSHAIALASELGNELYVAGGAEIYTLALPHAHGVFLSEVHQTFEGDAFFPMLNETEFELVSTETIQAVIPYTHSVYARRNG*
>Am1_ORF2
MFIQTAFSFSGVIQCLFCLFSGLRLHGLRRFSVFLASSPCVASASSYRFCSAVPPRWRSVFSRLAPVAKFKLSVLASGSNISVKPTAILRSAYLAR*
>Am1_ORF3
MREAVTIEISNQLSEVLSVIERHLESTLLAVHLYGSAVDGGLKPYSDIDLLVTVAVKLDETTRRALLNDLMEASAFPGESETLRAIEVTLVVHDDIIPWRYPAKRELQFGEWQRNDILAGIFEPAMIDIDLAILLTKAREHSVALVGPAAEEFFDPVPEQDLFEALRETLKLWNSQPDWAGDERNVVLTLSRIWYSAITGKIAPKDVAADWAIKRLPAQYQPVLLEAKQAYLGQKEDHLASRADHLEEFIRFVKGEIIKSVGK*

7.2 Assign functional annotations to the 3 CDSs encoded in the Am1 class-I integron gene cassettes

We will run BLASTP a final time to get the best functional annotations for the three ORFs (gene cassettes) identified via BLASTX in the Am1 amplicon.

# 1. run blastp with the translated products and get 10 top hits
echo -e 'qseqid\tsseqid\tqlen\tslen\tstitle\tlength\tpident\tpositive\tmismatch\tgaps\tevalue\tbitscore\tqcovs' > header1.tab
blastp -query Am1_cassette_CDSs_translated.faa -db integron_cassettes4blastdb.faaED -outfmt '6 qseqid sseqid qlen slen stitle length pident positive mismatch gaps evalue bitscore qcovs' -num_alignments 10 > blastp_10hits.out

cat header1.tab blastp_10hits.out > ed && mv ed blastp_10hits.out

# 2. reduce the output table to relevant fields for easier visualization
sed 's/\t/\n/g' header1.tab | cat -n

     1  qseqid
     2  sseqid
     3  qlen
     4  slen
     5  stitle
     6  length
     7  pident
     8  positive
     9  mismatch
    10  gaps
    11  evalue
    12  bitscore
    13  qcovs

awk 'BEGIN{FS="\t"; OFS=","; print "qseqid,qlen,slen,stitle,pident,bitscore,qcovs"}{print $1,$3,$4,$5,$7,$12,$13}' blastp_10hits.out > Am1_top10_blastp_hits_for_3ORFs.csv
awk 'BEGIN{FS=OFS="\t"; print "qseqid\tqlen\tslen\tstitle\tpident\tbitscore\tqcovs"}{print $1,$3,$4,$5,$7,$12,$13}' blastp_10hits.out > Am1_top10_blastp_hits_for_3ORFs.tsv

qseqid,qlen,slen,stitle,pident,bitscore,qcovs
Am1_ORF1,166,166,|[Serratia marcescens]dhfrXII|dihydrofolate reductase type XII|498|AF284063|unpubl,100.000,340,100
Am1_ORF1,166,166,|[Acinetobacter baumannii]|ZJB04|||dhrfXII|dihydrofolate reductase|498|DQ995286|Xu_H.|Curr.Microbiol.59_113-117_2009|19365688,99.398,338,100
Am1_ORF1,166,166,|[Salmonella enterica subsp. enterica serovar Choleraesuis]dhfr|Dhfr|498|AY551331|Huang_T.M.|Vet.Microbiol.100_247-254_2004|15145503,99.398,338,100
Am1_ORF1,166,166,|[Enterococcus faecalis]|||human male bone|China:GuangzhoudhfrXII|dihydrofolate reductase|498|AB196348|Su_J.|FEMSMicrobiol.Lett.254_75-80_2006|16451182,98.795,337,100
Am1_ORF1,166,178,|[Salmonella enterica subsp. enterica serovar Choleraesuis]|OU7519|hybrid of pSCV50 and pSC138||tri|Tri|534|EU219534|unpubl,100.000,333,98
Am1_ORF1,166,158,|[Escherichia coli]||Rs411b||dhfr17|dihydrofolate reductase|474|DQ875874|Ozugumus_O.B.|J.Microbiol.45_379-387_2007|17978796,36.709,108,95
Am1_ORF1,166,163,|[Escherichia coli]dfr17|dihydrofolate reductase|491|AY828551|Zhang_H.|Microbiol.Immunol.48_639-645_2004|15383699,36.709,108,95
Am1_ORF1,166,158,|[Acinetobacter baumannii]||607460||dfrVII|dihydrofolate reductase|474|EU340417|Xu_X.|Int.J.Antimicrob.Agents32_441-445_2008|18757181,38.608,107,95
Am1_ORF1,166,183,|[Escherichia coli]DfrA17||549|AM886293|unpubl,36.709,108,95
Am1_ORF1,166,158,|[Escherichia coli]||clinical isolate||dhfrIb|dihydrofolate reductase type Ib|474|Z50805|unpubl,37.342,105,95
Am1_ORF2,97,97,|[Serratia marcescens]|unknown|291|AF284063|unpubl,98.969,183,100
Am1_ORF2,97,97,|[Klebsiella pneumoniae]|KpS15unknown|291|AF180731|unpubl,97.938,183,100
Am1_ORF2,97,97,|[Enterococcus faecalis]|||human male bone|China:Guangzhou|putative protein|291|AB196348|Su_J.|FEMSMicrobiol.Lett.254_75-80_2006|16451182,97.938,181,100
Am1_ORF2,97,97,|[uncultured bacterium]|unknown|291|AY115474|Tennstedt_T.|FEMSMicrobiol.Ecol.45_239-252_2003|19719593,94.845,176,100
Am1_ORF2,97,97,|[uncultured bacterium]|||arable soil||hypothetical protein|291|DQ779005|Heuer_H|Environ.Microbiol.9_657-666_2007|17298366,78.351,137,100
Am1_ORF2,97,105,|[Aeromonas salmonicida subsp. salmonicida]|123-86 3101unknown|315|AF327731|L'Abee-Lund_T.M|Microb.DrugResist.7_263-272_2001|11759088,78.351,137,100
Am1_ORF2,97,101,|[Pseudomonas aeruginosa]|orfD|303|AY460181|unpubl,77.174,129,95
Am1_ORF2,97,93,|[Pseudomonas aeruginosa]||4677||MexicoorfD|hypothetical protein|279|EF184217|unpubl,76.087,127,95
Am1_ORF2,97,88,|[Salmonella enterica subsp. enterica serovar Enteritidis]||Homo sapiens||Brazil|unknown|266|DQ278190|Peirano_G.|J.Antimicrob.Chemother.58_305-309_2006|16782743,76.136,122,91
Am1_ORF2,97,97,|[Salmonella enterica subsp. enterica serovar Agona]||Homo sapiens||Brazil|unknown|291|DQ278189|Peirano_G.|J.Antimicrob.Chemother.58_305-309_2006|16782743,76.289,115,100
Am1_ORF3,264,273,|[Klebsiella pneumoniae]|NK29aadA2|819|EF382672|Chen_Y.T.|Antimicrob.AgentsChemother.51_3004-3007_2007|17526756,100.000,531,100
Am1_ORF3,264,264,|[Serratia marcescens]aadA2|streptomycin/spectinomycin adenyltransferase|792|AF284063|unpubl,100.000,530,100
Am1_ORF3,264,264,|[uncultured bacterium]aadA2|streptomycin/spectinomycin adenylyltransferase|792|AY115474|Tennstedt_T.|FEMSMicrobiol.Ecol.45_239-252_2003|19719593,99.621,529,100
Am1_ORF3,264,264,|[Aeromonas hydrophila]|CVM861||diarrheic pig|aada2|aminoglycoside adenyltransferase|792|AY522923|Poole_T.L.|J.Antimicrob.Chemother.57_31-38_2006|16339607,99.621,529,100
Am1_ORF3,264,268,|[Salmonella enterica subsp. enterica serovar Choleraesuis]|OU7519|hybrid of pSCV50 and pSC138||aadA2|AadA2|804|EU219534|unpubl,99.621,528,100
Am1_ORF3,264,270,|[Proteus mirabilis]|C04014|||aadA2|AadA2|810|EU006711|Boyd_D.A.|Antimicrob.AgentsChemother.52_340-344_2008|18025121,99.621,528,100
Am1_ORF3,264,264,|[Proteus mirabilis]aadA2|aminoglycoside adenyltransferase|792|AY681136|unpubl,99.621,528,100
Am1_ORF3,264,264,|[uncultured bacterium]|||arable soil|aadA2|aminoglycoside-3'-adenylyltransferase|792|DQ779004|Heuer_H|Environ.Microbiol.9_657-666_2007|17298366,99.621,527,100
Am1_ORF3,264,264,|[Salmonella typhimurium DT104]|H3380|Salmonella typhimurium phage type DT104||aadA2|streptomycin/spectinomycin adenyltransferase|792|AF071555|Briggs_C.E|Antimicrob.AgentsChemother.43_846-849_1999|10103189,99.242,526,100
Am1_ORF3,264,264,|[Escherichia coli]|W21-3||sewage of swine farm|China: GuangzhouaadA2|streptomycin/spectinomycin 3' adenyltransferase|792|DQ157750|unpubl,98.864,522,100

Table 2 Annotation of the gene-cassettes found in the Am1 amplicon

ORF	start	end	strand	gene	annotation
1	3	500	+	dhfrXII	dihydrofolate reductase type XII
2	612	902	+	orfD	hypothetical protein
3	908	1699	+	aadA2	aminoglycoside adenyltransferase type II

Figure 4. Prevalence of class-1 integrons and distribution of identified gene cassette arrays among Enterobacteriaceae isolates recovered from clean and contaminated rivers in Morelos, Mexico. Results from Jazmín Ramos Madrigal’s bachelor thesis @lcgunam.

7.2.1 HOMEWORK 7

Repeat the BLASTX based identification of gene cassettes for the Ap18 and Ap26 amplicons, extract the corresponding CDSs, translate them and annotate them with BLASTP, writing a table with the single-best annotation for each ORF.

8 Recommended readings

8.1 Homologs and homology searching with BLAST

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. (1990). Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403-10. doi: 10.1016/S0022-2836(05)80360-2
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. (1997). Nucleic Acids Res. 1997 Sep 1;25(17):3389-402. doi: 10.1093/nar/25.17.3389
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. BLAST+: architecture and applications. BMC Bioinformatics. 2009 Dec 15;10:421
BLAST command-line manual
- BLAST cookbook
Ian Korf, Mark Yandell, Joseph Bedell (2003). BLAST. O’Reilly Media, Inc. ISBN: 9780596002992
Hu G, Kurgan L. Sequence Similarity Searching. Curr Protoc Protein Sci. 2019 Feb;95(1):e71. doi: 10.1002/cpps.71
Koonin EV. Orthologs, paralogs, and evolutionary genomics. Annu Rev Genet. 2005;39:309-38

8.2 Protein sequence and structure databases

UniProt Consortium - UniProt: the Universal Protein Knowledgebase in 2023
RCSB Protein Data Bank - Documentation
AlphaFold Protein Structure Database - about

8.3 Ras superfamily and Rab-GTPases

Rojas AM, Fuentes G, Rausell A, Valencia A (2012). The Ras protein superfamily: evolutionary tree and role of conserved amino acids. J Cell Biol. 2012 Jan 23;196(2):189-201.
Krister Wennerberg, Kent L Rossman, Channing J Der (2005). The Ras superfamily at a glance J Cell Sci 2005 Mar 1;118(Pt 5):843-6.doi: 10.1242/jcs.01660.
Yan Zhen, Harald Stenmark (2015). Cellular functions of Rab GTPases at a glance. J Cell Sci Sep 1;128(17):3171-6. doi: 10.1242/jcs.166074. Epub 2015 Aug 13.
Borchers AC, Langemeyer L, Ungermann C. (2021). Who’s in control? Principles of Rab GTPase activation in endolysosomal membrane trafficking and beyond. J. Cell Biol. 2021 Sep 6;220(9):e202105120.

8.4 Cellular microbiology of Acanthamoeba-Stenotrophomonas interactions

Javier Rivera, Julio C. Valerdi-Negreros, Diana M. Vázquez-Enciso, Fulvia-Stefany Argueta-Zepeda, Pablo Vinuesa* 2024. Phylogenomic, structural, and cell-biological analyses reveal that Stenotrophomonas maltophilia replicates in acidified Rab7A-positive vacuoles of Acanthamoeba catellanii. Microbiol. Spectr. 2024 Mar 5;12(3):e0298823. doi: 10.1128/spectrum.02988-23. Epub 2024 Feb 6..

8.5 Integrons

Michael R Gillings (2014). Integrons: past, present, and future . Microbiol Mol Biol Rev. 78(2):257-77. doi: 10.1128/MMBR.00056-13.
Sally R Partridge, Stephen M Kwong, Neville Firth, Slade O Jensen (2018). Mobile Genetic Elements Associated with Antimicrobial Resistance. Clin Microbiol Rev. 2018 Aug 1;31(4):e00088-17. doi: 10.1128/CMR.00088-17.

Running and parsing BLAST+ using the Linux command line

Pablo Vinuesa, CCG-UNAM, Cuernavaca, México; @pvinmex

2025-08-03