The following tutorials teach how to run NCBI [legacy-blast and] BLAST+ from the command line, along with useful AWK, Bash, Perl and R code and idioms, including 1-liner programs and fully fledged scripts to parse and analyze results efficiently. Particularly helpful is the advanced compute_RBH_clusters.sh script.
This tutorial expects that the user has access to a Unix/Linux environment with the standard Bash shell installed, as well as the [legacy- and] blast+ software suite(s), which are freely available for download from the NCBI FTP server
For further instruction on installing and using the BLAST+ suite, read NCBI’s BLAST Book and NCBI’s The Statistics of Sequence Similarity Scores tutorial.
BLAST finds regions of similarity between a query and subject sequences. The program compares nucleotide or protein query (problem) sequences to sequence databases and calculates the statistical significance of the resulting query-subject pairwise alignments expressed as an Expect (E-value).
To run a BLAST SEARCH we must select the proper blast program based on the nature of our query sequence(s) and that of the database to search for homologs.
query | database | program | note |
---|---|---|---|
DNA | DNA | blastn | - |
PROT | PROT | blastp | - |
DNA | PROT | blastx | The DNA query is translated in the 6 frames and the products searched against protein DB |
PROT | DNA | tblastn | The DNA DB (genome) is translated in the 6 frames and queried with protein sequences |
DNA | DNA | tblastx | The DNA query and DNA DB are both translated in the 6 frames and searches performed at the protein level |
Table 1. Five frequently used BLAST programs.
Figure 1. The four standard BLAST programs most frequently used for database searching.
Query sequences are stored in a FASTA-formatted file. In addition to the programs listed in table 1, the blast+ software suite also provides the makeblastdb and blastdbcmd programs to make custom DNA or protein databases (makeblastdb) and retrieve the hits from the corresponding blast database found in a blast search (blastdbcmd).
The complete set of blast+ binaries in the ncbi-blast-2.14.1+-x64-linux.tar.gz release are:
blastdb_aliastool blastdbcheck blastdbcmd blast_formatter blast_formatter_vdb blastn blastn_vdb blastp blast_vdb_cmd blastx cleanup-blastdb-volumes.py convert2blastmask deltablast dustmasker get_species_taxids.sh legacy_blast.pl makeblastdb makembindex makeprofiledb psiblast rpsblast rpstblastn segmasker tblastn tblastn_vdb tblastx update_blastdb.pl windowmasker
To assess whether a given alignment constitutes evidence for homology, it helps to know how strong an alignment can be expected from chance alone.
In 1990, Samuel Karlin and Stephen Altschul published a paper describing the statistics underlying local alignment scores based on the hit expectancy value or E-value.
\[ E = kmne^{-{\lambda}S} \] This equation states that the number of alignments expected by chance (\(E\)) during a blast search for a given high-scoring pair (HSP) with a raw score (\(S\)), is a function of the size of the search space (effective query sequence length \(m\) * effective length of the database \(n\) (in number of residues)), the normalized score of the hit (\({\lambda}*S\)), and a minor constant \(k\).
The following formula provides the relationship between both values:
\[P = 1 -e^{-E}\]
Note that for values \(<0.001\), the E-value and the P-value for a given HSP are identical, as we will demonstrate graphically in the section “Using base R to explore BLAST tabular output”.
Read the handy and straightforward NCBI’s Statistics of Sequence Similarity Scores tutorial to learn more about KA-statistics.
I have also found these scribes on BLAST and Karlin-Altschul Statistics from Brown University useful.
The following code will set up our working directories for the tutorials and fetch the required sequences.
# i) Download or copy the sequence data to your working directories for this practice
cd && [ ! -d sesion2_blast ] && mkdir sesion2_blast && cd sesion2_blast
# ii) use wget on the command line to fetch the required files from my TIB-filoinfo GitHub repo
wget -c https://github.com/vinuesa/TIB-filoinfo/raw/master/docs/sesion3_BLAST/data/16S_4blastN.tgz
wget -c https://github.com/vinuesa/TIB-filoinfo/raw/master/docs/sesion3_BLAST/data/gene_discovery_and_annotation_using_blastx.tgz
wget -c https://github.com/vinuesa/TIB-filoinfo/raw/master/docs/sesion3_BLAST/data/UniProt_proteomes.tgz
# iii) Generate two subdirectories to hold the data and analysis results
mkdir BLAST_DB_AA BLAST_DB_NT UniProt_proteomes
# iv) Save the path to our parental working directory in a variable
wkdir=$(pwd)
echo $wkdir
# v) Move each *tgz file to the proper dir
mv 16S_4blastN.tgz BLAST_DB_NT
mv UniProt_proteomes.tgz UniProt_proteomes
mv gene_discovery_and_annotation_using_blastx.tgz BLAST_DB_AA
We are set to start the hands-on tutorials on running BLAST from the command line.
The exercise aims to teach the basics of a BLAST analysis. We will generate an indexed BLAST database of nucleotide sequences and interrogate it using blastn.
We will use the following data for the exercise.
The objective of the exercise is to classify the query sequences in 16S_problema.fna at the genus level based on BLAST statistics.
Move into the directory holding the nucleotide sequences and explore both files using standard shell filtering tools and modify the FASTA headers in 16S_seqs4_blastDB.fna to make them suitable for formatdb|makeblastdb indexing.
# We need to untar & uncompress (gunzip) the compressed tar file
cd BLAST_DB_NT/ && tar -xvzf 16S_4blastN.tgz
# 1) explore the fasta headers:
grep '>' 16S_seqs4_blastDB.fna |less
# 1.1) How many genera and species per genus are found in the source file 16S_seqs4_blastDB.fna?
# i. the genera
grep '>' 16S_seqs4_blastDB.fna | cut -d_ -f1 | sort | uniq -c | sort -nrk1
# ii. the species
grep '>' 16S_seqs4_blastDB.fna | cut -d_ -f1,2 | sort | uniq -c | sort -nrk1
# 1.2) Are the sequences in a suitable format for makeblastdb indexing?
# What command would you use to inspect the sequences?
# 1.2.1) generate a proper FASTA header for db indexing
perl -pe 'if( /^>/ ){ $c++; s/>/>gnl|16S_DB|$c / }' 16S_seqs4_blastDB.fna | grep '^>'
perl -pe 'if( /^>/ ){ $c++; s/>/>gnl|16S_DB|$c / }' 16S_seqs4_blastDB.fna > 16S_seqs4_blastDB.fnaed
Note: The commented command lines in the following blocks provide the code to run legacy blast. The exercises are demonstrated using the blast+ suite of programs.
You can display the compact or detailed help formats of any blast+
program using blastprogramx -h
or blastprogramx
-help
, respectively.
#formatdb -i 16S_seqs4_blastDB.fnaed -p F -o T
# display the compact help format
makeblastdb -h
# display the detailed help format
makeblastdb -help
# run makeblastdb to generate an indexed database of nucleotide sequences
makeblastdb -in 16S_seqs4_blastDB.fnaed -dbtype nucl -parse_seqids
The simplest blastn (or any other standard blast search
program) invocation requires only two parameters: blastn -query
<FASTA_file> -db <database_file>
.
The following blastn call will search the
16S_seqs4_blastDB.fnaed database we generated in the previous section
with the 16S_problema.fna query sequences, using all default values of
the search parameters, as defined in the blastn -help
output. This includes the standard output format familiar to users of
the Web implementation of NCBI-BLAST.
The output (truncated to save space) has the following sections:
blastn -query 16S_problema.fna -db 16S_seqs4_blastDB.fnaed
BLASTN 2.14.1+ Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller (2000), "A greedy algorithm for aligning DNA sequences", J Comput Biol 2000; 7(1-2):203-14. Database: 16S_seqs4_blastDB.fnaed 604 sequences; 880,434 total letters Query= 13 Length=1359 Score E Sequences producing significant alignments: (Bits) Value 16S_DB:480 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928 2475 0.0 16S_DB:573 Mycobacterium_arupense_T|AR30097|DQ157760 2453 0.0 16S_DB:501 Mycobacterium_hiberniae_T|ATCC_9874|X67096 2407 0.0 16S_DB:592 Mycobacterium_senuense_T|05-832|DQ536408 2355 0.0 16S_DB:583 Mycobacterium_kumamotonense_T|CST7274_(=_GTC2729)|AB23... 2350 0.0 16S_DB:579 Mycobacterium_pallens_T|czh-8|DQ370008 2252 0.0 16S_DB:531 Mycobacterium_asiaticum_T|ATCC_25276|AF480595 2244 0.0 ... TRUNCATED >16S_DB:480 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928 Length=1466 Score = 2475 bits (1340), Expect = 0.0 Identities = 1352/1359 (99%), Gaps = 2/1359 (0%) Strand=Plus/Plus Query 1 GTACTCGAGTGGCGAACGGGTGAGTAACACGTGGGTGATCTGCCCTGCATCTTGGGATAA 60 ||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||| Sbjct 48 GTACTCGAGTGGCGAACGGGTGAGTAACACGTGGGTGATCTGCCCTGCACTTTGGGATAA 107 Query 61 GCCTGGGAAACTGGGTCTAATACCGAATAGGACCGCATGCTGCATGGTGTGTGGTGGAAA 120 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 108 GCCTGGGAAACTGGGTCTAATACCGAATAGGACCGCATGCTGCATGGTGTGTGGTGGAAA 167 Query 121 GCTTTTTGCGGTGTGGGATGGGCCCGCGGCCTATCAGCTTGTTGGTGGGGTAATGGCCTA 180 || ||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 168 GC-TTTTGCGGTGTGGGATGGGCCCGCGGCCTATCAGCTTGTTGGTGGGGTAATGGCCTA 226 Query 181 CCAAGGCGACGACGGGTAGCCGGCCTGAGAGGGTGTCCGGCCACACTGGGACTGAGATAC 240 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 227 CCAAGGCGACGACGGGTAGCCGGCCTGAGAGGGTGTCCGGCCACACTGGGACTGAGATAC 286 ... TRUNCATED >16S_DB:102 Aeromonas_enteropelogenes_T|DSM_6394|X71121 Length=1490 Score = 920 bits (498), Expect = 0.0 Identities = 970/1204 (81%), Gaps = 23/1204 (2%) Strand=Plus/Plus Query 153 ATCAGCTTGTTGGTGGGGTAATGGCCTACCAAGGCGACGACGGGTAGCCGGCCTGAGAGG 212 || |||| ||||||| ||||||||| ||||||||||||| |||| || |||||||| Sbjct 235 ATTAGCTAGTTGGTGAGGTAATGGCTCACCAAGGCGACGATCCCTAGCTGGTCTGAGAGG 294 Query 213 GTGTCCGGCCACACTGGGACTGAGATACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGG 272 || | |||||||||| ||||||| |||| ||||||||||||||||||||||||||||| Sbjct 295 ATGATCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGG 354 Query 273 AATATTGCACAATGGGCGCAAGCCTGATGCAGCGACGCCGCGTGGGGGATGACGGCCTTC 332 |||||||||||||||| | || ||||||||||| | |||||||| | || || ||||||| Sbjct 355 AATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTC 414 Query 333 GGGTTGTAAACCTCTTTCAGTATCGGCG-AAGCTCCA-AGGT--TTTCT-TTGG-GGTGA 386 |||||||||| | ||||||| | | ||| || || | | ||| ||| |||| Sbjct 415 GGGTTGTAAAGCACTTTCAGCGAGGAGGAAAGGTCAGTAGCTAATATCTGCTGGCTGTGA 474 Query 387 CGGTAGGTACAGAAGAAGCACCGGCCAACTACGTGCCAGCAGCCGCGGTAATACGTAGGG 446 || || |||||||||||||||| |||| |||||||||||||||||||||||| |||| Sbjct 475 CGTTACTCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGG 534 ... TRUNCATED Lambda K H 1.33 0.621 1.12 Gapped Lambda K H 1.28 0.460 0.850 Effective search space used: 946715250 Database: 16S_seqs4_blastDB.fnaed Posted date: Aug 6, 2022 3:00 PM Number of letters in database: 880,434 Number of sequences in database: 604 Matrix: blastn matrix 1 -2 Gap Penalties: Existence: 0, Extension: 2.5
Note that the header and footer sections of the standard blast output contain essential information (m, n, k, \(\lambda\)) required to compute the E-value (see Karlin-Altschul equation \(E = kmne^{-{\lambda}S}\)**, as detailed in a previous section).
The following little AWK one-liner helps extract key
statistical parameters from the standard output of a blast[npx] run. The
AWK program has the following structure: /pattern/,
/pattern/; /pattern/, /pattern/{action}{action}
.
blastn -query 16S_problema.fna -db 16S_seqs4_blastDB.fnaed -num_alignments 1 | awk '/Database/, /total letters/; /^Lambda/, /Extension:/{seen++; print}{ if(seen>5) exit}'
Database: 16S_seqs4_blastDB.fnaed 604 sequences; 880,434 total letters Lambda K H 1.33 0.621 1.12 Gapped Lambda K H 1.28 0.460 0.850
A standard blastn search output can be lengthy and excessively wordy. For the aim of the exercise, it is more convenient to use a tabular output format reporting only the best hit found in the database, as we want to classify our unknown 16S rDNA sequences based on their best database hit.
This can be easily achieved with the following blastn call:
# 2.1) run blastn (blastall -p blastn), and get only the best hit (-b 1) in tabular format (-m 8).
# blastall -p blastn -i 16S_problema.fna -b 1 -a 6 -d 16S_seqs4_blastDB.fnaed -m 8 > 16S_problema_blastN_b1_m8.tab
blastn -query 16S_problema.fna -db 16S_seqs4_blastDB.fnaed -outfmt 6 -num_alignments 1 > 16S_problema_blastN_maln1_m6.tab
# 2.2) explore the table structure
head 16S_problema_blastN_maln1_m6.tab
13 16S_DB:480 99.485 1359 5 2 1 1359 48 1404 0.0 2475 18 16S_DB:494 99.033 1344 10 3 1 1344 49 1389 0.0 2416 2 16S_DB:549 99.552 1340 5 1 4 1343 81 1419 0.0 2440 27 16S_DB:546 100.000 1342 0 0 1 1342 78 1419 0.0 2479 36 16S_DB:480 99.705 1358 3 1 1 1358 48 1404 0.0 2490 4 16S_DB:600 99.335 1354 4 4 1 1353 79 1428 0.0 2446 48 16S_DB:484 98.375 1354 20 1 2 1355 48 1399 0.0 2390 53 16S_DB:548 99.925 1342 1 0 1 1342 78 1419 0.0 2473 62 16S_DB:496 99.106 1342 8 4 1 1342 85 1422 0.0 2409 7 16S_DB:600 98.736 1345 14 3 1 1343 79 1422 0.0 2386
As can be seen in the output above, the -outfmt 6
tabular format has no header.
The default -outfmt 6
blast output contains the
following 12 fields
# The default m6 blast output fields #===================================================================================== # >>> The default -outfmt 6 | -m 8 blast output contains the following 12 fields <<< # qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore #------------------------------------------------------------------------------------- # 1. qseqid query (e.g., gene) sequence id # 2. sseqid subject (e.g., reference genome) sequence id # 3. pident percentage of identical matches # 4. length alignment length # 5. mismatch number of mismatches # 6. gapopen number of gap openings # 7. qstart start of alignment in query # 8. qend end of alignment in query # 9. sstart start of alignment in subject #10. send end of alignment in subject #11. evalue expect value #12. bitscore bit score #-------------------------------------------------------------------------------------
# 2.3 Add a header to the standard output; print to file and append to 16S_problema_blastN_maln1_m6.tab
echo -e "qseqid\tsseqid\tpident\tlength\tmismatch\tgapopen\tqstart\tqend\tsstart\tsend\tevalue\tbitscore" > standard_blast_fields.tsv
cat standard_blast_fields.tsv 16S_problema_blastN_maln1_m6.tab > t && mv t 16S_problema_blastN_maln1_m6.tab
head -20 16S_problema_blastN_maln1_m6.tab | column -t
# 2.4) how many unique hits are there?
cut -f 2 16S_problema_blastN_maln1_m6.tab | sed '1d' | sort | uniq -c
# Generate the list of hit IDs for each query
# NOTE: the hits are in the same order as the query sequences
cut -f2 16S_problema_blastN_maln1_m6.tab | sed '1d' > IDs4blastdbcmd.txt
#blastdbcmd -i IDs4blastdbcmd.txt -d 16S_seqs4_blastDB.fnaed | grep '>' |cut -d\| -f3 | cut -d' ' -f2 > hit_sequences_b1.fna
blastdbcmd -entry_batch IDs4blastdbcmd.txt -db 16S_seqs4_blastDB.fnaed > best-hit_sequences.fna
# 1) Generate a list of the hits from the FASTA headers
grep '>' best-hit_sequences.fna | cut -d' ' -f2 > hit_sequence_headers.txt
# 2) Classify our query sequences, based on the subject headers of each best hit, using a simple tabular output format
cut -f1 16S_problema_blastN_maln1_m6.tab | sed '1d' > problem_seqs.txt
paste problem_seqs.txt hit_sequence_headers.txt > classified_16S_problema_sequences.tab
head -5 classified_16S_problema_sequences.tab
13 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928 18 Mycobacterium_smegmatis_T|ATCC_19420|AJ131761 2 Mycobacterium_peregrinum_T|CIP_105382T|AY457069 27 Mycobacterium_fortuitum_T|CIP_104534T|AY457066 36 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928
A more efficient and idiomatic (“Bashish”) code to achieve the same
result, without having to write intermediate files, is the use of
“process substitution”, which has this general syntax:
<(any comamnd(s) or pipeline here)
. Process substitution
replaces the command with its output, treating it as if it were stored
in a file. This output can therefore be used by commands like \(cat\), \(paste\), or any other that works on files.
Let us see it in action:
paste <(cut -f1 16S_problema_blastN_maln1_m6.tab | sed '1d') <(grep '>' best-hit_sequences.fna | cut -d' ' -f2) | head | column -t
13 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928 18 Mycobacterium_smegmatis_T|ATCC_19420|AJ131761 2 Mycobacterium_peregrinum_T|CIP_105382T|AY457069 27 Mycobacterium_fortuitum_T|CIP_104534T|AY457066 36 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928 4 Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270 48 Mycobacterium_gordonae_T|ATCC_14470.|X52923 53 Mycobacterium_neworleansense_T|ATCC_49404|AY457068 62 Mycobacterium_chitae_T|ATCC_19627|X55603 7 Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270 ... TRUNCATED
The previous result looks great but needs to include critical statistical information about the robustness of the hits. Let us use the “process substitution” trick that we learned in the previous example to add extra fields from the blast output table (pident, length, e-value, bitscore) to our summary table, including a header.
# 3) add also columns 3, 4, 11, 12 (pident, length,e-value & bit score) to the tabular output, including a header
cat <(echo -e "sid\tpid\tlen\tEval\tscore\ttax") <( paste <(cut -f1,3,4,11,12 16S_problema_blastN_maln1_m6.tab | sed '1d') <(grep '>' best-hit_sequences.fna | cut -d' ' -f2) | head) | column -t
sid pid len Eval score tax 13 99.485 1359 0.0 2475 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928 18 99.033 1344 0.0 2416 Mycobacterium_smegmatis_T|ATCC_19420|AJ131761 2 99.552 1340 0.0 2440 Mycobacterium_peregrinum_T|CIP_105382T|AY457069 27 100.000 1342 0.0 2479 Mycobacterium_fortuitum_T|CIP_104534T|AY457066 36 99.705 1358 0.0 2490 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928 4 99.335 1354 0.0 2446 Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270 48 98.375 1354 0.0 2390 Mycobacterium_gordonae_T|ATCC_14470.|X52923 53 99.925 1342 0.0 2473 Mycobacterium_neworleansense_T|ATCC_49404|AY457068 62 99.106 1342 0.0 2409 Mycobacterium_chitae_T|ATCC_19627|X55603 7 98.736 1345 0.0 2386 Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270 ... TRUNCATED
Note: we are using Linux commands to achieve a task
that can be performed directly within BLAST, using a customized
tabular output, as we will see in the final section on BLASTN,
and explained in the blast[n|p] -help
.
It is important to realize that finding a hit in the database does not necessarily mean that it supports a reliable identification or classification of the query sequence. In order to have a reasonably high confidence in the classification of 16S rDNA sequences at the genus level, based on the best \(blastn\) hit, it should at least satisfy the following conditions:
Let us first explore the BLAST table to identify hits that do not satisfy the two criteria defined above (pident >= 95% && length >= 500). \(AWK\) is ideal for this task.
# we could use any of the following lines to print the results, including a header
# cat <(cat standard_blast_fields.tsv) <(awk 'BEGIN{FS=OFS="\t"}$3 < 95 || $4 < 500' 16S_problema_blastN_maln1_m6.tab) | column -t
awk 'BEGIN{FS=OFS="\t"; print "qseqid\tsseqid\tpident\tlength\tmismatch\tgapopen\tqstart\tqend\tsstart\tsend\tevalue\tbitscore"}$3 < 95 || $4 < 500' 16S_problema_blastN_maln1_m6.tab | column -t
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore 1367SAA005_14Cip-FD1.b1.ab1_638 16S_DB:86 93.218 634 39 4 1 632 37 668 0.0 929 1367SAA005_A17Cip-FD1.b1.ab1_171 16S_DB:86 95.930 172 6 1 1 171 35 206 2.51e-76 278 1367SAA005_A6BSm-FD1.b1.ab1_477 16S_DB:86 98.745 478 5 1 1 477 37 514 0.0 848 1367SAA005_S13-FD1.b1.ab1_121 16S_DB:534 88.333 120 11 3 3 121 64 181 2.33e-35 141 1367SAA005_S2CbTm-FD1.b1.ab1_84 16S_DB:269 97.647 85 1 1 1 84 47 131 1.15e-36 145 1367SAA005_S4CbTm-FD1.b1.ab1_172 16S_DB:269 100.000 172 0 0 1 172 52 223 1.49e-88 318 1367SAA005_S6Sm-FD1.b1.ab1_347 16S_DB:184 99.143 350 0 3 1 347 40 389 0.0 627 1367SAA005_S9b-FD1.b1.ab1_142 16S_DB:3 97.887 142 3 0 1 142 40 181 1.60e-67 248 1367SAA005_Tm203-FD1.b1.ab1_491 16S_DB:128 97.942 486 9 1 3 487 52 537 0.0 841 1367SAA005_VNP1-FD1.b1.ab1_578 16S_DB:435 94.965 576 28 1 3 578 31 605 0.0 902
In order to get a confident blastn-based classification of our
problem sequences at the genus level, we should use only hits with
pident >= 95
% and a minimal alignment
length >= 500
bp.
paste <(awk 'BEGIN{FS=OFS="\t"}NR > 1{print $1,$3,$4,$11,$12}' 16S_problema_blastN_maln1_m6.tab) <(grep '>' best-hit_sequences.fna | cut -d' ' -f2) | awk 'BEGIN{FS=OFS="\t"; print "sid\tpid\tlen\tEval\tscore\tbest_hit"}$2 >= 95 && $3 >= 500' > 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv
Display the table’s head and tail 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv
cat <(head 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv) <(tail 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv) | column -t
sid pid len Eval score best_hit 13 99.485 1359 0.0 2475 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928 18 99.033 1344 0.0 2416 Mycobacterium_smegmatis_T|ATCC_19420|AJ131761 2 99.552 1340 0.0 2440 Mycobacterium_peregrinum_T|CIP_105382T|AY457069 27 100.000 1342 0.0 2479 Mycobacterium_fortuitum_T|CIP_104534T|AY457066 36 99.705 1358 0.0 2490 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928 4 99.335 1354 0.0 2446 Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270 48 98.375 1354 0.0 2390 Mycobacterium_gordonae_T|ATCC_14470.|X52923 53 99.925 1342 0.0 2473 Mycobacterium_neworleansense_T|ATCC_49404|AY457068 62 99.106 1342 0.0 2409 Mycobacterium_chitae_T|ATCC_19627|X55603 1367SAA005_S11Sm-FD1.b1.ab1_1049 98.763 1051 0.0 1866 Pseudomonas_monteilii_T|CIP_104883|AF064458 1367SAA005_S12-FD1.b1.ab1_899 98.210 894 0.0 1559 Microbacterium_testaceum_T|DSM_20166|X77445 1367SAA005_S15-FD1.b1.ab1_1111 98.115 1114 0.0 1936 Microbacterium_testaceum_T|DSM_20166|X77445 1367SAA005_S1CbTm-FD1.b1.ab1_683 99.227 647 0.0 1166 Pseudomonas_monteilii_T|CIP_104883|AF064458 1367SAA005_S1-FD1.b1.ab1_542 99.815 542 0.0 996 Pseudomonas_plecoglossicida_T|FPC951|AB009457 1367SAA005_S7CbTm-FD1.b1.ab1_961 98.954 956 0.0 1709 Citrobacter_freundii_T|DSM_30039|AJ233408 1367SAA005_S9-FD1.b1.ab1_1091 99.451 1092 0.0 1984 Cellulosimicrobium_funkei_T|W6122|AY501364 1367SAA005_VNC15-FD1.b1.ab1_1062 99.052 1055 0.0 1890 Rhizobium_alamii_T|type_strain:GBV016|AM931436 1367SAA005_VRP15_bis-FD1.b1.ab1_1124 99.822 1125 0.0 2065 Ensifer_terangae_T|type_strain:_LMG_7834|X68388 1367SAA005_VRP15-FD1.b1.ab1_1112 99.012 1113 0.0 1993 Ensifer_terangae_T|type_strain:_LMG_7834|X68388
The previous examples illustrated how to efficiently combine \(Shell\) and \(Linux\) tools to process BLAST output tables. With a bit of practice, you will soon master these power tools.
Note: We are using Linux commands to achieve a task
that can be performed directly within BLAST, using a customized
tabular output, as we will see in the final section on BLASTN
“making-use-of-a-customized-tabular-output-with-selected-fields-for-a-slick-and-neat-blastn-based-classification-strategy-of-16s-rdna-sequences”,
and explained in the blast[n|p] -help
.
Finally, let us generate some simple summary statistics of the classified sequences.
# 6 How many sequences could be classified with high confidence?
sed '1d' 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv | wc -l # 42
42
# 7 How many sequences per genus could be classified with high confidence?
sed '1d' 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv | cut -f6 | sed 's/_.*$//' | sort | uniq -c | sort -nrk1
14 Mycobacterium 5 Rhizobium 5 Ensifer 4 Pseudomonas 2 Microbacterium 2 Escherichia/Shigella 2 Beijerinckia 1 Sphingomonas 1 Sinorhizobium 1 Shinella 1 Isoptericola 1 Citrobacter 1 Cellulosimicrobium 1 Brevundimonas 1 Acinetobacter
The strategy shown above works perfectly well but could be more
succinct. However, it helped demonstrate interesting Bash
idioms, like the process substitution trick we learned
to avoid writing intermediate files to disk. This strategy, or similar,
was required to add subject title information to the tabular output
generated by the legacy blast programs run with -m 8 (tabular output).
The current BLAST+ suite is much more powerful and
versatile in multiple ways, including the impressive amount of
information from HSPs that can be added to tabular outputs
(-outfmt 6|7|10
), as can be learned from the
blast[npx] -help
output.
We will use this feature to significantly simplify the 16S rDNA sequence classification strategy based on best blastn hits satisfying some minimal alignment criteria. In particular, we will use the stitle (subject title) column in the output, which contains the taxonomic information from the FASTA header. In summary, we will run a new blastn search including selected output fields to avoid retrieving the subject hit FASTA files with fastadbcmd, and grep out their header lines to paste them to the table.
To print a tabular output with custom fields use the following
syntax: blastn -outfmt ‘6 keyword1 keyword2 …’
.
The following blastn call with customized tabular output,
containing the desired ‘qseqid stitle pident length evalue score’ fields
will do the trick. We will also add the query length ‘qlen’ and query
coverage ‘qcovs’ fields as the final two columns. We filter the
blastn output with AWK to print only HSPs with
pident >= 95 && length > 500
and finally
format the output with column -t
for a tidy columnar
display on the terminal. That’s it, slick and neat.
blastn -query 16S_problema.fna -db 16S_seqs4_blastDB.fnaed -outfmt '6 qseqid stitle pident length evalue score qlen qcovs' -num_alignments 1 | awk 'BEGIN{FS=OFS="\t"; print "seqid\tstitle\tpident\tlength\tevalue\tscore\tqlen\tqcovs"}$3 >= 95 && $4 >= 500' | column -t
seqid stitle pident length evalue score qlen qcovs 13 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928 99.485 1359 0.0 1340 1359 100 18 Mycobacterium_smegmatis_T|ATCC_19420|AJ131761 99.033 1344 0.0 1308 1347 99 2 Mycobacterium_peregrinum_T|CIP_105382T|AY457069 99.552 1340 0.0 1321 1343 99 27 Mycobacterium_fortuitum_T|CIP_104534T|AY457066 100.000 1342 0.0 1342 1342 100 36 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928 99.705 1358 0.0 1348 1358 100 4 Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270 99.335 1354 0.0 1324 1354 99 48 Mycobacterium_gordonae_T|ATCC_14470.|X52923 98.375 1354 0.0 1294 1356 99 53 Mycobacterium_neworleansense_T|ATCC_49404|AY457068 99.925 1342 0.0 1339 1342 100 62 Mycobacterium_chitae_T|ATCC_19627|X55603 99.106 1342 0.0 1304 1342 100 7 Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270 98.736 1345 0.0 1292 1350 99 8 Mycobacterium_mantenii_T|NLA000401474|FJ042897 99.926 1351 0.0 1347 1350 100 85 Mycobacterium_avium_T|ATCC_49884|EF521891 98.968 1357 0.0 1313 1356 100 95 Mycobacterium_goodii_T|M069|Y12872 99.702 1344 0.0 1334 1344 100 96 Mycobacterium_arupense_T|AR30097|DQ157760 99.705 1358 0.0 1345 1358 100 1367SAA005_2NXC111R-FD1.b1.ab1_579 Rhizobium_loessense_T|CCBAU_7190B|AF364069 97.396 576 0.0 530 579 99 1367SAA005_2NXC11R-FD1.b1.ab1_1153 Shinella_yambaruensis_T|MS4|AB285481 96.766 1144 0.0 1031 1153 99 1367SAA005_2NXC14R-FD1.b1.ab1_578 Rhizobium_loessense_T|CCBAU_7190B|AF364069 97.751 578 0.0 538 578 100 1367SAA005_A17-FD1.b1.ab1_1072 Sphingomonas_aromaticivorans_T|DSM_12444|CP000248 97.577 1073 0.0 994 1072 100 1367SAA005_A1b-FD1.b1.ab1_1010 Pseudomonas_plecoglossicida_T|FPC951|AB009457 98.811 1009 0.0 972 1010 99 1367SAA005_A2CbTm-FD1.b1.ab1_908 Escherichia/Shigella_flexneri_T|X96963 99.008 907 0.0 880 908 99 1367SAA005_A5-FD1.b1.ab1_597 Isoptericola_variabilis_T|MX5|AJ298873 97.138 594 0.0 541 597 99 1367SAA005_A7-FD1.b1.ab1_763 Brevundimonas_nasdae_T|GTC1043|AB071954 98.949 761 0.0 737 763 99 1367SAA005_A7Sm-FD1.b1.ab1_896 Acinetobacter_baumannii_T|DSM_30007|X81660 96.763 865 0.0 780 896 97 1367SAA005_A9Cip-FD1.b1.ab1_579 Escherichia/Shigella_flexneri_T|X96963 99.481 578 0.0 569 579 99 1367SAA005_ACOP7-FD1.b1.ab1_1110 Rhizobium_hainanense_T|I66|U71078 97.212 1112 0.0 1015 1110 100 1367SAA005_AYOT11_bis-FD1.b1.ab1_1118 Beijerinckia_fluminensis_T|UQM_1685|EU401907 98.749 1119 0.0 1076 1118 100 1367SAA005_AYOT11-FD1.b1.ab1_1069 Beijerinckia_fluminensis_T|UQM_1685|EU401907 99.147 1055 0.0 1027 1069 99 1367SAA005_CP6-FD1.b1.ab1_1117 Ensifer_saheli_T|LMG_7837|X68390 98.743 1114 0.0 1070 1117 99 1367SAA005_CP7_bis-FD1.b1.ab1_819 Ensifer_mexicanus_T|ITTG-R7|DQ411930 98.862 791 0.0 763 819 96 1367SAA005_CP7-FD1.b1.ab1_1035 Sinorhizobium_americanum_T|CFNEI_156|AF506513 98.551 1035 0.0 988 1035 99 1367SAA005_NXC126R-FD1.b1.ab1_1017 Ensifer_adhaerens_T|LMG_20216|AM181733 98.527 1018 0.0 972 1017 99 1367SAA005_PROO4-FD1.b1.ab1_539 Rhizobium_loessense_T|CCBAU_7190B|AF364069 97.407 540 0.0 493 539 100 1367SAA005_S11Sm-FD1.b1.ab1_1049 Pseudomonas_monteilii_T|CIP_104883|AF064458 98.763 1051 0.0 1010 1049 100 1367SAA005_S12-FD1.b1.ab1_899 Microbacterium_testaceum_T|DSM_20166|X77445 98.210 894 0.0 844 899 99 1367SAA005_S15-FD1.b1.ab1_1111 Microbacterium_testaceum_T|DSM_20166|X77445 98.115 1114 0.0 1048 1111 100 1367SAA005_S1CbTm-FD1.b1.ab1_683 Pseudomonas_monteilii_T|CIP_104883|AF064458 99.227 647 0.0 631 683 95 1367SAA005_S1-FD1.b1.ab1_542 Pseudomonas_plecoglossicida_T|FPC951|AB009457 99.815 542 0.0 539 542 100 1367SAA005_S7CbTm-FD1.b1.ab1_961 Citrobacter_freundii_T|DSM_30039|AJ233408 98.954 956 0.0 925 961 99 1367SAA005_S9-FD1.b1.ab1_1091 Cellulosimicrobium_funkei_T|W6122|AY501364 99.451 1092 0.0 1074 1091 99 1367SAA005_VNC15-FD1.b1.ab1_1062 Rhizobium_alamii_T|type_strain:GBV016|AM931436 99.052 1055 0.0 1023 1062 99 1367SAA005_VRP15_bis-FD1.b1.ab1_1124 Ensifer_terangae_T|type_strain:_LMG_7834|X68388 99.822 1125 0.0 1118 1124 100 1367SAA005_VRP15-FD1.b1.ab1_1112 Ensifer_terangae_T|type_strain:_LMG_7834|X68388 99.012 1113 0.0 1079 1112 100
This final blastn example demonstrates the importance of knowing our tools well, meaning that we should always take the time to carefully read the software’s documentation to understand and exploit all its key features.
Repeat the blastn-based classification exercise based on 16S rDNA sequences using 10 hits per query. Impose the same filters applied in the previous exercise.
Repeat the blastn-based classification exercise based on 16S rDNA sequences using the NCBI nucletotide BLAST portal. Read the BLAST-help documentation if you need assistance.
BLASTP is a fundamental program in genomic and phylogenomic research, as it is one of the main tools used to:
# Move into the UniProt_proteomes dir and extract the sequences from the provided tgz file
cd $wkdir && cd UniProt_proteomes
tar -xvzf UniProt_proteomes.tgz
The family of small Rab GTPases belongs to the Ras GTPase superfamily, which comprise over 150 human members, with evolutionary conserved orthologs found throughout Eukaryota. The superfamily is divided into five major branches on the basis of sequence and functional similarities: Ras, Rho, Rab, Ran and Arf.
Small GTPases share a common biochemical mechanism and act as binary molecular switches, being active when GTP is bound and inactive when GDP is bound to the protein. Ras family proteins function as monomeric G proteins. Variations in structure and post-translational modifications that dictate specific subcellular locations and the proteins that serve as their regulators and effectors allow these small GTPases to function as sophisticated modulators of a remarkably complex and diverse range of cellular processes.
Rab GTPases are critical in membrane trafficking and insert into specific membrane compartments via their C-terminal geranylgeranyl lipidic tails, serving as organelle identity posts. In their GTP-bound (active) form, Rabs recruit multiple and specific effector proteins, which interact with membrane tethering (e.g., HOPS) and fusion factors (SNAREs), thereby serving a critical role in controlling the specific docking and fusion of membrane-bound compartments along the endocytic, recycling, exocytic, phagocytic and autophagic pathways.
In our research group at CCG-UNAM we use evolutionary, comparative, and functional genomics approaches to study Acanthamoeba castellanii and Stenotrophomonas spp. to identify essential loci and the molecular mechanisms that underlie the origins of complex phenotypes such as virulence, focusing on the cellular microbiology of Acanthamoeba-Stenotrophomonas interactions. We have recently found that Stenotrophomonas maltophilia, a globally emerging multidrug-resistant opportunistic pathogen, can use Acanthamoeba trophozoites as an alternative replication niche.
Using a combination of bioinformatics, molecular, and cell-biological approaches, we discovered that S. maltophilia is resistant to digestion by A. castellanii, a free-living amoeba that we use in our lab as a professional phagocyte model (Rivera et al. (2024). More specifically, we found that:
Figure from: Rivera et al. (2024). Phylogenomic, structural, and cell-biological analyses reveal that Stenotrophomonas maltophilia replicates in acidified Rab7A-positive vacuoles of Acanthamoeba castellanii. Microbiology Spectrum, 2024 Mar 5;12(3):e0298823.. doi: 10.1128/spectrum.02988-23
Figure from: Rivera et al. (2024). Phylogenomic, structural, and cell-biological analyses reveal that Stenotrophomonas maltophilia replicates in acidified Rab7A-positive vacuoles of Acanthamoeba castellanii. Microbiology Spectrum, 2024 Mar 5;12(3):e0298823.. doi: 10.1128/spectrum.02988-23.
In the following exercise, we will use \(blastp\) to study the large family of Rab GTPases encoded by the Acanthamoeba castellanii genome, using the well-annotated and canonical RAB7A_HUMAN protein as the query.
UniProt is the world’s leading high-quality, comprehensive, and freely accessible protein sequence and functional information resource. Explore this extraordinary resource and read the following paper from the UniProt Consortium - UniProt: the Universal Protein Knowledgebase in 2023.
You may also like to have a look at Lussi YC, Magrane M, Martin MJ, Orchard S; UniProt Consortium. Searching and Navigating UniProt Databases. Curr Protoc. 2023 Mar;3(3):e700. doi: 10.1002/cpz1.700. PMID: 36912607 and Zaru R, Orchard S; UniProt Consortium. UniProt Tools: BLAST, Align, Peptide Search, and ID Mapping. Curr Protoc. 2023 Mar;3(3):e697. doi: 10.1002/cpz1.697.
Let’s explore the source FASTA files starting with the A. castellanii Neff proteome (ACACA.faa), which will be used to generate the BLASTDB
$ grep '>' ACACA.faa | head
>tr|L8GC88|L8GC88_ACACA Protein kinase domain/MORN repeatcontaining protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_179960 PE=4 SV=1 >tr|L8GCQ0|L8GCQ0_ACACA Uncharacterized protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_180790 PE=4 SV=1 >tr|L8GD08|L8GD08_ACACA Uncharacterized protein (Fragment) OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_338290 PE=4 SV=1 >tr|L8GE04|L8GE04_ACACA Uncharacterized protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_102890 PE=4 SV=1 >tr|L8GE21|L8GE21_ACACA CUE domain containing protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_191030 PE=4 SV=1 >tr|L8GEJ3|L8GEJ3_ACACA Uncharacterized protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_351910 PE=4 SV=1 >tr|L8GEQ1|L8GEQ1_ACACA Transcription initiation factor iid, 31kd subunit protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_389090 PE=3 SV=1 >tr|L8GFB4|L8GFB4_ACACA Uncharacterized protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_121250 PE=4 SV=1 >tr|L8GG36|L8GG36_ACACA RapGAP/RanGAP domain containing protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_387100 PE=4 SV=1 >tr|L8GGJ9|L8GGJ9_ACACA Ubiquitin-specific protease, putative OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_190280 PE=4 SV=1
Note that the FASTA headers have a canonical structure and do not require any further editing before formatting with \(makeblastdb\).
makeblastdb -in ACACA.faa -dbtype prot -parse_seqids
ls -1
ACACA.faa ACACA.faa.pdb ACACA.faa.phr ACACA.faa.pin ACACA.faa.pog ACACA.faa.pos ACACA.faa.pot ACACA.faa.psq ACACA.faa.ptf ACACA.faa.pto DICDI.faa
blastp requires only two arguments to run: the -query
and the -db
, running under default
parameters.
blastp -query RAB7A_HUMAN.faa -db ACACA.faa | less
which produces the following (abbreviated/truncated) output, with the following sections:
BLASTP 2.14.1+ Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402. Reference for composition-based statistics: Alejandro A. Schaffer, L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001), "Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements", Nucleic Acids Res. 29:2994-3005. Database: ACACA.faa 14,938 sequences; 6,560,263 total letters Query= sp|P51149|RAB7A_HUMAN Ras-related protein Rab-7a OS=Homo sapiens OX=9606 GN=RAB7A PE=1 SV=1 Length=207 Score E Sequences producing significant alignments: (Bits) Value L8HHF3 Ras-related protein Rab-7A OS=Acanthamoeba castellanii str... 293 2e-102 L8GPH2 Rab7/RabGfamily small GTPase OS=Acanthamoeba castellanii s... 249 4e-85 L8GXM9 GTPbinding protein YPTC5, putative OS=Acanthamoeba castell... 190 7e-62 L8GRY8 Rab7/RabGfamily small GTPase OS=Acanthamoeba castellanii s... 180 4e-58 L8GLL4 RAB family member (Rab7), putative OS=Acanthamoeba castell... 166 2e-52 L8HGU8 Ras family protein OS=Acanthamoeba castellanii str. Neff O... 154 6e-48 L8HEH5 Rab7, putative OS=Acanthamoeba castellanii str. Neff OX=12... 149 8e-46 L8H231 GTPbinding protein OS=Acanthamoeba castellanii str. Neff O... 142 2e-43 L8HDY4 Ras subfamily protein OS=Acanthamoeba castellanii str. Nef... 137 4e-41 L8GSP6 Ras-related protein Rab-2A, putative OS=Acanthamoeba caste... 134 7e-40 L8GKG8 Rasrelated protein Rab5, putative OS=Acanthamoeba castella... 133 2e-39 L8H896 RAB1B, member RAS oncogene family OS=Acanthamoeba castella... 132 6e-39 ... TRUNCATED L8HKA3 HEAT repeat domain containing protein OS=Acanthamoeba cast... 26.6 7.0 L8GYP1 Ras subfamily protein OS=Acanthamoeba castellanii str. Nef... 26.2 7.3 L8GLL9 WD domain, G-beta repeat-containing protein OS=Acanthamoeb... 26.6 7.5 L8H6G5 Uncharacterized protein OS=Acanthamoeba castellanii str. N... 26.2 8.7 L8HK33 GTPase OS=Acanthamoeba castellanii str. Neff OX=1257118 GN... 26.2 9.5 L8H1F4 Myosin VIIa, putative OS=Acanthamoeba castellanii str. Nef... 26.2 9.6 L8GRW1 Rab-like protein 6 OS=Acanthamoeba castellanii str. Neff O... 26.2 9.7 L8HDD1 Cobalamin synthesis protein/P47K OS=Acanthamoeba castellan... 26.2 9.9 L8H2D9 ABC transporter, ATPbinding domain containing protein OS=A... 26.2 9.9 >L8HHF3 Ras-related protein Rab-7A OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_074580 PE=4 SV=1 Length=186 Score = 293 bits (749), Expect = 2e-102, Method: Compositional matrix adjust. Identities = 149/208 (72%), Positives = 166/208 (80%), Gaps = 23/208 (11%) Query 1 MTSRKKVLLKVIILGDSGVGKTSLMNQYVNKKFSNQYKATIGADFLTKEVMVDDRLVTMQ 60 M++RKKVLLK+IILGDSGVGKTSLMNQYVNKKFSNQYKATIGADFLTKEVMVDD+LVT+Q Sbjct 1 MSTRKKVLLKIIILGDSGVGKTSLMNQYVNKKFSNQYKATIGADFLTKEVMVDDKLVTLQ 60 Query 61 IWDTAGQERFQSLGVAFYRGADCCVLVFDVTAPNTFKTLDSWRDEFLIQASPRDPENFPF 120 IWDTAGQERFQSLGVAFYRGAD CVLVFDV A PRDPE FPF Sbjct 61 IWDTAGQERFQSLGVAFYRGADACVLVFDVN------------------AGPRDPETFPF 102 Query 121 VVLGNKIDLE-NRQVATKRAQAWCYSKNNIPYFETSAKEAINVEQAFQTIARNALKQETE 179 +VLGNKIDLE +R V+ KRAQ WC SK NIPYFETSAKEAINVEQAFQTIA+NA+K+E + Sbjct 103 IVLGNKIDLESSRVVSQKRAQTWCQSKGNIPYFETSAKEAINVEQAFQTIAKNAMKEEED 162 Query 180 VELYNEFPEPIKLDKNDRAKASAESCSC 207 V+L + I+LDK++ S CSC Sbjct 163 VDLAG-ITDAIQLDKSE---PSQSGCSC 186 ... TRUNCATED >L8GVA9 Fbox domain containing protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_042330 PE=4 SV=1 Length=226 Score = 27.7 bits (60), Expect = 2.3, Method: Compositional matrix adjust. Identities = 23/100 (23%), Positives = 40/100 (40%), Gaps = 15/100 (15%) Query 11 VIILGDSGVGKTSL-------MNQYVNKKFSNQYKAT----IGADFLTKEVMVDDRLVTM 59 +++LGD GVGKT+ + KK ++ G LT V + Sbjct 114 ILVLGDVGVGKTAYAERCLAGAQESTTKKKKGRFDMVALRGTGKHHLTLPTNYGP--VNI 171 Query 60 QIWDTAGQERFQSLGV--AFYRGADCCVLVFDVTAPNTFK 97 +WD R S GV G +++FD+ + +++ Sbjct 172 TLWDQPATSREISGGVRDELLAGGQAAIIMFDIASLQSYR 211 Lambda K H a alpha 0.318 0.132 0.386 0.792 4.96 Gapped Lambda K H a alpha sigma 0.267 0.0410 0.140 1.90 42.6 43.6 Effective search space used: 603304980 Database: ACACA.faa Posted date: Oct 7, 2023 9:42 AM Number of letters in database: 6,560,263 Number of sequences in database: 14,938 Matrix: BLOSUM62 Gap Penalties: Existence: 11, Extension: 1 Neighboring words threshold: 11 Window for multiple hits: 40
Note: The header and footer sections contain essential information (m, n, k, \(\lambda\)) required to compute the E-value (see Karlin-Altschul equation \(E = kmne^{-{\lambda}S}\), as detailed in a previous section).
Remember, this critical information can be easily extracted from the standard BLAST output using, for example, an AWK one-liner, such as the following one:
blastp -query RAB7A_HUMAN.faa -db ACACA.faa | awk '/Database/, /total letters/; /^Lambda/, /Extension:/{seen++; print}{ if(seen>5) exit}'
Database: ACACA.faa 14,938 sequences; 6,560,263 total letters Lambda K H a alpha 0.318 0.132 0.386 0.792 4.96 Gapped Lambda K H a alpha sigma 0.267 0.0410 0.140 1.90 42.6 43.6
Compute the P-value associated with the last alignment shown on the output above using R or a calculator.
The aim of this exercise is to identify A. castellanii Ras-superfamily members, using RAB7A_HUMAN as the query. Given the huge evolutionary distance separating humans from Amoebozoa (probably ~ 1.2 billion yrs), we are going to use the BLOSUM45 matrix to increase sensitivity.
Fig. 1 from Brocks et. al 2023. Nature 2023 Jun;618(7966):767-773. doi: 10.1038/s41586-023-06170-w.
Furthermore, since the Ras superfamily is expected to contain many members, possibly several hundred, we will request 300 hits using custom output fields in tabular format.
blastp -matrix BLOSUM45 -query RAB7A_HUMAN.faa -db ACACA.faa -outfmt '6 qseqid qlen sseqid slen stitle length pident gaps evalue bitscore qcovs' -num_alignments 300 > RAB7A_HUMAN_vs_ACACA_300hits.out
wc -l RAB7A_HUMAN_vs_ACACA_300hits.out
244
There are many (244) rows to explore. It is convenient to get simple summary statistics for key alignment parameters, such as: - E-value (column 9) - bit_score (column 10) - qcovs (column 11) - slen (column 4) - pident (column 7)
$HOME/bin
directory#!/usr/bin/env bash
# AUTHOR: Pablo Vinuesa. CCG-UNAM. @pvinmex
# little awk script to compute basic summary statistics
# receives the output of cut for a numeric column in a table
sort -n | awk '
BEGIN{ max = min = Mode = 'NaN' }
{
sum+=$1
a[x++]=$1
b[$1]++
if(b[$1]>hf){hf=b[$1]}
}
NR == 1 { min=$1; max=$1 }
$1 < min { min= $1 }
$1 > max { max= $1 }
END{ n = asort(a);idx=int((x+1)/2)
print "Min: " min
print "Mean: " sum/x
print "Median: " ((idx==(x+1)/2) ? a[idx] : (a[idx]+a[idx+1])/2)
for (i in b){if(b[i]==hf){(k=="") ? (k=i):(k=k FS i)}{FS=","}}
print "Mode: " k
print "Max: " max
}
'
~/$HOME/bin
directory to be able to use it as a system-wide
command/program.chmod 755 col_sumStats.sh && [ -d ~/$HOME/bin ] && mv col_sumStats.sh ~/$HOME/bin
Alternatively, you may fetch it with the following command from my GitHub repo directly into your
$HOME/bin
directory
# make sure that we have a bin directory in our home and cd into it
[ ! -d ~/bin ] && mkdir ~/bin
[ -d ~/bin ] && cd ~/bin
wget -c https://raw.githubusercontent.com/vinuesa/TIB-filoinfo/master/col_sumStats.sh
chmod 755 col_sumStats.sh
# takes you back again to your working directory
cd -
echo "# E-value"
cut -f 9 RAB7A_HUMAN_vs_ACACA_300hits.out | col_sumStats.sh
echo
echo "# bit score"
cut -f 10 RAB7A_HUMAN_vs_ACACA_300hits.out | col_sumStats.sh
echo
echo "# qcovs"
cut -f 11 RAB7A_HUMAN_vs_ACACA_300hits.out | col_sumStats.sh
echo
echo "# pident"
cut -f 7 RAB7A_HUMAN_vs_ACACA_300hits.out | col_sumStats.sh
echo
echo "# slen"
cut -f 4 RAB7A_HUMAN_vs_ACACA_300hits.out | col_sumStats.sh
# E-value Min: 6.25e-103 Mean: 0.250727 Median: 9.805e-15 Mode: 0.001 Max: 6.9 # bit score Min: 26.3 Mean: 72.9398 Median: 68.85 Mode: 27.5 Max: 292 # qcovs Min: 9 Mean: 68.6885 Median: 77 Mode: 78 Max: 100 # pident Min: 20.000 Mean: 31.7386 Median: 30.5675 Mode: 31.452 Max: 71.635 # slen Min: 94 Mean: 440.131 Median: 217.5 Mode: 193,195 Max: 2812
From the output above, we can conclude that there are many hits with poor E-values (>= 0.001) and scores (< 30). There is also great variation in the subject sequence length (slen) and query coverage (qcovs).
NOTE: if you did not place the
col_sumStats.sh script it to your $HOME/bin
directory, or any other one in the $PATH
, you will have to
execute the commands as follows:
echo "# E-value"
cut -f 9 RAB7A_HUMAN_vs_ACACA_300hits.out | ./col_sumStats.sh
...
The following AWK one-liner will print the the sequence length frequencies for those values found >=2 times, and provide a raw barplot for visual inspection.
awk -F"\t" '{l=$4; a[l]++; if (a[l]>max) max=a[l]} END {printf("Length\tFrequency"); for (i in a) if(a[i] >= 2){printf("\n%s\t%s\t",i,a[i]); for (j=0;j<(int(a[i]*(100/max)));j++) printf("*")} print ""}' RAB7A_HUMAN_vs_ACACA_300hits.out
Length Frequency 178 2 ********************************* 179 3 ************************************************** 180 3 ************************************************** 181 4 ****************************************************************** 184 5 *********************************************************************************** 186 5 *********************************************************************************** 188 3 ************************************************** 190 4 ****************************************************************** 193 6 **************************************************************************************************** 194 2 ********************************* 195 6 **************************************************************************************************** 196 3 ************************************************** 197 3 ************************************************** 198 3 ************************************************** 199 4 ****************************************************************** 200 3 ************************************************** 202 2 ********************************* 203 3 ************************************************** 204 3 ************************************************** 205 2 ********************************* 206 5 *********************************************************************************** 207 2 ********************************* 208 3 ************************************************** 209 2 ********************************* 210 3 ************************************************** 211 3 ************************************************** 213 3 ************************************************** 216 5 *********************************************************************************** 217 2 ********************************* 218 2 ********************************* 220 2 ********************************* 221 3 ************************************************** 223 2 ********************************* 228 4 ****************************************************************** 244 3 ************************************************** 264 2 ********************************* 266 2 ********************************* 273 2 ********************************* 421 2 ********************************* 514 2 ********************************* 558 2 ********************************* 578 2 ********************************* 988 2 *********************************
From the output above, it seems that most homologs fall in the length range between 178 and 250 residues.
R is ‘GNU S’, a freely available language and environment for statistical computing and graphics that provides various statistical and graphical techniques: linear and nonlinear modeling, statistical tests, time series analysis, classification, clustering, etc.
CRAN is a network of FTP and web servers around the world that store identical, up-to-date versions of code and documentation for R. Currently, over 20000 contributed packages can be freely downloaded to boost your R coding and analysis capacity to unlimited heights.
R is currently the mainstream language and environment for statistical computing and graphics used by statisticians, data scientists, and researchers. Here we will be using only straightforward base R to make it easy for you to run the simple code examples.
# call R
R
# read data
dfr <- read.csv(file = "RAB7A_HUMAN_vs_ACACA_300hits.out", sep = "\t", header=FALSE)
# explore the structure of the object dfr
str(dfr)
# plot ACACA blastp subject length density
hist(dfr$V4, breaks = "Freedman-Diaconis", xlim = c(0, 3000), probability=TRUE, xlab="length", main="ACACA blastp subject length density")
abline(v = median(dfr$V4), col = "blue", lwd = 3)
abline(v = mean(dfr$V4), col = "red", lwd = 3)
lines(density(dfr$V4), col = 'green', lwd = 3)
dev.off()
which generates the following graph:
Figure 4. Histogram and density plot of RAB7A_HUMAN blastp hit lengths in the A. castellanii proteome. The median and mean lengths are highlighted as blue and red vertical bars, respectively.
eval ~ score
)It is interesting to analyze the distribution of E-values as a
function of bit scores (eval ~ score) to find the minimal alignment
scores required for a specific E-value cutoff, like 0.001. This can be
easily achieved calling base R’s plot()
function.
# Another interesting plot is that of E-values ~ bit-scores
plot(dfr$V9 ~ dfr$V10, main ="E-values ~ bitscores", ylab="E-values", xlab="bitscores")
dev.off()
# let's zoom-in by sub-setting the dfr for E-values <= 0.2 and scores <= 50, making use of the subset() function
dfr2 <- subset(dfr, V9 <= 0.2 & V10 < 50)
plot(dfr2$V9 ~ dfr2$V10, main ="E-values ~ bitscores", ylab="E-values", xlab="bitscores")
abline(h=0.001, col="red")
dev.off()
which generates the following graphs:
Figure 5. E-values as a function of scores
Figure 6. E-value ~ bitscores (zoom-in)
The latter figure shows that E-values remain below 0.001 for scores > ~38, which may be likely Ras superfamily homologs.
e2p()
functionUsing R and the “RAB7A_HUMAN_vs_ACACA_300hits.out” blastp
output generated in this section, we will add a column to the data frame
with the computed P-values for each E-value using the user-defined
function e2p()
. Then, we will generate a simple scatter
plot of E-values vs. P-values (Eval ~ Pval).
# 3. >>> Analyze the relation of E-values ~ P-values <<<
# 3.1 define the e2p function to compute the P-values from provided E-values in the x vector
# P-val=1-exp(-E)
e2p <- function(x) {
# compute the P-values from provided E-values in the x vector
# using the formula: P-val=1-exp(-E)
1-exp(-x)
}
# add a new variable V12 (P-val) to the dataframe
# which will hold the P-values computed by e2p
# from the E-values stored in V9
dfr$V12<- e2p(dfr$V9)
# zoom-in by sub-setting the dfr for E-values <= 0.001 and P-values <= 0.001
# making use of the subset() function.
# The following analysis
dfr2 <- subset(dfr, V9 <= 0.001 & V12 < 0.001)
c <- round(cor(dfr2$V9, dfr2$V12),3)
header <- paste0("E-values ~ P-values; ", "corr: ", c)
plot(dfr2$V9 ~ dfr2$V12, main ="E-values ~ P-values", ylab="E-values", xlab="P-values")
fit <- lm(dfr2$V9 ~ dfr2$V12)
abline(fit)
dev.off()
summary(fit)
# plot E-values ~ P-values
plot(dfr$V9 ~ dfr$V12, main ="E-values ~ P-values", ylab="E-values", xlab="P-values")
dev.off()
# plot two plots in one row
par(mfrow=c(1,2))
plot(dfr2$V9 ~ dfr2$V12, main ="E-values ~ P-values zoom-in; corr = 1", ylab="E-values", xlab="P-values")
abline(fit)
plot(dfr$V9 ~ dfr$V12, main ="E-values ~ P-values", ylab="E-values", xlab="P-values")
abline(fit)
dev.off()
par(opar)
# save to file
png(filename = "Eval_vs_Pval_with_zoomin_1x2.png", width = 960, height = 480)
par(mfrow=c(1,2))
plot(dfr2$V9 ~ dfr2$V12, main ="E-values ~ P-values zoom-in", ylab="E-values", xlab="P-values")
abline(fit)
plot(dfr$V9 ~ dfr$V12, main ="E-values ~ P-values", ylab="E-values", xlab="P-values")
abline(fit)
dev.off()
par(opar)
# quit
q(save="no")
summary()
function
for the fitted linear model summary(fit)
for the subsetted
dataset, corroborating our previous statement that at E-values < 0.01
the P-values are essentially equal to E-values.Call: lm(formula = dfr2$V9 ~ dfr2$V12) Residuals: Min 1Q Median 3Q Max -9.492e-08 2.572e-09 2.572e-09 2.572e-09 5.861e-08 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -2.572e-09 1.235e-09 -2.082 0.0386 * dfr2$V12 1.000e+00 9.285e-06 107747.141 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.735e-08 on 203 degrees of freedom Multiple R-squared: 1, Adjusted R-squared: 1 F-statistic: 1.161e+10 on 1 and 203 DF, p-value: < 2.2e-16
E-value ~
P-value
of the subsetted data frame (Eval < 0.01 &&
P-val < 0.01) and the full data frame values, with the fitted linear
model on the subsetted data added to both graphs.
From the previous exploratory data analyses, we can conclude that the hits for likely canonical small Rab-GTPase family homologs should meet the following (relaxed) attributes:
- slen >= 178 && slen < 270 - E-value < 1e-10 - bit-score > 70 - qcovs > 72 - pident > 28
Filtering the table according to those criteria yields 80 hits:
awk 'BEGIN{FS=OFS="\t"; print "sseqid\tslen\tpident\tgaps\tevalue\tbitscore\tqcovs"} ($4 > 178 && $4 < 270) && $7 >= 28 && $9 < 1e-3 && $10 > 70 && $11 > 72 {print $3,$4,$7,$8,$9,$10,$11}' RAB7A_HUMAN_vs_ACACA_300hits.out | column -t -s $'\t'
sseqid slen pident gaps evalue bitscore qcovs tr|L8HHF3|L8HHF3_ACACA 186 71.635 23 6.25e-103 292 100 tr|L8GPH2|L8GPH2_ACACA 205 58.537 2 1.20e-86 251 99 tr|L8GXM9|L8GXM9_ACACA 197 60.736 2 5.09e-64 193 78 tr|L8GRY8|L8GRY8_ACACA 186 47.343 28 1.02e-58 179 98 tr|L8GLL4|L8GLL4_ACACA 206 50.000 4 4.58e-54 168 78 ... TRUNCATED tr|L8GLE5|L8GLE5_ACACA 193 28.302 3 9.76e-18 74.2 77 tr|L8H8M5|L8H8M5_ACACA 228 28.986 27 1.72e-17 74.2 90 tr|L8GIH3|L8GIH3_ACACA 198 29.944 11 3.36e-17 73.0 86 tr|L8GIM0|L8GIM0_ACACA 203 34.177 7 9.07e-17 71.9 75 tr|L8HCN9|L8HCN9_ACACA 194 31.447 32 1.04e-16 71.6 77
To further validate the selected hits as canonical small Rab-GTPases, these should be aligned and subjected to phylogenetic analysis along with reference sequences. We will perform this analysis in later sections of the course.
The previous exercise identified a set of Rab-GTPase homologs from the A. castellanii proteome. Our next aim is to identify reciprocal best-hits or RBHs between a set of human reference Rabs retrieved from UNIPROT and saved in file HUMAN_63Rabs_Ran.faa and the A. castellanii proteome. The resulting RBHs are likely orthologs; therefore, we could transfer the annotation of the HUMAN Rabs to their Acanthamoeba RBHs.
The bioinformatics strategy to find RBHs:
Use symlinks to avoid unnecessary file copying to preserve precious disk space whenever possible.
mkdir RBHs
cd RBHs/
# make symlinks whenever possible, to avoid unnecessary file copying to avoid filling the server's disks
ln -s ../ACACA.faa .
ln -s ../ACACA.faa.* .
ln -s ../HUMAN_63Rabs_Ran.faa .
# set BLAST and filtering variables
mat=BLOSUM45
cols='6 qseqid sseqid pident gaps length qlen slen qcovs evalue score'
min_qcov=80
blastp -query HUMAN_63Rabs_Ran.faa -db ACACA.faa -matrix "$mat" -outfmt "$cols" -num_alignments 1 > HUMAN_63Rabs_vs_ACACA_besthits_m6.out
Note the following syntactic details:
<(your code here)
awk -F”\t” -v qcov=$min_qcov ‘$8 >= qcov{print $2}’
HUMAN_63Rabs_vs_ACACA_besthits_m6.out
to filter out
Acanthamoeba hits with qcov >=$min_qcov
.blastdbcmd -entry_batch <(awk -F"\t" -v qcov=$min_qcov '$8 > qcov{print $2}' HUMAN_63Rabs_vs_ACACA_besthits_m6.out) -db ACACA.faa > ACACA_vs_HUMAN_best_1hits.faa
makeblastdb -in HUMAN_63Rabs_Ran.faa -dbtype prot -parse_seqids
blastp -query ACACA_vs_HUMAN_best_1hits.faa -db HUMAN_63Rabs_Ran.faa -matrix "$mat" -outfmt "$cols" -num_alignments 1 > ACACA_vs_HUMAN_best_1hit_m6.out
# The following lines extract a sorted list (n=54) of unique Acanthamoeba identifiers
# found in the ACA_vs_HUMAN and HUMAN_vs_ACA searches
cut -f1 ACACA_vs_HUMAN_best_1hit_m6.out | sort | uniq -c
cut -f1 ACACA_vs_HUMAN_best_1hit_m6.out | sort -u
# This code greps out the sorted and unique ACACA hits from the ACACA_vs_HUMAN_best_1hit_m6.out table
while read ACAid; do grep -w "$ACAid" HUMAN_63Rabs_vs_ACACA_besthits_m6.out; done < <(cut -f1 ACACA_vs_HUMAN_best_1hit_m6.out | sort -u)
# Same as previous line, but sorting the BLAST output table by increasing E-values (in column 9; sort -gk9,9)
# output displayed below
while read ACAid; do grep -w "$ACAid" HUMAN_63Rabs_vs_ACACA_besthits_m6.out; done < <(cut -f1 ACACA_vs_HUMAN_best_1hit_m6.out | sort -u) | sort -gk9,9 | column -t
while read ACAid; do grep -w "$ACAid" HUMAN_63Rabs_vs_ACACA_besthits_m6.out; done < <(cut -f1 ACACA_vs_HUMAN_best_1hit_m6.out | sort -u) | sort -gk9,9 > HUMAN_63Rabs_vs_ACACA_besthits_m6.out.sorted
sp|P62826|RAN_HUMAN tr|L8GF16|L8GF16_ACACA 82.524 0 206 216 213 95 7.64e-129 1259 sp|P61106|RAB14_HUMAN tr|L8H946|L8H946_ACACA 73.953 2 215 215 213 100 2.88e-117 1154 sp|P62491|RB11A_HUMAN tr|L8HJ43|L8HJ43_ACACA 75.829 0 211 216 213 98 9.90e-117 1150 sp|Q15907|RB11B_HUMAN tr|L8HJ43|L8HJ43_ACACA 75.463 8 216 218 213 98 2.83e-113 1119 sp|Q8WUD1|RAB2B_HUMAN tr|L8H2N2|L8H2N2_ACACA 70.046 2 217 216 216 100 4.48e-110 1090 sp|P61019|RAB2A_HUMAN tr|L8H2N2|L8H2N2_ACACA 69.585 6 217 212 216 100 2.42e-109 1083 sp|P20340|RAB6A_HUMAN tr|L8GN15|L8GN15_ACACA 81.287 0 171 208 195 82 1.25e-104 1037 sp|Q9NRW1|RAB6B_HUMAN tr|L8GN15|L8GN15_ACACA 75.843 0 178 208 195 86 1.36e-103 1028 sp|P51149|RAB7A_HUMAN tr|L8HHF3|L8HHF3_ACACA 71.635 23 208 207 186 100 6.25e-103 1021 sp|P61018|RAB4B_HUMAN tr|L8HCK2|L8HCK2_ACACA 65.888 11 214 213 204 100 2.29e-102 1018 sp|P20338|RAB4A_HUMAN tr|L8HCK2|L8HCK2_ACACA 65.421 11 214 218 204 98 9.55e-99 986 sp|Q13637|RAB32_HUMAN tr|L8HAC7|L8HAC7_ACACA 68.681 2 182 225 220 80 9.57e-94 943 sp|P57729|RAB38_HUMAN tr|L8HAC7|L8HAC7_ACACA 69.886 1 176 211 220 83 1.31e-93 940 sp|P57735|RAB25_HUMAN tr|L8HJ43|L8HJ43_ACACA 59.330 4 209 213 213 96 2.16e-91 920 sp|P62820|RAB1A_HUMAN tr|L8HD31|L8HD31_ACACA 69.412 0 170 205 210 83 4.80e-91 915 sp|Q9H0U4|RAB1B_HUMAN tr|L8HD31|L8HD31_ACACA 69.412 0 170 201 210 85 6.47e-91 914 sp|Q92928|RAB1C_HUMAN tr|L8HD31|L8HD31_ACACA 66.864 0 169 201 210 84 5.35e-85 860 ... TRUNCATED
Note: if you need help with AWK and Linux for bioinformatics, visit my extensive AWK tutorial for bioinformatics here
awk -v qcov="$min_qcov" 'BEGIN{FS=OFS="\t"}{HUMANid[$1]++; ACAid[$2]++; if(HUMANid[$1] == 1 && ACAid[$2] == 1 && $8 >= qcov) print }' HUMAN_63Rabs_vs_ACACA_besthits_m6.out.sorted | column -t
awk -v qcov="$min_qcov" 'BEGIN{FS=OFS="\t"}{HUMANid[$1]++; ACAid[$2]++; if(HUMANid[$1] == 1 && ACAid[$2] == 1 && $8 >= qcov) print }' HUMAN_63Rabs_vs_ACACA_besthits_m6.out.sorted > HUMAN_63Rabs_vs_ACACA_RBHs.tsv
qcov >=
$min_qcov
.
sp|P62826|RAN_HUMAN tr|L8GF16|L8GF16_ACACA 82.524 0 206 216 213 95 7.64e-129 1259 sp|P61106|RAB14_HUMAN tr|L8H946|L8H946_ACACA 73.953 2 215 215 213 100 2.88e-117 1154 sp|P62491|RB11A_HUMAN tr|L8HJ43|L8HJ43_ACACA 75.829 0 211 216 213 98 9.90e-117 1150 sp|Q8WUD1|RAB2B_HUMAN tr|L8H2N2|L8H2N2_ACACA 70.046 2 217 216 216 100 4.48e-110 1090 sp|P20340|RAB6A_HUMAN tr|L8GN15|L8GN15_ACACA 81.287 0 171 208 195 82 1.25e-104 1037 sp|P51149|RAB7A_HUMAN tr|L8HHF3|L8HHF3_ACACA 71.635 23 208 207 186 100 6.25e-103 1021 sp|P61018|RAB4B_HUMAN tr|L8HCK2|L8HCK2_ACACA 65.888 11 214 213 204 100 2.29e-102 1018 sp|Q13637|RAB32_HUMAN tr|L8HAC7|L8HAC7_ACACA 68.681 2 182 225 220 80 9.57e-94 943 sp|P62820|RAB1A_HUMAN tr|L8HD31|L8HD31_ACACA 69.412 0 170 205 210 83 4.80e-91 915 sp|Q9NP72|RAB18_HUMAN tr|L8GYY7|L8GYY7_ACACA 58.163 2 196 206 209 95 4.10e-80 816 sp|P51151|RAB9A_HUMAN tr|L8GPH2|L8GPH2_ACACA 51.961 3 204 201 205 100 1.18e-76 784 sp|P51153|RAB13_HUMAN tr|L8HGZ3|L8HGZ3_ACACA 58.282 6 163 203 208 80 1.16e-71 739 sp|Q14964|RB39A_HUMAN tr|L8GSP6|L8GSP6_ACACA 51.124 6 178 217 200 82 8.44e-61 641 sp|Q13636|RAB31_HUMAN tr|L8GKG8|L8GKG8_ACACA 46.734 9 199 194 218 98 2.73e-58 618 sp|Q969Q5|RAB24_HUMAN tr|L8H2Q9|L8H2Q9_ACACA 50.549 18 182 203 211 83 6.20e-56 597
This final code runs all the filtering steps presented above for the two blastp output tables using a single neat one-liner
while read ACAid; do grep -w "$ACAid" HUMAN_63Rabs_vs_ACACA_besthits_m6.out; done < <(cut -f1 ACACA_vs_HUMAN_best_1hit_m6.out | sort -u) | sort -gk9,9 | awk -v qcov="$min_qcov" 'BEGIN{FS=OFS="\t"}{HUMANid[$1]++; ACAid[$2]++; if(HUMANid[$1] == 1 && ACAid[$2] == 1 && $8 >= qcov) print}' | column -t
summary())
on the
ACACA_vs_HUMAN_RBHs_m6.out output file to explore summary
statistics of Acanthamoeba vs. human RBHs. What is the observed
range of alignment length (length), pident,
E-value, and score for these Acanthamoeba
Rabs?In the previous exercise we learned how to identify potential orthologs among pairs of proteomes by implementing a standard reciprocal best hits (RBHs) approach. In this section, we will use the compute_RBH_clusters.sh Bash script to compute orthologous protein clusters automatically, saving them to disk as FASTA files.
The compute_RBH_clusters.sh script extends the approach demonstrated in previous sections, computing RBHs between a set of proteomes (protein FASTA files) and a single reference proteome, which is automatically selected as the smallest one, or defined by the user. The computed RBH clusters (FASTA FILES) are provided as core_clusters and nonCore_clusters.
The latest version of the script is freely available from my GitHub repository.
~/bin
directory with the following commands:# cd into ~/bin
if [ -d ~/bin ]; then cd ~/bin; else mkdir ~/bin && cd ~/bin; fi
# fetch script and make it executable
wget -c https://raw.githubusercontent.com/vinuesa/TIB-filoinfo/master/compute_RBH_clusters.sh
chmod 755 compute_RBH_clusters.sh
# return to our working directory
cd -
compute_RBH_clusters.sh
As seen in the output shown above, the user can customize the
script’s parameters, but the defaults have been carefully chosen to be
broadly useful. Depending on the available computing resources, you may
want to increase the number of threads to use for the blastp
processes with the -t
option.
To finish our study of Rab GTPase orthologs, we will use compute_RBH_clusters.sh to RBH clusters as FASTA files with the following codes, assuming that we are in the UniProt_proteomes directory.
We will run under the default parameterization displayed in the previous section.
# assuming that we are in the UniProt_proteomes directory, make the RBHs subdirectory and cd into it
mkdir RAB_RBHs && cd RAB_RBHs
# generate symlinks to the following source FASTA files
ln -s ../HUMAN_63Rabs_Ran.faa .
ln -s ../ACACA.faa .
ln -s ../DICDI.faa .
# run compute_RBH_clusters.sh under default parameters
compute_RBH_clusters.sh -R 1 -d .
The image above displays the messages that the script prints to STDOUT. A brief description of each of the numbered sections (1-7) is provided below. along with relevant code snippets from the script:
...TRUNCATED function select_ref { # Automatically selects the smallest genome as the reference. local ext ext=$1 # find the genome with the smallest number of gene|protein sequences for f in *."${ext}" do echo -ne "$f\t" grep -c '>' "$f" done | sort -k2g | awk 'NR == 1{print $1}' } ...TRUNCATED print_start_time && echo '# Selecting the smallest genome as the reference' # automatically select the smallest reference, if not provided as ARG if [[ -z "$ref" ]]; then ref=$(select_ref "$ext") echo " - Selected $ref as the reference genome" fi
# ------------------------------------------------------------------- # 2. run makeblastdb on all input proteomes with edited FASTA headers # ------------------------------------------------------------------- # Each proteome is used to create a blastp database using makeblastdb. print_start_time && echo "# Generating indexed blastp databases" for f in *."${ext}" do makeblastdb -in "$f" -dbtype prot -parse_seqids &> /dev/null done print_start_time && echo "# Generating the aliased blastp database allDBs ..." blastdb_aliastool -dblist_file <(printf '%s\n' "${infiles[@]}") -dbtype prot -out allDBs -title allDBs printf '%s\n' '-----------------------------------------------------------------------------------------------------'
function run_blastp { local task q db outfile task=$1 # default: blastp-fast q=$2 db=$3 outfile=$4 # parameterizations in part following suggestions by Julie E. Hernandez-Salmeron�and Gabriel Moreno-Hagelsieb # in BMC Genomics volume 21, Article number: 741 (2020); https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-07132-6 blastp -task "$task" -query "$q" -db "$db" -matrix "$mat" -outfmt "$cols" -num_alignments $num_aln -num_threads $threads -qcov_hsp_perc "$qcov" \ -seg "$seg" -soft_masking "$mask" -best_hit_overhang "$best_hit_overhang" -best_hit_score_edge "$best_hit_score_edge" -use_sw_tback \ -evalue "$Eval" -out "$outfile" 2> /dev/null check_file "$outfile" }
The final step in this section is to sort each output table by increasing E-value and unique REF vs nonREF hits using AWK hashes, which will be the RBHs, as we have seen in detail in section 6.3.1.
#----------------------------------- # 3. Run and process pairwise blastp #----------------------------------- # For each non-reference proteome, blastp is run against the reference proteome # (REFvsGENO) and vice versa (GENOvsREF). for f in "${non_ref[@]}"; do ref_vs_geno_blastout=${ref%.*}vs${f%.*}_best_hits.tmp geno_vs_ref_blastout=${f%.*}vs${ref%.*}_best_hits.tmp print_start_time && echo "# Running: run_blastp $task ${ref}ed ${f}ed $ref_vs_geno_blastout" run_blastp "$task" "${ref}"ed "${f}"ed "$ref_vs_geno_blastout" # Retrieve the best nonREF proteome database hits using blastdbcmd, onlfy if qcov > \$qcov print_start_time && echo "# Retrieving the best hits from $ref_vs_geno_blastout with blastdbcmd ... " blastdbcmd -entry_batch <(awk -F"\t" -v qcov="$qcov" '$8 > qcov{print $2}' "$ref_vs_geno_blastout" | sort -u) -db "${f}" > "${ref%.*}vs${f%.*}"_besthits.faa check_file "${ref%.*}vs${f%.*}"_besthits.faa num_hits=$(grep -c '^>' "${ref%.*}vs${f%.*}"_besthits.faa) if ((num_hits == 0)) then echo "WARNING: no hits in ${ref%.*}vs${f%.*}_besthits.faa" rm "${ref%.*}vs${f%.*}"_besthits.faa continue fi print_start_time && printf '%s\n' "# Running: run_blastp $task ${ref%.*}vs${f%.*}_besthits.faa ${ref}ed $geno_vs_ref_blastout ..." run_blastp "$task" "${ref%.*}vs${f%.*}"_besthits.faa "${ref}"ed "$geno_vs_ref_blastout" # Sort the blastp output table from the preceding search by increasing E-values (in column 9) and decreasing scores (col 10) # & filter out unique REF vs nonREF RBHs using AWK hashes from the sorted blast output table with qcov > $qcov print_start_time && echo "# Filtering out unique REF vs nonREF RBHs from the sorted blast output table with qcov > $qcov" for GENOid in $(cut -f1 "$geno_vs_ref_blastout" | sort -u) do grep "$GENOid" "$ref_vs_geno_blastout" done | sort -gk9,9 -gk10,10 | \ awk -v qcov="$qcov" 'BEGIN{FS=OFS="\t"}{REFid[$1]++; GENOid[$2]++; if(REFid[$1] == 1 && GENOid[$2] == 1 && $8 > qvov) print }' > \ "${ref_vs_geno_blastout%.*}"_RBHs_qcov_gt"${qcov}".tsv check_file "${ref_vs_geno_blastout%.*}"_RBHs_qcov_gt"${qcov}".tsv done printf '%s\n' '-----------------------------------------------------------------------------------------------------'
#----------------------------------------------------------------------- # 4. Identify REF proteins shared by all tsv files holding pairwise RBHs #----------------------------------------------------------------------- # Find the intersections of REFs in all tsv files print_start_time && echo "# Computing the intersections of REF proteins in all tsv files holding pairwise RBHs ... " awk '{r[$1]++; if(r[$1] == ARGC-1) print $1}' ./*.tsv > REF_RBH_IDs.list [[ ! -s REF_RBH_IDs.list ]] && error "could not write REF_RBH_IDs.list" intersection_size=$(wc -l REF_RBH_IDs.list | awk '{print $1}') ((intersection_size > 0)) && echo -e "${LBLUE} Found $intersection_size core RBHs shared by ${#non_ref[*]} nonREF proteomes with the $ref reference proteome${NC}" ((intersection_size == 0)) && error "# ERROR: found $intersection_size core orhtologous genes among ${#infiles[*]} input proteomes ..." printf '%s\n' '-----------------------------------------------------------------------------------------------------'
#------------------------------------------------------------------------------------------------------------------------- # 5. Loop over tsv files and generate RBHs_matrix.tsv core_genome_clusters.tsv and nonCore_genome_clusters.tsv tables #------------------------------------------------------------------------------------------------------------------------- # Cluster Computation print_start_time && echo "# Computing clusters of homologous sequences ..." # The core_ref hash counts the instances of the REFERNCE_IDs in the RBH tables declare -A core_ref core_ref=() # The all_clusters hash is indexed by REFERNCE_IDs # and as its value holds the RBHs as a tab-separated # string of IDs from nonREF proteomes declare -A all_clusters all_clusters=() # Construct the all_clusters hash, indexed by reference proteome, # containing a value a string of tab-separated nonREF proteome RBH IDs. print_start_time && echo "# Populating the all_clusters hash ..." for t in *RBHs_*.tsv; do [ ! -s "$t" ] && error "file: $t does not exist or is empty" while read -r REF QUERY rest do # count the instances of REFERNCE_IDs in each RBHs_*.tsv tables (( core_ref["$REF"]++ )) if (( ${core_ref["$REF"]} == 1 )) then all_clusters["$REF"]="$QUERY" else all_clusters["$REF"]="${all_clusters[$REF]}\t$QUERY" fi done < "$t" || { echo "Failed to process file: $t"; exit 1; } # required test for set -e compliance done # 5.1 print the RBHs_matrix.tsv print_start_time && echo "# Printing the pangenome_matrix, core_genome_clusters, and nonCore_genome_clusters files ..." for key in "${!all_clusters[@]}" do echo -e "${key}\t${all_clusters[$key]}" done > RBHs_matrix.tsv check_file RBHs_matrix.tsv # 5.2 print the core_genome_clusters.tsv awk -v ninfiles="${#infiles[@]}" 'NF == ninfiles' RBHs_matrix.tsv > core_genome_clusters.tsv check_file core_genome_clusters.tsv # 5.3 print the nonCore_genome_clusters.tsv awk -v ninfiles="${#infiles[@]}" 'NF != ninfiles' RBHs_matrix.tsv > nonCore_genome_clusters.tsv check_file nonCore_genome_clusters.tsv warn printf '%s\n' '-----------------------------------------------------------------------------------------------------'
#----------------------------- # 6. Write cluster FASTA files #----------------------------- # 6.1 uses blastdbcmd but calling it in parallel with xargs # write the each line of the RBHs_matrix.tsv IDs to a tmpfile # to pass the list of tmpfiles to a parallel call of blastdbcmd print_start_time && echo "# Writing each line of the RBHs_matrix.tsv IDs to an idstmp file por parallelization ..." #initialize cluster counter c=0 while read -r -a ids do ((c++)) # write each line of the RBHs_matrix.tsv to a temporal file printf '%s\n' "${ids[@]}" > cluster_"${c}".idstmp done < RBHs_matrix.tsv || { echo "Failed to process file: RBHs_matrix.tsv"; exit 1; } # required test for set -e compliance # 6.2 Pass the list of tmpfiles to a parallel call of blastdbcmd, or, if not available, call it from xargs if command -v parallel &> /dev/null then print_start_time && echo " - Running blastdbcmd through parallel with -j \$(nproc) ..." find . -name '*.idstmp' | parallel --gnu -j $(nproc) 'blastdbcmd -db allDBs -dbtype prot -entry_batch {} -out {.}.fas' else # use the more portable find | xargs idiom to parallelize the blastdbcmd call of parallel is not found on host print_start_time && echo " - Running blastdbcmd in parallel with xargs using all available cores \$(nproc) ..." find . -name '*.idstmp' -print0 | xargs -0 -P $(nproc) -I % blastdbcmd -db allDBs -dbtype prot -entry_batch % -out %.fas # rename *.idstmp.fas cluster files with rename, if available if command -v rename &> /dev/null; then rename 's/\.idstmp//' *.fas fi fi # 6.3 filter out core and nonCore clusters print_start_time && echo "# Moving core and nonCore clusters to their respective directories ..." mkdir core_clusters || error "could not mkdir core_clusters" mkdir nonCore_clusters || error "could not mkdir nonCore_clusters" for f in cluster_*.fas do if [[ $(grep -c '^>' "$f") -eq "${#infiles[@]}" ]] then mv "$f" core_clusters/core_"${f}" || error "could not mv $f to core_clusters/core_${f}" else mv "$f" nonCore_clusters/nonCore_"${f}" || error "could not mv $f to nonCore_clusters/nonCore_${f}" fi done printf '%s\n' '-----------------------------------------------------------------------------------------------------'
Let us now explore the files generated by the script, starting by a listing of the directory’s contents:
# list the directory contentes
ls -ltr
lrwxrwxrwx 1 vinuesa vinuesa 23 oct 8 19:38 HUMAN_63Rabs_Ran.faa -> ../HUMAN_63Rabs_Ran.faa lrwxrwxrwx 1 vinuesa vinuesa 12 oct 8 19:38 ACACA.faa -> ../ACACA.faa lrwxrwxrwx 1 vinuesa vinuesa 12 oct 22 18:04 DICDI.faa -> ../DICDI.faa -rw-rw-r-- 1 vinuesa vinuesa 699 oct 27 18:58 HUMAN_63Rabs_RanvsACACA_best_hits_RBHs_qcov_gt60.tsv.gz -rw-rw-r-- 1 vinuesa vinuesa 767 oct 27 18:58 HUMAN_63Rabs_RanvsDICDI_best_hits_RBHs_qcov_gt60.tsv.gz -rw-rw-r-- 1 vinuesa vinuesa 1464 oct 27 18:58 RBHs_matrix.tsv -rw-rw-r-- 1 vinuesa vinuesa 800 oct 27 18:58 nonCore_genome_clusters.tsv -rw-rw-r-- 1 vinuesa vinuesa 664 oct 27 18:58 core_genome_clusters.tsv drwxrwxr-x 2 vinuesa vinuesa 4096 oct 27 18:58 core_clusters drwxrwxr-x 2 vinuesa vinuesa 4096 oct 27 18:58 nonCore_clusters
column -t RBHs_matrix.tsv
sp|Q9NP72|RAB18_HUMAN tr|L8GYY7|L8GYY7_ACACA sp|Q54GY8|RAB18_DICDI sp|Q15286|RAB35_HUMAN sp|P34140|RAB1B_DICDI sp|P61019|RAB2A_HUMAN sp|P36409|RAB2A_DICDI sp|Q9ULC3|RAB23_HUMAN sp|Q559X6|RAB2B_DICDI sp|P61006|RAB8A_HUMAN sp|P20790|RAB8A_DICDI sp|Q15907|RB11B_HUMAN sp|P36412|RB11A_DICDI sp|Q9H082|RB33B_HUMAN sp|O76173|RAB1C_DICDI sp|P62826|RAN_HUMAN tr|L8GF16|L8GF16_ACACA sp|P33519|RAN_DICDI sp|Q9NX57|RAB20_HUMAN sp|Q54RX9|RAB5B_DICDI sp|P57729|RAB38_HUMAN tr|L8HAC7|L8HAC7_ACACA sp|Q54QR3|RB32A_DICDI sp|Q9UL25|RAB21_HUMAN sp|P34142|RAB21_DICDI sp|P61106|RAB14_HUMAN tr|L8H946|L8H946_ACACA sp|P36410|RAB14_DICDI sp|P62491|RB11A_HUMAN tr|L8HJ43|L8HJ43_ACACA sp|P20339|RAB5A_HUMAN sp|Q86JP3|RAB5A_DICDI sp|P62820|RAB1A_HUMAN tr|L8HD31|L8HD31_ACACA sp|P34139|RAB1A_DICDI sp|P51151|RAB9A_HUMAN tr|L8GPH2|L8GPH2_ACACA sp|Q8WUD1|RAB2B_HUMAN tr|L8H2N2|L8H2N2_ACACA sp|Q14964|RB39A_HUMAN tr|L8GSP6|L8GSP6_ACACA sp|Q96AX2|RAB37_HUMAN sp|P20791|RAB8B_DICDI sp|Q969Q5|RAB24_HUMAN tr|L8H2Q9|L8H2Q9_ACACA sp|Q55FU9|RAB24_DICDI sp|O95755|RAB36_HUMAN tr|L8H339|L8H339_ACACA sp|Q13636|RAB31_HUMAN tr|L8GKG8|L8GKG8_ACACA sp|Q6IQ22|RAB12_HUMAN tr|L8GIC5|L8GIC5_ACACA sp|Q54NU2|RAB1D_DICDI sp|P61018|RAB4B_HUMAN tr|L8HCK2|L8HCK2_ACACA sp|Q54DA7|RAB4_DICDI sp|P51153|RAB13_HUMAN tr|L8HGZ3|L8HGZ3_ACACA sp|P20340|RAB6A_HUMAN tr|L8GN15|L8GN15_ACACA sp|Q55FK2|RAB6_DICDI sp|Q92930|RAB8B_HUMAN tr|L8HGB4|L8HGB4_ACACA sp|P51149|RAB7A_HUMAN tr|L8HHF3|L8HHF3_ACACA sp|P36411|RAB7A_DICDI
The RBH_matrix.tsv table is parsed based on the number of fields in each record to generate the core_genome_clusters.tsv (RBHs shared by the three proteomes) and nonCore_genome_clusters.tsv (RBHs shared only between the REF and a nonREF proteome) tables.
column -t core_genome_clusters.tsv
sp|Q9NP72|RAB18_HUMAN tr|L8GYY7|L8GYY7_ACACA sp|Q54GY8|RAB18_DICDI sp|P62826|RAN_HUMAN tr|L8GF16|L8GF16_ACACA sp|P33519|RAN_DICDI sp|P57729|RAB38_HUMAN tr|L8HAC7|L8HAC7_ACACA sp|Q54QR3|RB32A_DICDI sp|P61106|RAB14_HUMAN tr|L8H946|L8H946_ACACA sp|P36410|RAB14_DICDI sp|P62820|RAB1A_HUMAN tr|L8HD31|L8HD31_ACACA sp|P34139|RAB1A_DICDI sp|Q969Q5|RAB24_HUMAN tr|L8H2Q9|L8H2Q9_ACACA sp|Q55FU9|RAB24_DICDI sp|Q6IQ22|RAB12_HUMAN tr|L8GIC5|L8GIC5_ACACA sp|Q54NU2|RAB1D_DICDI sp|P61018|RAB4B_HUMAN tr|L8HCK2|L8HCK2_ACACA sp|Q54DA7|RAB4_DICDI sp|P20340|RAB6A_HUMAN tr|L8GN15|L8GN15_ACACA sp|Q55FK2|RAB6_DICDI sp|P51149|RAB7A_HUMAN tr|L8HHF3|L8HHF3_ACACA sp|P36411|RAB7A_DICDI
column -t nonCore_genome_clusters.tsv
sp|Q15286|RAB35_HUMAN sp|P34140|RAB1B_DICDI sp|P61019|RAB2A_HUMAN sp|P36409|RAB2A_DICDI sp|Q9ULC3|RAB23_HUMAN sp|Q559X6|RAB2B_DICDI sp|P61006|RAB8A_HUMAN sp|P20790|RAB8A_DICDI sp|Q15907|RB11B_HUMAN sp|P36412|RB11A_DICDI sp|Q9H082|RB33B_HUMAN sp|O76173|RAB1C_DICDI sp|Q9NX57|RAB20_HUMAN sp|Q54RX9|RAB5B_DICDI sp|Q9UL25|RAB21_HUMAN sp|P34142|RAB21_DICDI sp|P62491|RB11A_HUMAN tr|L8HJ43|L8HJ43_ACACA sp|P20339|RAB5A_HUMAN sp|Q86JP3|RAB5A_DICDI sp|P51151|RAB9A_HUMAN tr|L8GPH2|L8GPH2_ACACA sp|Q8WUD1|RAB2B_HUMAN tr|L8H2N2|L8H2N2_ACACA sp|Q14964|RB39A_HUMAN tr|L8GSP6|L8GSP6_ACACA sp|Q96AX2|RAB37_HUMAN sp|P20791|RAB8B_DICDI sp|O95755|RAB36_HUMAN tr|L8H339|L8H339_ACACA sp|Q13636|RAB31_HUMAN tr|L8GKG8|L8GKG8_ACACA sp|P51153|RAB13_HUMAN tr|L8HGZ3|L8HGZ3_ACACA sp|Q92930|RAB8B_HUMAN tr|L8HGB4|L8HGB4_ACACA
Finally, let us explore the FASTA files in the core_clusters directory holding the RBH clusters shared across the three proteomes analyzed:
cd core_clusters && ls
core_cluster_10.fas core_cluster_15.fas core_cluster_20.fas core_cluster_24.fas core_cluster_28.fas core_cluster_12.fas core_cluster_1.fas core_cluster_23.fas core_cluster_26.fas core_cluster_8.fas
for f in *fas; do echo "# $f" && cat $f; echo '------------------------------------------------------------------'; done
# core_cluster_10.fas >P57729 Ras-related protein Rab-38 OS=Homo sapiens OX=9606 GN=RAB38 PE=1 SV=1 MQAPHKEHLYKLLVIGDLGVGKTSIIKRYVHQNFSSHYRATIGVDFALKVLHWDPETVVRLQLWDIAGQERFGNMTRVYY REAMGAFIVFDVTRPATFEAVAKWKNDLDSKLSLPNGKPVSVVLLANKCDQGKDVLMNNGLKMDQFCKEHGFVGWFETSA KENINIDEASRCLVKHILANECDLMESIEPDVVKPHLTSTKVASCSGCAKS >L8HAC7 Ras-related protein Rab OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_126680 PE=3 SV=1 MAAEQEDGVTEYLYKVLVVGDIGTGKTSIIKRYVHNIFSMHYKSTIGVDFALKVINWDPKTIVRLQLWDIAGQERFGNMT RVYYKEAVGAFVVFDVTRVNTFEAVQKWKNDIDNKVTLPPDDRPIPVVLLANKCDLAKEGFARNAAQMDKYCEEHGFAGW EETSAKENINIDKAVKFLVSKILENDINRQNQLQKKDENIVKVGAGGPEPKKAGAGGCCG >Q54QR3 Ras-related protein Rab-32A OS=Dictyostelium discoideum OX=44689 GN=rab32A PE=1 SV=1 MSNNPADDEGYNEYLYKILVVGDIGTGKTSIIKRFVHNIFSMHYKSTIGVDFALKVINWDPKTEVRLQLWDIAGQERFGS MTRVYYKEAVGAMITFDVTRMSTFEAVAKWKADIDSKVTYGADEKPIPVVLLANKCDLGKDAFIKTANDMDKYCKDNGFI GWFETSAKENMNIEKAARFLVDHILKNDVRRNQPIEGTIQPGDLNKQPQPTSTGPSCCK ------------------------------------------------------------------ # core_cluster_12.fas >P61106 Ras-related protein Rab-14 OS=Homo sapiens OX=9606 GN=RAB14 PE=1 SV=4 MATAPYNYSYIFKYIIIGDMGVGKSCLLHQFTEKKFMADCPHTIGVEFGTRIIEVSGQKIKLQIWDTAGQERFRAVTRSY YRGAAGALMVYDITRRSTYNHLSSWLTDARNLTNPNTVIILIGNKADLEAQRDVTYEEAKQFAEENGLLFLEASAKTGEN VEDAFLEAAKKIYQNIQDGSLDLNAAESGVQHKPSAPQGGRLTSEPQPQREGCGC >L8H946 Ras-related protein Rab-14 OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_279530 PE=3 SV=1 MASFPYEYIFKYIIIGDMGVGKSCLLHQFTEKKFIPDSPHTIGVEFGTRIIEVMGKKIKLQIWDTAGQERFRAVTRSYYR GAAGALLVYDITRRPTYNHLTSWLTDARNLTNPNTVIMMIGNKKDLEDQRDVTYEEASNFAKENGLIFLEASAKTGENVE EAFLKTAKLIFQSVQDGSVDVGSDIGVTKKTPLVPPSGGSGGGAGDGAGGCSC >P36410 Ras-related protein Rab-14 OS=Dictyostelium discoideum OX=44689 GN=rab14 PE=1 SV=2 MSFPYEYIFKYIIIGDMGVGKSCLLHQFTENKFVPDSPHTIGVEFGTRIVDVNNKKIKLQIWDTAGQERFRAVTRSYYRG AAGALLVYDITRRITYNHLTTWLTDARNLTNPNTVIMLIGNKKDLEGQRDVTYEEASAFAKQNGLIFVESSAKTGENVEE AFLRTAKLIFQSVQEGNVDLIPDGGITKNPPQTITDKPQDASKCSC ------------------------------------------------------------------ # core_cluster_15.fas >P62820 Ras-related protein Rab-1A OS=Homo sapiens OX=9606 GN=RAB1A PE=1 SV=3 MSSMNPEYDYLFKLLLIGDSGVGKSCLLLRFADDTYTESYISTIGVDFKIRTIELDGKTIKLQIWDTAGQERFRTITSSY YRGAHGIIVVYDVTDQESFNNVKQWLQEIDRYASENVNKLLVGNKCDLTTKKVVDYTTAKEFADSLGIPFLETSAKNATN VEQSFMTMAAEIKKRMGPGATAGGAEKSNVKIQSTPVKQSGGGCC >L8HD31 Rasrelated protein ORAB-1, putative OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_070630 PE=4 SV=1 MSDYDHLFKILMVGDSGVGKSSLLLRFTDDTFTDNFISTIGVDFKIRTVNLDGKVIKMQIWDTAGQERFRTITSSYYRGA HGVILVYDVTDQVSFNNARQWLTEIERYACGNVVKLLVGNKSDLVSKRVVSTATGKEFADQFHLPFIEASAKDGSNVKQA FMTLVKEVYEKVVGDSASSGGGSLGQSDNNKVNVAAAKPEKKKGGLKCIL >P34139 Ras-related protein Rab-1A OS=Dictyostelium discoideum OX=44689 GN=rab1A PE=2 SV=2 MNPDYHYLFKLLLIGDSGVGKSCLLLRFADDTYSESFISTIGVDFKIRTIELNGKTIKLQIWDTAGQERFRTITSSYYRG AHGIIVVYDVTDKLTFENVRQWLQEIDRFACENVNKLLVGNKSDLVAKKVVDFNTAKAFADSLQIPFLETSAKQSTNVEQ AFMTMATEIKNRLTASQPTQTVDKNKVVPGSSAPISPKSGCC ------------------------------------------------------------------ # core_cluster_1.fas >Q9NP72 Ras-related protein Rab-18 OS=Homo sapiens OX=9606 GN=RAB18 PE=1 SV=1 MDEDVLTTLKILIIGESGVGKSSLLLRFTDDTFDPELAATIGVDFKVKTISVDGNKAKLAIWDTAGQERFRTLTPSYYRG AQGVILVYDVTRRDTFVKLDNWLNELETYCTRNDIVNMLVGNKIDKENREVDRNEGLKFARKHSMLFIEASAKTCDGVQC AFEELVEKIIQTPGLWESENQNKGVKLSHREEGQGGGACGGYCSVL >L8GYY7 Rasrelated protein Rab-18-like, putative OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_369270 PE=4 SV=1 MEGTTGSASHFDHLFKLLIIGDSSVGKSSILLRFTEDEFDDEHPVTIGVDFKVKTIQLGAKRINLTIWDTAGQEKFRSLT SSYYRGTQGIILVYDVSSRESFQHLSVWLNEIEMYCNNSDVVKLLVANKIDLGDRQVSREEGLAFAKSKAMVFIECSAKT KLGIQQAFEELVTKILETPSLYEKETKATGTVSMNQDSEEQQEGVCSYC >Q54GY8 Ras-related protein Rab-18 OS=Dictyostelium discoideum OX=44689 GN=rab18 PE=3 SV=1 MEEDKQYKVLLIGDSDVGKTSIVKRFSDDTFDEDLLCTIGVEFKMKEVKVDGKKVDLCIWDTAGQEKFRALISSYYRGAH GIILTYDVTKRESFDNLNYWLNEVENFANRSNLVKLLVGNKIDKENREVTREEGAEFAKKKAMLFIECSAKSKIGIQQAF EELAQKIIEIPQNTSSSQPKQRNTGSVKVEDEPDHNQGVCSC ------------------------------------------------------------------ # core_cluster_20.fas >Q969Q5 Ras-related protein Rab-24 OS=Homo sapiens OX=9606 GN=RAB24 PE=1 SV=1 MSGQRVDVKVVMLGKEYVGKTSLVERYVHDRFLVGPYQNTIGAAFVAKVMSVGDRTVTLGIWDTAGSERYEAMSRIYYRG AKAAIVCYDLTDSSSFERAKFWVKELRSLEEGCQIYLCGTKSDLLEEDRRRRRVDFHDVQDYADNIKAQLFETSSKTGQS VDELFQKVAEDYVSVAAFQVMTEDKGVDLGQKPNPYFYSCCHH >L8H2Q9 Ras family protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_269260 PE=4 SV=1 MASVDLKVVLLGQSDVGKTCLVDRYLNGKYQDTISPTVGAAFGAKKLYVGDRPLTVGVWDTAGAERFEAMSRLYYRAARA ACVCYDLTRAETWPKVKFWINELLSFEKECALYIVGTKADLLEEGVPRGVEASEVADYARQINAQVFDALLIPCPHVTFE TSSKKGYSVEDLFDTIVYDFVKRGAGGEVPLGGGGGGDVITPSEAPPACSC >Q55FU9 Ras-related protein Rab-24 OS=Dictyostelium discoideum OX=44689 GN=rab24 PE=3 SV=1 MTKTKIDLKVVLLGYASVGKTCIVTRYTSGQFGDTHTTIGGAFSSKRVVVGETEVLLGIWDTAGTERYQAVNVSYYRRAN AAIVCYDLTNRESWEKVTFWAEELTQNEPEIEIYIVGTKLDLIQQGDIKAVPEEEVKQTARRYKAHIFETSSRTGENVSL LFQTIAEDFCKRTNNGTNPVNSNPSNVVNVNTQTQKKKGGCC ------------------------------------------------------------------ # core_cluster_23.fas >Q6IQ22 Ras-related protein Rab-12 OS=Homo sapiens OX=9606 GN=RAB12 PE=1 SV=3 MDPGAALQRRAGGGGGLGAGSPALSGGQGRRRKQPPRPADFKLQVIIIGSRGVGKTSLMERFTDDTFCEACKSTVGVDFK IKTVELRGKKIRLQIWDTAGQERFNSITSAYYRSAKGIILVYDITKKETFDDLPKWMKMIDKYASEDAELLLVGNKLDCE TDREITRQQGEKFAQQITGMRFCEASAKDNFNVDEIFLKLVDDILKKMPLDILRNELSNSILSLQPEPEIPPELPPPRPH VRCC >L8GIC5 Ras family protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_399850 PE=4 SV=1 MNSDASDTPDYLLKIVIVGDSGVGKSNIMSRFTNDTFEESSKTTIGVAFATKNMAVETSNDSTKDCKKQVKLQVWDTAGQ ERYRALSSAYYRGAQGALVVYDITSRESLDSIPKWLEEIDKYCTQDVVVTLVGNKLDLSENRCVSVEEGKKVAARENMFF IETSAKDATNIEKAFTHLIKEIIQSNLSQANNDGKIAGKAPSDTGITLKPPEEPESDPNAPPQPGCSC >Q54NU2 Ras-related protein Rab-1D OS=Dictyostelium discoideum OX=44689 GN=rab1D PE=1 SV=1 MSMAPEHDFFFKILLIGDSGVGKSCLLLRFADDSWTDTHISTIGVDFKIKTLNLDGKTIKLQIWDTAGQERFRTITSSYY RGAQGIILVYDCTDQDSFTNVKQWMGEIDRYACENVNKLLVGNKTDLVNEKVVDSNQAKSFAESYGIPFIETSAKNATNV EECFISMARDIKNRLADIQETPKPDEVDIKSKNKTKSGGKKSFC ------------------------------------------------------------------ # core_cluster_24.fas >P61018 Ras-related protein Rab-4B OS=Homo sapiens OX=9606 GN=RAB4B PE=1 SV=1 MAETYDFLFKFLVIGSAGTGKSCLLHQFIENKFKQDSNHTIGVEFGSRVVNVGGKTVKLQIWDTAGQERFRSVTRSYYRG AAGALLVYDITSRETYNSLAAWLTDARTLASPNIVVILCGNKKDLDPEREVTFLEASRFAQENELMFLETSALTGENVEE AFLKCARTILNKIDSGELDPERMGSGIQYGDASLRQLRQPRSAQAVAPQPCGC >L8HCK2 RAB4, putative OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_036030 PE=4 SV=1 MAERYNYLFKFIIIGDAATGKSCLLHRFIDNKFMRDSTHTIGVEFGSKIINIQGNNIKLQIWDTAGQVTRSYYRGAAGAL LVYDITSRESYNHLSTWLEDARTLASAGIVIILVGNKKDLEDEREVTFIEASRFAQENNLMFLETSALTGDNVEEVFLKC ARSIFSKVEAKAVDPKDSGSGIVENNMARPGSAVATGGSDGCAC >Q54DA7 Ras-related protein Rab-4 OS=Dictyostelium discoideum OX=44689 GN=rab4 PE=3 SV=1 MGKPLYLVKFIIIGSQSVGKSCLLFRFIDNKFKSQSTHTIGVDFSSRVVDIQGKNVKLQIWDTAGQERFRSVVISYYRGS AGVALVYDVTNRESYNHITNWLSDVKSLASPDVTIILVGNKADLTEQREVTFLEASRIAQENGLLFMETSALTGEGVEEM FLKCTRTIMTKIDSGEVNLEHLGVQISSDSGNVTKKPGDSSNCSC ------------------------------------------------------------------ # core_cluster_26.fas >P20340 Ras-related protein Rab-6A OS=Homo sapiens OX=9606 GN=RAB6A PE=1 SV=3 MSTGGDFGNPLRKFKLVFLGEQSVGKTSLITRFMYDSFDNTYQATIGIDFLSKTMYLEDRTVRLQLWDTAGQERFRSLIP SYIRDSTVAVVVYDITNVNSFQQTTKWIDDVRTERGSDVIIMLVGNKTDLADKRQVSIEEGERKAKELNVMFIETSAKAG YNVKQLFRRVAAALPGMESTQDRSREDMIDIKLEKPQEQPVSEGGCSC >L8GN15 GTPbinding protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_139260 PE=4 SV=1 MAGSNPLAKYKLVFLGDQSVGKTSIITRFMYDTFDNTYQATIGIDFLSKTMYLEDRTIRLQLWDTAGQERFRSLIPSYIR DSSVAIVVYDITNRTTFLDTARWVEDIRAERGNDVILMLVGNKTDLADKRCVSAEEGESKAKEFGAMFIETSAKAGYNVK PLFRQVAAALPGIDPIIQKQNTRKWKPAPARCERL >Q55FK2 Ras-related protein Rab-6 OS=Dictyostelium discoideum OX=44689 GN=rab6 PE=3 SV=1 MESLSKYKLVFLGDQSVGKTSIITRFMYDTFDITYQATIGIDFLSKTMYLEDRTVRLQLWDTAGQERFRSLIPSYIRDSS VAIVVYDITNKVSFTNTIKWIEDVRNERGNNVVIMLVGNKTDLADKRQVSIEEGEAKAKEYDIMFTETSAKAGFNIKALF RKVASALPGIDSNSMTKNQHEIISEVDITDGGKPALKEGDGFCSSSRC ------------------------------------------------------------------ # core_cluster_28.fas >P51149 Ras-related protein Rab-7a OS=Homo sapiens OX=9606 GN=RAB7A PE=1 SV=1 MTSRKKVLLKVIILGDSGVGKTSLMNQYVNKKFSNQYKATIGADFLTKEVMVDDRLVTMQIWDTAGQERFQSLGVAFYRG ADCCVLVFDVTAPNTFKTLDSWRDEFLIQASPRDPENFPFVVLGNKIDLENRQVATKRAQAWCYSKNNIPYFETSAKEAI NVEQAFQTIARNALKQETEVELYNEFPEPIKLDKNDRAKASAESCSC >L8HHF3 Ras-related protein Rab-7A OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_074580 PE=4 SV=1 MSTRKKVLLKIIILGDSGVGKTSLMNQYVNKKFSNQYKATIGADFLTKEVMVDDKLVTLQIWDTAGQERFQSLGVAFYRG ADACVLVFDVNAGPRDPETFPFIVLGNKIDLESSRVVSQKRAQTWCQSKGNIPYFETSAKEAINVEQAFQTIAKNAMKEE EDVDLAGITDAIQLDKSEPSQSGCSC >P36411 Ras-related protein Rab-7a OS=Dictyostelium discoideum OX=44689 GN=rab7A PE=1 SV=1 MATKKKVLLKVIILGDSGVGKTSLMNQYVNKKFSNQYKATIGADFLTKELMVDDRVVTMQIWDTAGQERFQSLGVAFYRG ADCCVLCYDVNVAKTFENLDSWRDEFLIQAGPRDPDNFPFVVLGNKIDLENQRVVSQKRAASWCQSKGNIPYFETSAKEA INVEQAFQTIARNAIKLEDGLVFPIPTNIQVIPEPQPAKSGCC ------------------------------------------------------------------ # core_cluster_8.fas >P62826 GTP-binding nuclear protein Ran OS=Homo sapiens OX=9606 GN=RAN PE=1 SV=3 MAAQGEPQVQFKLVLVGDGGTGKTTFVKRHLTGEFEKKYVATLGVEVHPLVFHTNRGPIKFNVWDTAGQEKFGGLRDGYY IQAQCAIIMFDVTSRVTYKNVPNWHRDLVRVCENIPIVLCGNKVDIKDRKVKAKSIVFHRKKNLQYYDISAKSNYNFEKP FLWLARKLIGDPNLEFVAMPALAPPEVVMDPALAAQYEHDLEVAQTTALPDEDDDL >L8GF16 GTP-binding nuclear protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_260190 PE=3 SV=1 MQGQVPTFKLILVGDGGVGKTTFVKRHLTGEFEKKYVATLGVEVHPLSFHTNFGPICFNVWDTAGQEKFGGLRDGYYIQG QCAIIMFDVTSRITYKNVPNWHRDLVRVCENIPIVLVGNKVDVKDRKVKAKAITFHRKKNLQYYDISAKSNYNFEKPFLF LARKLTNKADLALVESPALAPPEVALDMNQIKQYEQELSLAAQTPLPDEDEDL >P33519 GTP-binding nuclear protein Ran OS=Dictyostelium discoideum OX=44689 GN=ranA PE=1 SV=1 MAEKEQIKLVLVGDGGVGKTTFVQRHLTGEFEPRYIPTLGVSVHPLIFYTNFGKIHFNVWDTAGQEKFGGLRDGYYIQGN CAIIMFDVTSRISYKNVPNWHSDLTRVCENIPIVLCGNKVDVKDRKVKPSQIVFHRRYNLSYYDVSAKSNYNFEKPFVWL TSKLLGNKAVTLVQQPTLKLPETVLDSNLMSLYEKEVADAAALPLPEDNDDL
In summary, the compute_RBH_clusters.sh script is a helpful and efficient tool to compute clusters of core and non-core orthologs shared between a reference proteome an a collection of additional ones, including the generation of the corresponding FASTA files.
Use the compute_RBH_clusters.sh script to compute the RBHs shared between the HUMAN_25_ARF_and_ARL.faa proteins, and the ACACA.faa, and DICDI.faa proteomes.
mkdir ARL_ARF_RBHS && cd ARL_ARF_RBHS
ln -s ../HUMAN_25_ARF_and_ARL.faa .
ln -s ../ACACA.faa .
ln -s ../DICDI.faa .
AIMS & data:
Background info: The 3cass_amplicons.fna sequences were generated in our laboratory from diverse multidrug-resistant Enterobacteriaceae isolates recovered from polluted rivers in Morelos, Mexico. We used PCR to screen our collection of isolates for the presence of the intI integrase gene with the conserved HS458/HS459 primers, which is a hallmark of class-I integrons. Positive isolates were then subjected to PCR with the conserved CS-F & CS-R primers that amplify gene cassette arrays captured by integrons.
Figure 2. The structure of a class-I integron.
If you like, have a look at this nice video-protocol on Screening Foodstuffs for Class 1 Integrons and Gene Cassettes by Liette S. Waldron and Michael R. Gillings from the Department of Biological Sciences, Macquarie University. June 19, 2015 in J Vis Exp. 2015; (100): 52889. doi: 10.3791/52889.
# Move into the BLAST_DB_AA dir and extract the sequences in the tgz file
cd $wkdir && cd BLAST_DB_AA
tar -xvzf gene_discovery_and_annotation_using_blastx.tgz
# count sequences
grep -c integron_cassettes4blastdb.faa # 2297 sequences
# explore headers
grep '^>' integron_cassettes4blastdb.faa | head -20
>gi|17026024 |[Achromobacter xylosoxidans]|AXI2|||||||Japan:Maebashi||clinical strain; synonym:Alcaligenes xylosoxidans||blaIMP-10|metallo-beta-lactamase IMP-10|741|AB074435|Iyobe_S.|Antimicrob.AgentsChemother.46_2014-2016_2002|12019129 >gi|17026026 |[Acinetobacter baumannii]|ABI1|||||||Japan:Maebashi||clinical strain||blaIMP-11|metallo-beta-lactamase IMP-11|738|AB074436|unpubl >gi|34482004 |[Vibrio cholerae O1]||||||||||||aadA1|aminoglycoside adenyltransferase|792|AB107663|unpubl >gi|38524487 |[Vibrio cholerae non-O1/non-O139]|||||Non O1/Non O139|||||||dfrA15|dihydrofolate reductase|474|AB113114|unpubl >gi|51699488 |[Escherichia coli]|D49||feces|||||China: Guangzhou||patient||aadA1|aminoglycoside adenylyltransferase|792|AB189263|Su_J.|FEMSMicrobiol.Lett.254_75-80_2006|16451182 ...
# 1. change >gi|17026026 |[Acinetobacter baumannii]|ABI1... ==>
# >gnl|casseteDB|17026026 |[Acinetobacter baumannii]|ABI1...
# and
# 2. simplify the header a bit, to avoid the many empty fields
grep '>' integron_cassettes4blastdb.faa | head -1 | sed 's#|#\n#g' | cat -n
1 >gi 2 17026024 3 [Achromobacter xylosoxidans] 4 AXI2 5 6 7 8 9 10 11 Japan:Maebashi 12 13 clinical strain; synonym:Alcaligenes xylosoxidans 14 15 blaIMP-10 16 metallo-beta-lactamase IMP-10 17 741 18 AB074435 19 Iyobe_S. 20 Antimicrob.AgentsChemother.46_2014-2016_2002 21 12019129
# use an AWK one-liner to do the complete job. Use the list of |-delimited fields listed above to select some of the most relevant ones
# NOTE that we add an extra field in the following transformation: >gi|17026026 => >gnl|casseteDB|17026026;
# therefore we need to add one to the field numbers listed above in order to get the desired ones
awk 'BEGIN{FS=OFS="|"}/>/{sub(/^>gi/, ">gnl|casseteDB"); print $1,$2,$3,$4,$16,$17,$18,$19,$22}!/>/{print $1}' integron_cassettes4blastdb.faa | grep '^>' | head -5
>gnl|casseteDB|17026024 |[Achromobacter xylosoxidans]|blaIMP-10|metallo-beta-lactamase IMP-10|741|AB074435|12019129 >gnl|casseteDB|17026026 |[Acinetobacter baumannii]|blaIMP-11|metallo-beta-lactamase IMP-11|738|AB074436| >gnl|casseteDB|34482004 |[Vibrio cholerae O1]|aadA1|aminoglycoside adenyltransferase|792|AB107663| >gnl|casseteDB|38524487 |[Vibrio cholerae non-O1/non-O139]|dfrA15|dihydrofolate reductase|474|AB113114| >gnl|casseteDB|51699488 |[Escherichia coli]|aadA1|aminoglycoside adenylyltransferase|792|AB189263|16451182
# save the edited faa file as integron_cassettes4blastdb.faaED
awk 'BEGIN{FS=OFS="|"}/>/{sub(/^>gi/, ">gnl|casseteDB"); print $1,$2,$3,$4,$16,$17,$18,$19,$22}!/>/{print $1}' integron_cassettes4blastdb.faa > integron_cassettes4blastdb.faaED
# lets check the result a final time
grep '^>' integron_cassettes4blastdb.faaED | head -20
# 2) format DB with makeblastdb
# formatdb -i integron_cassettes4blastdb.faaED -p T -o T -n integron_cassetteDB # legacy blast
makeblastdb -in integron_cassettes4blastdb.faaED -dbtype prot -parse_seqids
Building a new DB, current time: 10/18/2022 19:04:31 New DB name: /home/vinuesa/cursos/intro_biocomp/sesion2_blast/BLAST_DB_AA/integron_cassettes4blastdb.faaED New DB title: integron_cassettes4blastdb.faaED Sequence type: Protein Deleted existing Protein BLAST database named /home/vinuesa/cursos/intro_biocomp/sesion2_blast/BLAST_DB_AA/integron_cassettes4blastdb.faaED Keep MBits: T Maximum file size: 1000000000B Adding sequences from FASTA; added 2297 sequences in 0.490659 seconds.
# List the output generated by makeblastdb, sorted by modification time in reverse manner
ls -ltr
-rw-r--r--. 1 vinuesa faculty 4505 ago 31 2011 3cass_amplicons.fna -rw-r--r--. 1 vinuesa faculty 924971 ago 31 2011 integron_cassettes4blastdb.faa -rw-r--r--. 1 vinuesa faculty 1129 jul 26 2019 unknown_prot.faa -rw-r--r--. 1 vinuesa faculty 751713 oct 10 20:15 gene_discovery_and_annotation_using_blastx.tgz -rw-r--r--. 1 vinuesa faculty 803081 oct 18 19:03 integron_cassettes4blastdb.faaED -rw-r--r--. 1 vinuesa faculty 566907 oct 18 19:04 integron_cassettes4blastdb.faaED.psq -rw-r--r--. 1 vinuesa faculty 9220 oct 18 19:04 integron_cassettes4blastdb.faaED.pog -rw-r--r--. 1 vinuesa faculty 18520 oct 18 19:04 integron_cassettes4blastdb.faaED.pin -rw-r--r--. 1 vinuesa faculty 329297 oct 18 19:04 integron_cassettes4blastdb.faaED.phr -rw-r--r--. 1 vinuesa faculty 9192 oct 18 19:04 integron_cassettes4blastdb.faaED.pto -rw-r--r--. 1 vinuesa faculty 27572 oct 18 19:04 integron_cassettes4blastdb.faaED.pot -rw-r--r--. 1 vinuesa faculty 16384 oct 18 19:04 integron_cassettes4blastdb.faaED.ptf -rw-r--r--. 1 vinuesa faculty 63027 oct 18 19:04 integron_cassettes4blastdb.faaED.pos -rw-r--r--. 1 vinuesa faculty 188416 oct 18 19:04 integron_cassettes4blastdb.faaED.pdb
The aim of the exercise is to assign functions or annotations to the problem proteins found in file unknown_prot.faa
# How many proteins sequences are stored in unknown_prot.faa
grep -c '^>' unknown_prot.faa
cat unknown_prot.faa
# How do we run blast? => explore the program's help output
blastp -h # short format
blastp -help # long format
# launch a standard blastp search, asking the program to provide a tabular format for the top 10 hits
blastp -query unknown_prot.faa -db integron_cassettes4blastdb.faaED -outfmt 6 -num_alignments 10
#>>> Lets add some interesting information to the table, missing in the standard -outfmt 6 output
# run blastp using outfmt 6 and passing a user-specified formatting string
blastp -query unknown_prot.faa -db integron_cassettes4blastdb.faaED -outfmt '6 qseqid sseqid qlen slen qstart qend sstart send length pident positive mismatch gaps evalue bitscore qcovs' -num_alignments 10 > blastp_10hits.out
# prepare a header file with the field names for easier interpretation of the blastp ouptput table
echo -e 'qseqid\tsseqid\tqlen\tslen\tqstart\tqend\tsstart\tsend\tlength\tpident\tpositive\tmismatch\tgaps\tevalue\tbitscore\tqcovs' > header1.tab
cat header1.tab
# add the header to the output table
cat header1.tab blastp_10hits.out > blastp_10hits.tsv
column -t blastp_10hits.tsv | head -20
qseqid sseqid qlen slen qstart qend sstart send length pident positive mismatch gaps evalue bitscore qcovs gi|11037699|gb|AAG27703.1|AF300454_1 gnl|casseteDB|87128428 266 267 1 266 1 266 266 100.000 266 0 0 0.0 537 100 gi|11037699|gb|AAG27703.1|AF300454_1 gnl|casseteDB|145321080 266 267 1 266 1 266 266 99.624 266 1 0 0.0 536 100 gi|11037699|gb|AAG27703.1|AF300454_1 gnl|casseteDB|44844209 266 267 1 266 1 266 266 99.624 266 1 0 0.0 536 100 gi|11037699|gb|AAG27703.1|AF300454_1 gnl|casseteDB|102231612 266 267 1 266 1 266 266 99.248 266 2 0 0.0 535 100 gi|11037699|gb|AAG27703.1|AF300454_1 gnl|casseteDB|38492196 266 267 1 266 1 266 266 90.226 256 26 0 0.0 497 100 gi|11037699|gb|AAG27703.1|AF300454_1 gnl|casseteDB|118402723 266 267 1 266 1 266 266 89.850 255 27 0 0.0 494 100 gi|11037699|gb|AAG27703.1|AF300454_1 gnl|casseteDB|66277418 266 267 1 266 1 266 266 89.098 252 29 0 3.06e-179 489 100 gi|11037699|gb|AAG27703.1|AF300454_1 gnl|casseteDB|148455769 266 267 1 266 1 266 266 87.594 248 33 0 1.83e-176 483 100 gi|11037699|gb|AAG27703.1|AF300454_1 gnl|casseteDB|30014072 266 266 1 266 1 265 266 73.684 223 69 1 6.02e-146 405 100 gi|11037699|gb|AAG27703.1|AF300454_1 gnl|casseteDB|17026024 266 247 43 252 24 227 211 33.175 120 133 8 1.82e-37 128 79 gi|9294830|gb|AAF86698.1|AF180959_1 gnl|casseteDB|56691997 390 387 5 390 1 386 386 99.482 384 2 0 0.0 789 99 gi|9294830|gb|AAF86698.1|AF180959_1 gnl|casseteDB|146151072 390 383 5 388 1 382 388 45.361 251 202 10 2.45e-114 335 98 gi|9294830|gb|AAF86698.1|AF180959_1 gnl|casseteDB|40287460 390 382 13 387 5 381 379 39.314 232 224 6 7.18e-96 287 96 gi|9294830|gb|AAF86698.1|AF180959_1 gnl|casseteDB|133905118 390 410 13 387 33 409 379 39.578 231 223 6 5.97e-95 286 96 gi|9294830|gb|AAF86698.1|AF180959_1 gnl|casseteDB|76057210 390 380 1 387 1 379 396 38.131 232 219 26 4.63e-88 267 99 gi|9294830|gb|AAF86698.1|AF180959_1 gnl|casseteDB|146151112 390 276 316 380 99 165 67 22.388 34 50 2 2.4 25.8 17 gi|9294830|gb|AAF86698.1|AF180959_1 gnl|casseteDB|133905089 390 126 4 37 2 35 34 38.235 22 21 0 5.8 23.9 9 gi|9294830|gb|AAF86698.1|AF180959_1 gnl|casseteDB|161087277 390 80 133 201 15 78 79 25.316 34 34 25 6.9 23.1 18 gi|9294830|gb|AAF86698.1|AF180959_1 gnl|casseteDB|133905132 390 101 255 287 71 98 33 36.364 18 16 5 7.4 23.1 8 gi|9294830|gb|AAF86698.1|AF180959_1 gnl|casseteDB|133905029 390 337 100 154 247 287 55 21.818 24 29 14 7.8 24.3 14
As in the previous exercise on 16S rDNA sequence classification, for a confident annotation of a protein it is important to consider some basic alignment statistics.
In our particular case, since the query proteins correspond to gene cassettes encoding for antibiotic resistance, we are interested in identifying proteins down to the variant level. Therefore, we will impose very stringent filtering criteria to keep only hits with high pident and qcov.
# Print numbered list of fasta header fields to filter hits with pident >= 95% and qcovs >= 97%
sed 's/\t/\n/g' header1.tab | cat -n
1 qseqid 2 sseqid 3 qlen 4 slen 5 qstart 6 qend 7 sstart 8 send 9 length 10 pident 11 positive 12 mismatch 13 gaps 15 bitscore 16 qcovs
# Filter out hits with >= 95% pident && qcovs >= 97%
awk 'BEGIN{FS=OFS="\t"} $10 >= 95 && $16 >= 97' blastp_10hits.tsv | column -t
qseqid sseqid qlen slen qstart qend sstart send length pident positive mismatch gaps evalue bitscore qcovs gi|11037699|gb|AAG27703.1|AF300454_1 gnl|casseteDB|87128428 266 267 1 266 1 266 266 100.000 266 0 0 0.0 537 100 gi|11037699|gb|AAG27703.1|AF300454_1 gnl|casseteDB|145321080 266 267 1 266 1 266 266 99.624 266 1 0 0.0 536 100 gi|11037699|gb|AAG27703.1|AF300454_1 gnl|casseteDB|44844209 266 267 1 266 1 266 266 99.624 266 1 0 0.0 536 100 gi|11037699|gb|AAG27703.1|AF300454_1 gnl|casseteDB|102231612 266 267 1 266 1 266 266 99.248 266 2 0 0.0 535 100 gi|9294830|gb|AAF86698.1|AF180959_1 gnl|casseteDB|56691997 390 387 5 390 1 386 386 99.482 384 2 0 0.0 789 99
# add a subject title (stitle), to get the description line of the proteins we are finding as top hits
echo -e 'qseqid\tsseqid\tqlen\tslen\tstitle\tlength\tpident\tpositive\tmismatch\tgaps\tevalue\tbitscore\tqcovs' > header1.tab
blastp -query unknown_prot.faa -db integron_cassettes4blastdb.faaED -outfmt '6 qseqid sseqid qlen slen stitle length pident positive mismatch gaps evalue bitscore qcovs' -num_alignments 10 > blastp_10hits.out
cat header1.tab blastp_10hits.out > blastp_10hits.tsv
cut -f1,5,7,13 blastp_10hits.tsv | head -20 | column -ts $'\t'
qseqid stitle pident qcovs gi|11037699|gb|AAG27703.1|AF300454_1 |[Pseudomonas aeruginosa]|blaVIM-3|VIM-3 metallo-beta-lactamase|801|DQ342344| 100.000 100 gi|11037699|gb|AAG27703.1|AF300454_1 |[Pseudomonas putida]|blaVIM-6|metallo-beta-lactamase VIM-6|801|EF522838|18757180 99.624 100 gi|11037699|gb|AAG27703.1|AF300454_1 |[Pseudomonas aeruginosa]|blaVIM|class B beta-lactamase|801|AJ628983| 99.624 100 gi|11037699|gb|AAG27703.1|AF300454_1 |[Pseudomonas aeruginosa]|blaVIM-2|metallo-beta-lactamase VIM-2|801|DQ522233| 99.248 100 gi|11037699|gb|AAG27703.1|AF300454_1 |[Pseudomonas aeruginosa]|blaVIM-4|metallo-beta-lactamase VIM-4|801|AY460181| 90.226 100 gi|11037699|gb|AAG27703.1|AF300454_1 |[Pseudomonas aeruginosa]|blaVIM-1|VIM-1 metallo-beta-lactamase|801|AJ969237| 89.850 100 gi|11037699|gb|AAG27703.1|AF300454_1 |[Enterobacter cloacae]|blaVIM-5|metallo-beta-lactamase VIM-5|801|DQ023222|16189133 89.098 100 gi|11037699|gb|AAG27703.1|AF300454_1 |[Pseudomonas aeruginosa]|blaVIM-13|class B carbapenemase VIM-13|801|EF577407|18644957 87.594 100 gi|11037699|gb|AAG27703.1|AF300454_1 |[Pseudomonas aeruginosa]|blaVIM-7|metallo-b-lactamase|798|AJ536835|14693560 73.684 100 gi|11037699|gb|AAG27703.1|AF300454_1 |[Achromobacter xylosoxidans]|blaIMP-10|metallo-beta-lactamase IMP-10|741|AB074435|12019129 33.175 79 gi|9294830|gb|AAF86698.1|AF180959_1 |[Klebsiella pneumoniae]|acc-1|beta-lactamase|1161|AJ870923|16982793 99.482 99 gi|9294830|gb|AAF86698.1|AF180959_1 |[Klebsiella pneumoniae]||cmy-8|1149|EF382672|17526756 45.361 98 gi|9294830|gb|AAF86698.1|AF180959_1 |[Escherichia coli]|cmy-13|CMY-13|1146|AY339625|16048979 39.314 96 gi|9294830|gb|AAF86698.1|AF180959_1 |[Salmonella enterica subsp. enterica serovar Newport str. SL254]|blaCMY-2-1|CMY-2 beta-lactamase 1|1230|CP000604|17375195 39.578 96 gi|9294830|gb|AAF86698.1|AF180959_1 |[Klebsiella pneumoniae]|ampC|beta-lactamase class C DHA-1|1140|AJ971343|16436717 38.131 99 gi|9294830|gb|AAF86698.1|AF180959_1 |[Klebsiella pneumoniae]||sfpB|828|EF382672|17526756 22.388 17 gi|9294830|gb|AAF86698.1|AF180959_1 |[Salmonella enterica subsp. enterica serovar Newport str. SL254]||hypothetical protein|378|CP000604|17375195 38.235 9 gi|9294830|gb|AAF86698.1|AF180959_1 |[Salmonella enterica subsp. enterica serovar Choleraesuis]|orf20|Orf20|240|EU219534| 25.316 18 gi|9294830|gb|AAF86698.1|AF180959_1 |[Salmonella enterica subsp. enterica serovar Newport str. SL254]||conserved hypothetical protein|303|CP000604|17375195 36.364 8
#>>> retrieve the best hit found for each input gene
blastp -query unknown_prot.faa -db integron_cassettes4blastdb.faaED -outfmt '6 qseqid sseqid qlen slen stitle length pident positive mismatch gaps evalue bitscore qcovs' -num_alignments 1 > blastp_tophits.out
cat header1.tab blastp_tophits.out > blastp_tophits.tsv
cut -f1,5,7,13 blastp_tophits.tsv | head | column -ts $'\t'
qseqid stitle pident qcovs gi|11037699|gb|AAG27703.1|AF300454_1 |[Pseudomonas aeruginosa]|blaVIM-3|VIM-3 metallo-beta-lactamase|801|DQ342344| 100.000 100 gi|9294830|gb|AAF86698.1|AF180959_1 |[Klebsiella pneumoniae]|acc-1|beta-lactamase|1161|AJ870923|16982793 99.482 99 gi|15705413|gb|AAL05630.1|AF395881_1 |[Klebsiella pneumoniae]|bla|extended-spectrum beta-lactamase CTX-M-19|876|AF458080|12936998 52.158 95
We can easily retrieve the top-scoring hits from the database using blastdbcmd using a list of qseqids, as shown below.
# get the sseqids only for the top hits with % sequence identity >= 95 and query coverage >= 97%
awk 'BEGIN{FS=OFS="\t"}NR > 1 && $7 >= 95 && $13 >= 97{print $2}' blastp_tophits.tsv > sseqids_for_top_hits_with_ge95pident_ge97qcovs.list
# finally retrieve the sequences for the sseqids_for_top_hits_with_ge95pident_ge97qcovs.list from the integron_cassettes4blastdb
blastdbcmd -db integron_cassettes4blastdb.faaED -entry_batch sseqids_for_top_hits_with_ge95pident_ge97qcovs.list -dbtype prot -out top_hits_with_ge95pident_ge97qcovs.faa
cat top_hits_with_ge95pident_ge97qcovs.faa
>casseteDB:87128428 |[Pseudomonas aeruginosa]|blaVIM-3|VIM-3 metallo-beta-lactamase|801|DQ342344| MFKLLSKLLVYLTASIMAIASPLAFSVDSSGEYPTVSEIPVGEVRLYQIADGVWSHIATKSFDGAVYPSNGLIVRDGDEL LLIDTAWGAKNTAALLAEIEKQIGLPVTRAVSTHFHDDRVGGVDVLRAAGVATYASPSTRRLAEVEGSEIPTHSLEGLSS SGDAVRFGPVELFYPGAAHSTDNLVVYVPSASVLYGGCAIYELSRTSAGNVADADLAEWPTSIERIQQHYPEAQFVIPGH GLPGGLDLLKHTTNVVKAHTNRSVVE* >casseteDB:56691997 |[Klebsiella pneumoniae]|acc-1|beta-lactamase|1161|AJ870923|16982793 MQNTLKLLSVITCLAATVQGALAANIDESKIKDTVDDLIQPLMQKNNIPGMSVAVTVNGKNYIYNYGLAAKQPQQPVTEN TLFEVGSLSKTFAATLASYAQVSGKLSLDQSVSHYVPELRGSSFDHVSVLNVGTHTSGLQLFMPEDIKNTTQLMAYLKAW KPADAAGTHRVYSNIGTGLLGMIAAKSLGVSYEDAIEKTLLPQLGMHHSYLKVPADQMENYAWGYNKKDEPVHVNMEILG NEAYGIKTTSSDLLRYVQANMGQLKLDANAKMQQALTATHTGYFKSGEITQDLMWEQLPYPVSLPNLLTGNDMAMTKSVA TPIVPPLPPQENVWINKTGSTNGFGAYIAFVPAKKMGIVMLANKNYSIDQRVTVAYKILSSLEGNK*
If you read the detailed documentation of \(blastp\) blastp -help
, you
will find under the output section that -outfmt 7 is a tabular with
comments.
Here a few lines of code to generate a standard tabular output with a
customized header, parsing the -outfmt 7
output. Note the
use of cmd <(command pipeline)
, a syntax called
process substitution, which allows to use the ouptut of \(command\ pipeline\) as if it were a file
for input into \(cmd\). This is
convenient, as it avoids the need of writing an intermediate file.
# 1. Run blast[p] with -outfmt 7 and the desired output fields
blastp -query unknown_prot.faa -db integron_cassettes4blastdb.faaED -outfmt '7 qseqid sseqid qlen slen stitle length pident positive mismatch gaps evalue bitscore qcovs' -num_alignments 10 > blast_m7.tmp
# 2. use process substitution to parse the -outfmt 7 output file blast_m7.tmp, and generate the standard tabular optput with labeled columns
cat <(awk '/^# Fields:/ {sub(/# Fields: /, ""); gsub(/, /, "\t"); gsub(/ /, "_"); print $0; exit}' blast_m7.tmp) <(awk '!/^# /{print $0 }' blast_m7.tmp) | less
The aim here is to identify the possible gene-cassettes encoded in the PCR amplicons collected in 3cass_amplicons.fna, as explained in the previous background info section.
BLASTX is generally used to find CDSs (protein-coding sequences) in genomic DNA or transcripts. As we will see, BLASTX is a powerful gene-finding tool, but we will need to examine many alignments to ensure that all CDSs encoded on a given DNA segment are found. For large DNA strings (>5-10 kb), I would recommend to segment the string into shorter, overlapping pieces, before running BLASTX
Remember that BLASTX uses a DNA query, which is translated in its 6 frames using by default the universal code, and the resulting products are used to interrogate a protein database.
# run blastx asking for the top 3 hits in tabular output format and an evalue cutoff set at 1e-100
# adding relevant information to the table, missing in the standard -m 6 output, like qlen, selen, qframe & qcovs
echo -e 'qseqid\tsseqid\tqlen\tslen\tqstart\tqend\tqframe\tsstart\tsend\tlength\tpident\tpositive\tmismatch\tgaps\tevalue\tbitscore\tqcovs' > header1.tab
blastx -query 3cass_amplicons.fna -query_gencode 11 -db integron_cassettes4blastdb.faaED -evalue 1e-100 -outfmt '6 qseqid sseqid qlen slen qstart qend qframe sstart send length pident positive mismatch gaps evalue bitscore qcovs' -num_alignments 3 -out 3blastX_hits_m6_header1.tab
# if 3blastX_hits_m6_header1.tab exists and has a non-zero size, then execute the following commands ...
[ -s 3blastX_hits_m6_header1.tab ] && cat header1.tab 3blastX_hits_m6_header1.tab > ed && mv ed 3blastX_hits_m6_header1.tab
head 3blastX_hits_m6_header1.tab | column -t
qseqid sseqid qlen slen qstart qend qframe sstart send length pident positive mismatch gaps evalue bitscore qcovs Am1 gnl|casseteDB|146151084 1706 273 887 1699 2 2 273 272 98.897 270 2 1 0.0 536 48 Am1 gnl|casseteDB|157674384 1706 270 899 1699 2 4 270 267 99.625 266 1 0 0.0 533 47 Am1 gnl|casseteDB|161087337 1706 268 899 1699 2 2 268 267 99.625 266 1 0 0.0 533 47 Ap18 gnl|casseteDB|90265366 1762 277 39 836 3 11 276 266 95.865 258 11 0 0.0 508 45 Ap18 gnl|casseteDB|84095097 1762 285 39 836 3 19 284 266 95.489 258 12 0 3.97e-180 506 45 Ap18 gnl|casseteDB|156628012 1762 285 39 836 3 19 284 266 95.113 257 13 0 2.43e-179 504 45 Ap26 gnl|casseteDB|156628012 945 285 16 870 1 1 285 285 99.649 285 1 0 0.0 573 90 Ap26 gnl|casseteDB|133905080 945 336 1 870 1 47 336 290 97.241 285 8 0 0.0 572 92 Ap26 gnl|casseteDB|84095097 945 285 16 870 1 1 285 285 99.649 284 1 0 0.0 572 90
As we saw in the output from the previous \(blastx\) run, the hits for each query sequence align only to a certain portion of the latter. We will need to substantially increase the number of hits to have a chance to get lower-scoring HSPs that align to other regions (smaller genes) of the DNA sequence used as query.
#>>> Lets try to improve the analysis by increasing the no. of hits for each sequence to 100
# and adding the stitle column to the output table
echo -e 'qseqid\tsseqid\tstitle\tqlen\tslen\tqstart\tqend\tsstart\tsend\tlength\tpident\tpositive\tmismatch\tgaps\tevalue\tbitscore\tqcovs\tqframe' > header2.tab
#>>> print a numvered list of the positional indexes of each column
sed 's/\t/\n/g' header2.tab | nl
1 qseqid 2 sseqid 3 stitle 4 qlen 5 slen 6 qstart 7 qend 8 sstart 9 send 10 length 11 pident 12 positive 13 mismatch 14 gaps 15 evalue 16 bitscore 17 qcovs 18 qframe
blastx -query 3cass_amplicons.fna -query_gencode 11 -db integron_cassettes4blastdb.faaED -evalue 1e-10 -outfmt '6 qseqid sseqid stitle qlen slen qstart qend sstart send length pident positive mismatch gaps evalue bitscore qcovs qframe' -num_alignments 100 -out 100blastX_hits_m6_header2.tab
[ -s 100blastX_hits_m6_header2.tab ] && cat header2.tab 100blastX_hits_m6_header2.tab > ed && mv ed 100blastX_hits_m6_header2.tab
# print relevant table fields
cut -f1,2,6,7,11,17,18 100blastX_hits_m6_header2.tab | column -t
qseqid sseqid qstart qend pident qcovs qframe Am1 gnl|casseteDB|146151084 887 1699 98.897 48 2 Am1 gnl|casseteDB|157674384 899 1699 99.625 47 2 Am1 gnl|casseteDB|161087337 899 1699 99.625 47 2 Am1 gnl|casseteDB|10444107 908 1699 100.000 46 2 Am1 gnl|casseteDB|21898659 908 1699 99.621 46 2 ... Am1 gnl|casseteDB|38491398 908 1693 72.137 46 2 Am1 gnl|casseteDB|45502091 908 1516 87.685 36 2 Am1 gnl|casseteDB|10444105 3 500 100.000 29 3 Am1 gnl|casseteDB|115605632 3 500 99.398 29 3 Am1 gnl|casseteDB|48526059 3 500 99.398 29 3 Am1 gnl|casseteDB|56541576 3 500 98.795 29 3 Am1 gnl|casseteDB|110084170 899 1678 60.769 46 2 Am1 gnl|casseteDB|161087327 3 488 100.000 28 3 Am1 gnl|casseteDB|110588863 908 1678 60.311 45 2 ... Am1 gnl|casseteDB|10444106 612 902 98.969 17 3 Am1 gnl|casseteDB|6003516 612 902 97.938 17 3 Am1 gnl|casseteDB|56541577 612 902 97.938 17 3 Am1 gnl|casseteDB|21898658 612 902 94.845 17 3 Am1 gnl|casseteDB|14091025 588 902 78.095 18 3 Am1 gnl|casseteDB|110588887 612 902 78.351 17 3 Am1 gnl|casseteDB|38492198 588 887 77.000 18 3 Am1 gnl|casseteDB|134035531 612 887 76.087 16 3 Am1 gnl|casseteDB|82698267 612 875 76.136 15 3 Am1 gnl|casseteDB|82698261 612 902 76.289 17 3 Am1 gnl|casseteDB|111661582 18 488 36.709 28 3 Am1 gnl|casseteDB|164449571 18 488 38.608 28 3 ...
What can we conclude from the selected output lines shown above for Am1 hits?
#>>> This 1liner generates a non-redundant list of the qstart qend positions for each of the query sequences.
# These are extracted from the full table with $(cut -f1 100blastX_hits_m6_header2.tab | grep -v qseqid | sort -u)
# Then, for each query, grep it out of the table with grep $query 100blastX_hits_m6_header2.tab and keep only the fields 6,7: qstart, qend
# We separte and label the results corresponding to each query sequence using the echo "#>>> $query"; as header
# End the corresponding rerport with echo '------------------------';
for query in $(cut -f1 100blastX_hits_m6_header2.tab | sed '1d' | sort -u)
do
echo "#>>> $query"
grep $query 100blastX_hits_m6_header2.tab | cut -f6,7 | sort -u
echo '------------------------'
done
##### OUTPUT of the for loop from the previous block #>>> Am1 1361 1699 15 488 18 488 33 488 3 488 3 500 588 887 588 902 612 875 612 887 612 902 887 1699 899 1678 899 1699 908 1303 908 1516 908 1678 908 1684 908 1693 908 1696 908 1699 920 1699 ------------------------ #>>> Ap18 1046 1762 1052 1762 1079 1528 1079 1702 1085 1738 1088 1738 1118 1738 1130 1738 1139 1738 36 836 39 812 39 836 48 173 48 443 48 656 48 797 48 812 48 830 48 836 48 860 501 836 60 836 ------------------------ #>>> Ap26 1 192 16 867 16 870 1 870 532 870 58 870 61 870 70 870 79 204 79 474 79 687 79 861 79 864 79 867 79 870 91 870 ------------------------
# Identify the most likely gene start|end positions
grep Am1 100blastX_hits_m6_header2.tab | cut -f6,7 | sort | uniq -c | sort -nk2
1 3 488 4 3 500 2 15 488 13 18 488 2 33 488 1 588 887 1 588 902 1 612 875 1 612 887 6 612 902 1 887 1699 1 899 1678 6 899 1699 1 908 1303 1 908 1516 2 908 1696 35 908 1699 3 908 1693 4 908 1684 6 908 1678 7 920 1699 1 1361 1699
#>>> Lets explore this in more detail. We'll start with the Am1 query
# One class of hits seems to start at position 3, ending at around position 500
# What are the hits? Lets grep them out of the table;
# note the use of a regular expression '3[[:space:]]500' which denotes a 3 followed by a space or tab,
# followed by a 500 (i.e. our qstart qend coordinates)
grep Am1 100blastX_hits_m6_header2.tab | grep '3[[:space:]]500' | cut -f1,3,6,7,11,17,18 | column -ts $'\t'
# Or somewhat more flexibly, using AWK, which will allow us to include the header in the output, as shown below
awk 'BEGIN{FS=OFS="\t"}NR==1 || (/^Am1/ && $6 == 3 && $7 == 500){print $1,$3,$6,$7,$17,$18}' Am1 100blastX_hits_m6_header2.tab | column -ts $'\t'
qseqid stitle qstart qend qcovs qframe Am1 |[Serratia marcescens]|dhfrXII|dihydrofolate reductase type XII|498|AF284063| 3 500 29 3 Am1 |[Acinetobacter baumannii]|dhrfXII|dihydrofolate reductase|498|DQ995286|19365688 3 500 29 3 Am1 |[Salmonella enterica subsp. enterica serovar Choleraesuis]|dhfr|Dhfr|498|AY551331|15145503 3 500 29 3 Am1 |[Enterococcus faecalis]|dhfrXII|dihydrofolate reductase|498|AB196348|16451182 3 500 29 3
#>>> Retrieve the best hit gnl|casseteDB|10444105 found by the search in the previous step
blastdbcmd -db integron_cassettes4blastdb.faaED -entry 'gnl|casseteDB|10444105' -dbtype prot -out S_marcescens_dhrXII_casseteDB-10444105.faa
#>>> Retrieve all hits found by the search in the previous step for the Am1 gene starting at pos 899 and ending at pos 1699
grep Am1 100blastX_hits_m6_header2.tab | grep '899[[:space:]]1699'
grep Am1 100blastX_hits_m6_header2.tab | grep '899[[:space:]]1699' | cut -f2 > Am1_qs899_qe1699_hit_IDs.list
blastdbcmd -db integron_cassettes4blastdb.faaED -entry_batch Am1_qs899_qe1699_hit_IDs.list -dbtype prot -out Am1_qs899_qe1699_AadA_best_hits.faa
As we’ve learned in the example above, we will need to explore many
alignments from a BLASTX search in order to make sure that we
identify all CDSs on a DNA string. The blast-imager.pl
script is a very convenient tool to visualize results, when we have
large lists of hits matching on different parts of our query sequence,
exactly as in our case.
To run the script, we will first split the multifasta of query sequences into individual sequences, to generate graphs the the hit table of individual queries.
# 1. If workin on tepeu or buluc, copy required scripts to your $HOME/bin directory
[ ! -d $HOME/bin ] && mkdir $HOME/bin
cp /home/vinuesa/cursos/intro_biocomp/scripts/*.* $HOME/bin
# 2. if you don't have access to the servers @ LCG, fetch the scripts directly from the command line
wget -c https://raw.githubusercontent.com/vinuesa/TIB-filoinfo/master/blast-imager.pl
wget -c https://raw.githubusercontent.com/vinuesa/TIB-filoinfo/master/fasta_toolkit.awk
wget -c https://raw.githubusercontent.com/vinuesa/TIB-filoinfo/master/split_fasta.pl
chmod 755 blast-imager.pl fasta_toolkit.awk split_fasta.pl
# Split the multifasta into the individual source sequence files,
# that is, generate single fasta files from the multifasta source file
split_fasta.pl 3cass_amplicons.fna
ls -ltr | tail -3
-rw-r--r--. 1 vinuesa faculty 967 oct 18 21:54 Ap26.fa -rw-r--r--. 1 vinuesa faculty 1798 oct 18 21:54 Ap18.fa -rw-r--r--. 1 vinuesa faculty 1740 oct 18 21:54 Am1.fa
# Run the blast-imager.pl script
# i) first writing the tables to file
#for file in *fa
#do
# blastx -query $file -query_gencode 11 -db integron_cassettes4blastdb.faaED -evalue 1e-20 -outfmt 6 -num_alignments 100 -out ${file%.fa}_blastxout.tab;
# ./blast-imager.pl ${file%.fa}_blastxout.tab > ${file%.*}.png
# rm ${file%.fa}_blastxout.tab
#done
# ii) or directly to the graph, by piping the blastx oputput directly into the blast-imager script like so:
#for file in *fa; do blastx -query $file -query_gencode 11 -db integron_cassettes4blastdb.faaED -evalue 1e-20 -outfmt 6 -num_alignments 100 | ../../blast-imager.pl > ${file%.*}.png; done
# Run again, imposing some filtering to remove the poorest hits: pident > 98%; mismatches < 4; gaps < 3
# which will render a better figure
for file in *fa; do blastx -query $file -query_gencode 11 -db integron_cassettes4blastdb.faaED -evalue 1e-20 -outfmt 6 -num_alignments 100 | awk '$3 > 98 && $5 < 4 && $6 < 3' | blast-imager.pl > ${file%.*}.png; done
# Have a look at the results for the three query sequences using the "eye of GNOME - eog"
eog Am1.png &
eog Ap18.png &
eog Ap26.png &
# or all at once:
eog *.png &
Figure 3. The graphical output generated by
blast-imager.pl
for the 100 Am1 blastx hits
The graphical display shown above in Fig. 3 and the numeric coordinates extracted previously from the hit table, strongly suggest that there are 3 gene cassettes encoded in the amplicon generated for Am1.
1 3 488 4 3 500 2 15 488 13 18 488 2 33 488 1 588 887 1 588 902 1 612 875 1 612 887 6 612 902 1 887 1699 1 899 1678 6 899 1699 1 908 1303 1 908 1516 2 908 1696 35 908 1699 3 908 1693 4 908 1684 6 908 1678 7 920 1699 1 1361 1699
Looking closely at the coordinates and figure 3, we can write the following table with the likely gene coordinates
Table 1. Coordinates for gene cassettes encoded in the Am1 amplicon DNA sequence.
start | end | strand | note |
---|---|---|---|
3 | 500 | + | - |
612 | 902 | + | The 887 and 899 starts would overlap with the first gene; this is not typical for gene cassettes |
908 | 1699 | + | - |
We will use the fasta_toolkit.awk
script to extract the
three gene cassettes from the corresponding DNA string using the their
coordinates, as defined in Table 1.
fasta_toolkit.awk -R 3 -s 3 -e 500 Am1.fa
>Am1 ATGAACTCGGAATCAGTACGCATTTATCTCGTTGCTGCGATGGGAGCCAATCGGGTTATTGGCAATGGTCCTAATATCCCCTGGAAAATTCCGGGTGAGCAGAAGATTTTTCGCAGACTCACTGAGGGAAAAGTCGTTGTCATGGGGCGAAAGACCTTTGAGTCTATCGGCAAGCCTCTACCGAACCGTCACACATTGGTAATCTCACGCCAAGCTAACTACCGCGCCACTGGCTGCGTAGTTGTTTCAACGCTGTCGCACGCTATCGCTTTGGCATCCGAACTCGGCAATGAACTCTACGTCGCGGGCGGAGCTGAGATATACACTCTGGCACTACCTCACGCCCACGGCGTGTTTCTATCTGAGGTACATCAAACCTTCGAGGGTGACGCCTTCTTCCCAATGCTCAACGAAACAGAATTCGAGCTTGTCTCAACCGAAACCATTCAAGCTGTAATTCCGTACACCCACTCCGTTTATGCGCGTCGAAACGGCTAA
# write first CDS to Am1_cassette_CDSs.fna, and append with 2cnd and 3rd to it using >>
fasta_toolkit.awk -R 3 -s 3 -e 500 Am1.fa > Am1_cassette_CDSs.fna
fasta_toolkit.awk -R 3 -s 612 -e 902 Am1.fa >> Am1_cassette_CDSs.fna
fasta_toolkit.awk -R 3 -s 908 -e 1699 Am1.fa >> Am1_cassette_CDSs.fna
grep '^>' Am1_cassette_CDSs.fna
>Am1 >Am1 >Am1
perl -pe 'if(/^>(\S+)/){ $c++; $h=$1 . "_ORF" . $c; s/$1/$h/ }' Am1_cassette_CDSs.fna
perl -pe 'if(/^>(\S+)/){ $c++; $h=$1 . "_ORF" . $c; s/$1/$h/ }' Am1_cassette_CDSs.fna > ed && mv ed Am1_cassette_CDSs.fna
cat Am1_cassette_CDSs.fna
>Am1_ORF1 ATGAACTCGGAATCAGTACGCATTTATCTCGTTGCTGCGATGGGAGCCAATCGGGTTATTGGCAATGGTCCTAATATCCCCTGGAAAATTCCGGGTGAGCAGAAGATTTTTCGCAGACTCACTGAGGGAAAAGTCGTTGTCATGGGGCGAAAGACCTTTGAGTCTATCGGCAAGCCTCTACCGAACCGTCACACATTGGTAATCTCACGCCAAGCTAACTACCGCGCCACTGGCTGCGTAGTTGTTTCAACGCTGTCGCACGCTATCGCTTTGGCATCCGAACTCGGCAATGAACTCTACGTCGCGGGCGGAGCTGAGATATACACTCTGGCACTACCTCACGCCCACGGCGTGTTTCTATCTGAGGTACATCAAACCTTCGAGGGTGACGCCTTCTTCCCAATGCTCAACGAAACAGAATTCGAGCTTGTCTCAACCGAAACCATTCAAGCTGTAATTCCGTACACCCACTCCGTTTATGCGCGTCGAAACGGCTAA >Am1_ORF2 ATGTTTATTCAAACGGCATTTAGCTTTTCAGGCGTTATTCAGTGCCTGTTTTGCCTTTTTTCCGGGCTTCGCCTGCATGGGCTGCGCAGGTTTTCAGTCTTTTTGGCCTCTAGCCCTTGCGTAGCAAGCGCAAGCAGCTATCGTTTTTGCAGTGCTGTGCCGCCTCGGTGGCGCAGCGTTTTTTCACGGTTAGCGCCCGTCGCCAAATTCAAGTTATCCGTTTTGGCTTCTGGGTCTAACATTTCGGTCAAGCCGACCGCCATTCTGCGGTCGGCTTACCTCGCCCGTTAG >Am1_ORF3 ATGAGGGAAGCGGTGACCATCGAAATTTCGAACCAACTATCAGAGGTGCTAAGCGTCATTGAGCGCCATCTGGAATCAACGTTGCTGGCCGTGCATTTGTACGGCTCCGCAGTGGATGGCGGCCTGAAGCCATACAGCGATATTGATTTGTTGGTTACTGTGGCCGTAAAGCTTGATGAAACGACGCGGCGAGCATTGCTCAATGACCTTATGGAGGCTTCGGCTTTCCCTGGCGAGAGCGAGACGCTCCGCGCTATAGAAGTCACCCTTGTCGTGCATGACGACATCATCCCGTGGCGTTATCCGGCTAAGCGCGAGCTGCAATTTGGAGAATGGCAGCGCAATGACATTCTTGCGGGTATCTTCGAGCCAGCCATGATCGACATTGATCTAGCTATCCTGCTTACAAAAGCAAGAGAACATAGCGTTGCCTTGGTAGGTCCGGCAGCGGAGGAATTCTTTGACCCGGTTCCTGAACAGGATCTATTCGAGGCGCTGAGGGAAACCTTGAAGCTATGGAACTCGCAGCCCGACTGGGCCGGCGATGAGCGAAATGTAGTGCTTACGTTGTCCCGCATTTGGTACAGCGCAATAACCGGCAAAATCGCGCCGAAGGATGTCGCTGCCGACTGGGCAATAAAACGCCTACCTGCCCAGTATCAGCCCGTCTTACTTGAAGCTAAGCAAGCTTATCTGGGACAAAAAGAAGATCACTTGGCCTCACGCGCAGATCACTTGGAAGAATTTATTCGCTTTGTGAAAGGCGAGATCATCAAGTCAGTTGGTAAATGA
This operation can be easily performed with the help of
fasta_toolkit.awk
fasta_toolkit.awk -R 4 Am1_cassette_CDSs.fna > Am1_cassette_CDSs_translated.faa
cat Am1_cassette_CDSs_translated.faa
>Am1_ORF1 MNSESVRIYLVAAMGANRVIGNGPNIPWKIPGEQKIFRRLTEGKVVVMGRKTFESIGKPLPNRHTLVISRQANYRATGCVVVSTLSHAIALASELGNELYVAGGAEIYTLALPHAHGVFLSEVHQTFEGDAFFPMLNETEFELVSTETIQAVIPYTHSVYARRNG* >Am1_ORF2 MFIQTAFSFSGVIQCLFCLFSGLRLHGLRRFSVFLASSPCVASASSYRFCSAVPPRWRSVFSRLAPVAKFKLSVLASGSNISVKPTAILRSAYLAR* >Am1_ORF3 MREAVTIEISNQLSEVLSVIERHLESTLLAVHLYGSAVDGGLKPYSDIDLLVTVAVKLDETTRRALLNDLMEASAFPGESETLRAIEVTLVVHDDIIPWRYPAKRELQFGEWQRNDILAGIFEPAMIDIDLAILLTKAREHSVALVGPAAEEFFDPVPEQDLFEALRETLKLWNSQPDWAGDERNVVLTLSRIWYSAITGKIAPKDVAADWAIKRLPAQYQPVLLEAKQAYLGQKEDHLASRADHLEEFIRFVKGEIIKSVGK*
We will run BLASTP a final time to get the best functional annotations for the three ORFs (gene cassettes) identified via BLASTX in the Am1 amplicon.
# 1. run blastp with the translated products and get 10 top hits
echo -e 'qseqid\tsseqid\tqlen\tslen\tstitle\tlength\tpident\tpositive\tmismatch\tgaps\tevalue\tbitscore\tqcovs' > header1.tab
blastp -query Am1_cassette_CDSs_translated.faa -db integron_cassettes4blastdb.faaED -outfmt '6 qseqid sseqid qlen slen stitle length pident positive mismatch gaps evalue bitscore qcovs' -num_alignments 10 > blastp_10hits.out
cat header1.tab blastp_10hits.out > ed && mv ed blastp_10hits.out
# 2. reduce the output table to relevant fields for easier visualization
sed 's/\t/\n/g' header1.tab | cat -n
1 qseqid 2 sseqid 3 qlen 4 slen 5 stitle 6 length 7 pident 8 positive 9 mismatch 10 gaps 11 evalue 12 bitscore 13 qcovs
awk 'BEGIN{FS="\t"; OFS=","; print "qseqid,qlen,slen,stitle,pident,bitscore,qcovs"}{print $1,$3,$4,$5,$7,$12,$13}' blastp_10hits.out > Am1_top10_blastp_hits_for_3ORFs.csv
awk 'BEGIN{FS=OFS="\t"; print "qseqid\tqlen\tslen\tstitle\tpident\tbitscore\tqcovs"}{print $1,$3,$4,$5,$7,$12,$13}' blastp_10hits.out > Am1_top10_blastp_hits_for_3ORFs.tsv
qseqid,qlen,slen,stitle,pident,bitscore,qcovs Am1_ORF1,166,166,|[Serratia marcescens]dhfrXII|dihydrofolate reductase type XII|498|AF284063|unpubl,100.000,340,100 Am1_ORF1,166,166,|[Acinetobacter baumannii]|ZJB04|||dhrfXII|dihydrofolate reductase|498|DQ995286|Xu_H.|Curr.Microbiol.59_113-117_2009|19365688,99.398,338,100 Am1_ORF1,166,166,|[Salmonella enterica subsp. enterica serovar Choleraesuis]dhfr|Dhfr|498|AY551331|Huang_T.M.|Vet.Microbiol.100_247-254_2004|15145503,99.398,338,100 Am1_ORF1,166,166,|[Enterococcus faecalis]|||human male bone|China:GuangzhoudhfrXII|dihydrofolate reductase|498|AB196348|Su_J.|FEMSMicrobiol.Lett.254_75-80_2006|16451182,98.795,337,100 Am1_ORF1,166,178,|[Salmonella enterica subsp. enterica serovar Choleraesuis]|OU7519|hybrid of pSCV50 and pSC138||tri|Tri|534|EU219534|unpubl,100.000,333,98 Am1_ORF1,166,158,|[Escherichia coli]||Rs411b||dhfr17|dihydrofolate reductase|474|DQ875874|Ozugumus_O.B.|J.Microbiol.45_379-387_2007|17978796,36.709,108,95 Am1_ORF1,166,163,|[Escherichia coli]dfr17|dihydrofolate reductase|491|AY828551|Zhang_H.|Microbiol.Immunol.48_639-645_2004|15383699,36.709,108,95 Am1_ORF1,166,158,|[Acinetobacter baumannii]||607460||dfrVII|dihydrofolate reductase|474|EU340417|Xu_X.|Int.J.Antimicrob.Agents32_441-445_2008|18757181,38.608,107,95 Am1_ORF1,166,183,|[Escherichia coli]DfrA17||549|AM886293|unpubl,36.709,108,95 Am1_ORF1,166,158,|[Escherichia coli]||clinical isolate||dhfrIb|dihydrofolate reductase type Ib|474|Z50805|unpubl,37.342,105,95 Am1_ORF2,97,97,|[Serratia marcescens]|unknown|291|AF284063|unpubl,98.969,183,100 Am1_ORF2,97,97,|[Klebsiella pneumoniae]|KpS15unknown|291|AF180731|unpubl,97.938,183,100 Am1_ORF2,97,97,|[Enterococcus faecalis]|||human male bone|China:Guangzhou|putative protein|291|AB196348|Su_J.|FEMSMicrobiol.Lett.254_75-80_2006|16451182,97.938,181,100 Am1_ORF2,97,97,|[uncultured bacterium]|unknown|291|AY115474|Tennstedt_T.|FEMSMicrobiol.Ecol.45_239-252_2003|19719593,94.845,176,100 Am1_ORF2,97,97,|[uncultured bacterium]|||arable soil||hypothetical protein|291|DQ779005|Heuer_H|Environ.Microbiol.9_657-666_2007|17298366,78.351,137,100 Am1_ORF2,97,105,|[Aeromonas salmonicida subsp. salmonicida]|123-86 3101unknown|315|AF327731|L'Abee-Lund_T.M|Microb.DrugResist.7_263-272_2001|11759088,78.351,137,100 Am1_ORF2,97,101,|[Pseudomonas aeruginosa]|orfD|303|AY460181|unpubl,77.174,129,95 Am1_ORF2,97,93,|[Pseudomonas aeruginosa]||4677||MexicoorfD|hypothetical protein|279|EF184217|unpubl,76.087,127,95 Am1_ORF2,97,88,|[Salmonella enterica subsp. enterica serovar Enteritidis]||Homo sapiens||Brazil|unknown|266|DQ278190|Peirano_G.|J.Antimicrob.Chemother.58_305-309_2006|16782743,76.136,122,91 Am1_ORF2,97,97,|[Salmonella enterica subsp. enterica serovar Agona]||Homo sapiens||Brazil|unknown|291|DQ278189|Peirano_G.|J.Antimicrob.Chemother.58_305-309_2006|16782743,76.289,115,100 Am1_ORF3,264,273,|[Klebsiella pneumoniae]|NK29aadA2|819|EF382672|Chen_Y.T.|Antimicrob.AgentsChemother.51_3004-3007_2007|17526756,100.000,531,100 Am1_ORF3,264,264,|[Serratia marcescens]aadA2|streptomycin/spectinomycin adenyltransferase|792|AF284063|unpubl,100.000,530,100 Am1_ORF3,264,264,|[uncultured bacterium]aadA2|streptomycin/spectinomycin adenylyltransferase|792|AY115474|Tennstedt_T.|FEMSMicrobiol.Ecol.45_239-252_2003|19719593,99.621,529,100 Am1_ORF3,264,264,|[Aeromonas hydrophila]|CVM861||diarrheic pig|aada2|aminoglycoside adenyltransferase|792|AY522923|Poole_T.L.|J.Antimicrob.Chemother.57_31-38_2006|16339607,99.621,529,100 Am1_ORF3,264,268,|[Salmonella enterica subsp. enterica serovar Choleraesuis]|OU7519|hybrid of pSCV50 and pSC138||aadA2|AadA2|804|EU219534|unpubl,99.621,528,100 Am1_ORF3,264,270,|[Proteus mirabilis]|C04014|||aadA2|AadA2|810|EU006711|Boyd_D.A.|Antimicrob.AgentsChemother.52_340-344_2008|18025121,99.621,528,100 Am1_ORF3,264,264,|[Proteus mirabilis]aadA2|aminoglycoside adenyltransferase|792|AY681136|unpubl,99.621,528,100 Am1_ORF3,264,264,|[uncultured bacterium]|||arable soil|aadA2|aminoglycoside-3'-adenylyltransferase|792|DQ779004|Heuer_H|Environ.Microbiol.9_657-666_2007|17298366,99.621,527,100 Am1_ORF3,264,264,|[Salmonella typhimurium DT104]|H3380|Salmonella typhimurium phage type DT104||aadA2|streptomycin/spectinomycin adenyltransferase|792|AF071555|Briggs_C.E|Antimicrob.AgentsChemother.43_846-849_1999|10103189,99.242,526,100 Am1_ORF3,264,264,|[Escherichia coli]|W21-3||sewage of swine farm|China: GuangzhouaadA2|streptomycin/spectinomycin 3' adenyltransferase|792|DQ157750|unpubl,98.864,522,100
Table 2 Annotation of the gene-cassettes found in the Am1 amplicon
ORF | start | end | strand | gene | annotation |
---|---|---|---|---|---|
1 | 3 | 500 | + | dhfrXII | dihydrofolate reductase type XII |
2 | 612 | 902 | + | orfD | hypothetical protein |
3 | 908 | 1699 | + | aadA2 | aminoglycoside adenyltransferase type II |
Figure 4. Prevalence of class-1 integrons and distribution of identified gene cassette arrays among Enterobacteriaceae isolates recovered from clean and contaminated rivers in Morelos, Mexico. Results from Jazmín Ramos Madrigal’s bachelor thesis @lcgunam.
Repeat the BLASTX based identification of gene cassettes for the Ap18 and Ap26 amplicons, extract the corresponding CDSs, translate them and annotate them with BLASTP, writing a table with the single-best annotation for each ORF.