Homograph: exploring protein homology and orthology in whole genomes




Making your own Databases
Homograph uses very simple text files as input. Any comparison requires two sets of files, which I shall call "LIST" and "COMPARISON" files for now.

"LIST" files:
A "LIST" file contains the list of genes (in chromosomal order) for an organism, together with information such as functional annotation.
We are using PTT files from the NCBI as our "LIST" files. A Protein Table (PTT) file contains the location, strand, length, ID, gene name, COG assignment and functional annotation for every predicted protein in a whole genome.
You can download PTT files (.ptt) from the NCBI's ftp site.
If for some reason (to include non-translated genes, for example) you wish to make your own "LIST" files, all they need is a unique ID for each gene, with the genes listed in chromosomal order.
A couple of lines from E. coli's PTT file would look something like this:
2801..3733 + 310 16127997 thrB b0003 COG0083 homoserine kinase
3734..5020 + 428 16127998 thrC b0004 COG0498 threonine synthase
The columns are start..stop, strand, size, ID, gene, synonym, COG and product.
The numbers in the fourth column (16127997 and 16127998 here) are the unique IDs for the proteins in question and are the only strictly necessary part of the LIST files (see below).
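Because the ID column is the only part Homograph strictly needs, it can be worth checking that the IDs in a hand-made "LIST" file really are unique. The following is a minimal sketch, assuming whitespace-separated columns with the ID in the fourth column, as in the two PTT lines above. The file name my_list.ptt is hypothetical, and PTT files downloaded from the NCBI may carry header lines that would need to be skipped first.

```shell
# Build a tiny example LIST file from the two E. coli lines shown above.
# (my_list.ptt is a hypothetical file name used only for this sketch.)
workdir=$(mktemp -d)
list_file="$workdir/my_list.ptt"
printf '%s\n' \
  '2801..3733 + 310 16127997 thrB b0003 COG0083 homoserine kinase' \
  '3734..5020 + 428 16127998 thrC b0004 COG0498 threonine synthase' \
  > "$list_file"

# Collect every ID (4th column) that occurs more than once.
# An empty result means all IDs are unique.
duplicate_ids=$(awk '{ print $4 }' "$list_file" | sort | uniq -d)

if [ -z "$duplicate_ids" ]; then
  echo "All IDs are unique"
else
  echo "Duplicated IDs:"
  echo "$duplicate_ids"
fi
```

The check says nothing about chromosomal order, which still has to be ensured when the file is built.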

"COMPARISON" files:
These files are simply a list (in any order) of pairs of genes that pass certain criteria. In this case, we are using BLAST expectation values (E-values) to collect all the pairs of possibly homologous proteins for each pair of genomes.
Again, the format is very simple and only needs to contain each pair of genes (as unique IDs) that pass a certain cutoff and their score (E-value in this case).
Below is a hypothetical part of an E. coli vs E. coli "COMPARISON" file:
16127997 16127997 e-173
16127997 16127998 e-14
16127998 16127998 0.0
The first two numbers on each line (the unique IDs) correspond to those in the "LIST" file, and this is how Homograph knows which dots to draw.
In order to make these "COMPARISON" files, whole-genome BLASTs need to be performed. This is easier than it might appear.
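Since the IDs are the only link between the two kinds of file, a quick cross-check can catch mismatches before loading them into Homograph. This is only a sketch, assuming whitespace-separated columns, the ID in the fourth column of the "LIST" file, and the E. coli vs E. coli example above; the file names are hypothetical.

```shell
# Recreate the example LIST and COMPARISON files from this page.
# (Both file names are hypothetical, chosen for this sketch.)
workdir=$(mktemp -d)
list_file="$workdir/E_coli.ptt"
comparison_file="$workdir/E_coli.E_coli.hom"

printf '%s\n' \
  '2801..3733 + 310 16127997 thrB b0003 COG0083 homoserine kinase' \
  '3734..5020 + 428 16127998 thrC b0004 COG0498 threonine synthase' \
  > "$list_file"

printf '%s\n' \
  '16127997 16127997 e-173' \
  '16127997 16127998 e-14' \
  '16127998 16127998 0.0' \
  > "$comparison_file"

# While reading the first file (NR == FNR), remember each LIST ID.
# While reading the second file, print any COMPARISON ID not seen in the LIST.
unknown_ids=$(awk 'NR == FNR { known[$4] = 1; next }
  !($1 in known) { print $1 }
  !($2 in known) { print $2 }' "$list_file" "$comparison_file" | sort -u)

if [ -z "$unknown_ids" ]; then
  echo "Every COMPARISON ID is present in the LIST file"
else
  echo "IDs missing from the LIST file:"
  echo "$unknown_ids"
fi
```

For a comparison between two different genomes, each ID column would have to be checked against the "LIST" file of its own genome instead.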

First, you will need to install BLAST and obtain all the protein sequences for the genomes you wish to compare. The latter can again be downloaded from the NCBI's ftp site (.faa files in this case).
Each of these files, which are in FASTA format, can be converted into a BLASTable database using formatdb, which is included with BLAST. For a file named E_coli_K12.faa you would do something like this:
formatdb -i E_coli_K12.faa -n ecoliK12
The name given after -n (ecoliK12 here) will be the name of your new BLAST database for E. coli K12.
With these BLAST databases you can now use blastall to obtain all the proteins in a .faa file that match (with a certain E-value cutoff) those in the BLAST databases. The following:
blastall -e 0.001 -m 8 -p blastp -d ecoliK12 -i S_typhi.faa -o S_typhi-vs-E_coli_K12.hom
would BLAST S. typhi's proteins against your E. coli database, return only matches with an expectation value of 0.001 or lower, and put these results in the output file named S_typhi-vs-E_coli_K12.hom in a fairly nice tabular format, courtesy of the -m 8 option.
We're almost done here. All that's missing is to simplify the output file so that it contains just the minimum necessary to be a "COMPARISON" file. Of all the columns in the output, all we need are the IDs for each gene and the E-value. In Linux, all you'd have to do is:
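With more than two genomes, the formatdb and blastall commands above have to be repeated for every database and every query/database pair. The following sketch shows one way this could be scripted. The genome list and the naming scheme (databases named after the .faa files) are assumptions made for this example, and the script only prints the commands so they can be inspected before anything is run.

```shell
# Hypothetical list of protein FASTA files, given without the .faa extension.
genomes="E_coli_K12 S_typhi"

commands=$(
  # One BLAST database per genome, named after the genome (an assumption of this sketch).
  for db in $genomes; do
    echo "formatdb -i $db.faa -n $db"
  done
  # One blastall run per ordered (query, database) pair, self comparisons included.
  for query in $genomes; do
    for db in $genomes; do
      echo "blastall -e 0.001 -m 8 -p blastp -d $db -i $query.faa -o $query-vs-$db.hom"
    done
  done
)

# Print the commands for inspection; nothing is executed.
echo "$commands"
```

Pairs that are not needed can simply be dropped from the printed list before running it.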
cut -f1,2,11 S_typhi-vs-E_coli_K12.hom | cut -c4-11,29,33-40,58- > S_typhi.E_coli.hom
The new file S_typhi.E_coli.hom now contains only columns 1, 2 and 11 from the original BLAST output (query name, match name and E-value respectively). The second cut removes everything except the ID from the query and match names. Because it works on fixed character positions, make sure that the numbers actually correspond to the characters you want to keep (i.e. ID numbers and E-value) in your own files.
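Since the character positions given to the second cut depend on the exact length of every name, a delimiter-based alternative may be less fragile. The following is a sketch only, assuming that query and match names look like gi|16127997|ref|NP_414544.1| (so that the ID is the second "|"-separated field) and that the BLAST output is the tab-separated, 12-column -m 8 table. The two sample lines are made up for illustration and are not real BLAST results.

```shell
# Hypothetical working files; the two input lines below are invented sample data.
workdir=$(mktemp -d)
blast_output="$workdir/S_typhi-vs-E_coli_K12.hom"
comparison_file="$workdir/S_typhi.E_coli.hom"

# Columns: query, match, % identity, alignment length, mismatches, gap openings,
# query start, query end, match start, match end, E-value, bit score.
printf 'gi|11111111|ref|NP_000001.1|\tgi|16127997|ref|NP_000002.1|\t95.16\t310\t15\t0\t1\t310\t1\t310\te-173\t601\n' > "$blast_output"
printf 'gi|11111112|ref|NP_000003.1|\tgi|16127998|ref|NP_000004.1|\t31.20\t125\t80\t3\t10\t130\t22\t144\t2e-14\t70.1\n' >> "$blast_output"

# Split the query and match names on "|" and keep the second piece (the ID),
# then print query ID, match ID and E-value (column 11), separated by tabs.
awk -F'\t' -v OFS='\t' '{
  split($1, q, "[|]")
  split($2, s, "[|]")
  print q[2], s[2], $11
}' "$blast_output" > "$comparison_file"

cat "$comparison_file"
```

If the names in your .faa files follow a different pattern, the field index (2 here) would need adjusting, in the same way as the character positions in the cut version.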

Configuring Homograph
If you wish to use your own databases, remember to modify your paths file accordingly.