These programs are distributed in the hope that they will be useful, but without any warranty;
without even the implied warranty of merchantability or fitness for any purpose. The entire risk
as to the quality and performance of the program is with the user.
The EASYMIFs and SITEHOUND programs were written by Dario Ghersi. The SITEHOUND-web server was
developed by Marylens Hernandez. Both worked in the group of Roberto Sanchez in the Department of Structural and
Chemical Biology, Mount Sinai School of Medicine.
These programs have been developed in the context of research work supported by grants from the National
Science Foundation (NSF) and by the National Institutes of Health (NIH) to Roberto Sanchez. Any opinions,
findings, and conclusions or recommendations expressed in this material are those of the authors and do not
necessarily reflect the views of the NSF or NIH.
Distribution of the programs is allowed only with the author’s written consent.
The molecular function of proteins is largely determined by their interaction with other molecules at binding sites on the protein surface. Thus, localization and characterization of a ligand-binding site can contribute to functional annotation of a protein; it can guide mutational experiments, and be useful in predicting or verifying interactions. The identification of ligand binding sites can also be an important part of the drug discovery process. Knowing the location of binding sites facilitates virtual screening for hits, lead optimization, and identification of features that influence the selectivity of binding.
EASYMIFs and SITEHOUND are two software tools that in combination enable the identification of binding sites in protein structures using an energy-based approach. EASYMIFs is a simple Molecular Interaction Field (MIF) calculator, and SITEHOUND is a post-processing tool for MIFs that identifies interaction energy clusters corresponding to putative binding sites. While these tools are most commonly used in combination, they can also be used separately. EASYMIFs can be used to calculate MIFs for binding site characterization, Quantitative Structure-Activity Relationship (QSAR) studies, selectivity analysis of protein families, pharmacophoric search, and other applications that require MIFs. In addition to the output of EASYMIFs, SITEHOUND can process the output of other MIF or Affinity Map calculation programs, such as GRID and the Autogrid tool of the AutoDock software package.
Molecular Interaction Fields (MIFs) describe the spatial variation of the interaction energy between a target molecule and a specific probe that usually represents a chemical group. Although the interaction energy field is, by definition, a continuous quantity, for computational convenience it is usually discretized on a three-dimensional orthogonal grid that surrounds the molecule of interest. The output of a MIF calculation is therefore an energy map that provides information about the potential energy between the probe and the molecule under analysis. EASYMIFs aims to provide a simple and rapid way to characterize a protein structure from a chemical standpoint at the global or local level (e.g. around an active site), returning maps that can be loaded in molecular graphics software such as PyMol, VMD or Chimera. The calculations are carried out in vacuo using the GROMOS force field and a distance-dependent dielectric, as described in detail in section 5.1.
The purpose of SITEHOUND is to process the output of the EASYMIFs program (and of other programs such as Autogrid and GRID) in order to predict regions on protein structures that are likely to be involved in binding to small molecules or peptides. The approach is based on the Q-SiteFinder algorithm, but offers more options and improvements. The program first discards all the grid points that have an energy value above a user-specified threshold (a negative value) and clusters the remaining points according to spatial proximity using single or average linkage agglomerative clustering (see Section 5.2). Subsequently, the Total Interaction Energy (TIE) of each cluster is computed and this value is used to rank the clusters, from the most negative to the least negative. The last step involves writing the results to text files and to files in the PDB and DX formats, which allow for graphical display of the results on the protein using standard molecular visualization tools (such as Chimera, PyMol or VMD). A convenient Web Interface is available at http://sitehound.sanchezlab.org that allows the user to input a PDB file and obtain the results automatically (see Appendix B).
The auto.py script is a wrapper around EASYMIFs and SITEHOUND that automates all the steps required for
binding site identification from a PDB file: protein preparation, MIF calculation via EASYMIFs, and binding site
identification with SITEHOUND. The default values are tuned for the methyl probe (CMET; a.k.a. “carbon
probe”) and the phosphate oxygen probe (OP; a.k.a. “phosphate probe”). The auto.py script can be executed
with the following command line:
auto.py -i PDB -p PROBE_TYPE
where PDB corresponds to the input PDB file, and PROBE_TYPE to the selected probe (currently CMET and OP have been tested). Usually a -k option is added to remove existing hetero atoms (see 2.2). If the calculations are successful, a number of output files will be produced. The main output files are described in the example below; a detailed description of all output files can be found in Chapters 3 (EASYMIFs) and 4 (SITEHOUND).
Binding site identification with the “carbon” (CMET) probe on the adenylate kinase structure can be performed
with the following command:
auto.py -i 1aky.pdb -p CMET -k
Binding site identification with the “phosphate” (OP) probe only requires a change in the -p option:
auto.py -i 1aky.pdb -p OP -k
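Running both probes on the same structure is a common pattern. The sketch below builds the two command lines shown above and runs them in turn. It assumes that auto.py is executable and on the PATH; the helper names auto_command and run_probes are illustrative and are not part of the distribution.

```python
import subprocess


def auto_command(pdb_file, probe, strip_hetero=True):
    """Build the auto.py command line for one probe (hypothetical helper)."""
    cmd = ["auto.py", "-i", pdb_file, "-p", probe]
    if strip_hetero:
        cmd.append("-k")  # remove existing hetero atoms
    return cmd


def run_probes(pdb_file, probes=("CMET", "OP")):
    """Run auto.py once per probe; stops at the first failing run."""
    for probe in probes:
        subprocess.run(auto_command(pdb_file, probe), check=True)


if __name__ == "__main__":
    # Only prints the commands; call run_probes("1aky.pdb") to execute them.
    for probe in ("CMET", "OP"):
        print(" ".join(auto_command("1aky.pdb", probe)))
```

Each run writes its own probe-tagged output files, so the two runs do not overwrite each other.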
The output files are tagged with the corresponding probe name (e.g. 1aky_CMET_summary.dat
and 1aky_OP_summary.dat). Three of the output files are most frequently used: _summary.dat,
_predicted.dat, and _clusters.pdb. For descriptions of the remaining output files see sections 3.2 and
The _summary.dat file contains a list of all identified clusters (predicted binding sites) ranked by Total
Interaction Energy (TIE). Real binding sites usually rank among the top three clusters  and have TIE values
that stand out from the background.
The _predicted.dat file lists the residues in the neighborhood of the predicted binding sites. Each line in
the file corresponds to a list of residues that are within 4.0 Å of the cluster in the input PDB file. The cluster is
indicated in the first column. Only data for the first 10 clusters (ranked by TIE) are included in this file.
The _clusters.pdb file contains the grid points that contribute to each of the clusters. The format is that of
a PDB file, and can be used to display the clusters in molecular graphics programs such as PyMOL (see
Figure 2.1). Each cluster is represented as one residue in the PDB file. The residue names have
the format CXX, where XX is the cluster index (e.g. cluster 1 has residue name C01). By default all
clusters are HETATM entries. Note that for some applications it may be necessary to convert them
to ATOM entries. This is the case, for example, to represent the clusters as surfaces in PyMOL (see
The auto.py script provides all the options necessary to control EASYMIFs and SITEHOUND. The complete list of
options supported by auto.py is given in Table A.1. For more details on the meaning of some of these options
see Chapters 3 and 4.
While EASYMIFs and SITEHOUND are commonly used in combination for binding site identification through the auto.py script, EASYMIFs can also be run separately to calculate Molecular Interaction Fields (MIFs) for binding site characterization, Quantitative Structure-Activity Relationship (QSAR) studies, selectivity analysis of protein families, pharmacophoric search, and other applications that require MIFs. This chapter describes how to run EASYMIFs directly (i.e. independently of auto.py), gives a more detailed description of the EASYMIFs output files, and explains how to visualize them. Some details on the EASYMIFs methodology can be found in Section 5.1.
EASYMIFs has been tested under Linux, Mac OS X and Windows XP and is currently a command-line only
program. After downloading and uncompressing the package, the directory can be moved to any location. It is
necessary to copy the files ’atom_types.txt’ and ’ffG43b1nb.params’ (that can be found in the EasyMIFs
directory) to the directory from where EASYMIFs is called.
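The steps necessary to carry out the calculations are just two:
1. Prepare the structure with the ’prepare_pdb.py’ script, which produces the .easymifs input file.
2. Run EASYMIFs on the .easymifs file to calculate the interaction energy map.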
As with any property calculated from a structure, the results are going to be as good as the quality of the
structure. For step 1 to be successful, the protein should not contain missing atoms or residues. Furthermore, it
is necessary to strip the protein of all heteroatoms (including water molecules). This can be accomplished by
using the ’-k’ flag in the ’prepare_pdb.py’ python script.
A typical run of EASYMIFs will be as follows:
easymifs -f=FILE.easymifs -p=PROBE -c=X,Y,Z -n=NX,NY,NZ -r=SPACING
where FILE is the name of the structure of interest, PROBE is one of the atom types described in the file ’atom_types.txt’ and listed in Appendix A, X,Y,Z are the coordinates used to center the box, NX,NY,NZ are the numbers of points along the three Cartesian axes (each must be an odd number) and SPACING is the spacing in Angstrom between the points in the grid (the recommended value is 1.0, or 0.5 for more detailed calculations).
It is also possible to let the program determine the dimensions of a box large enough to enclose the whole
protein, with a clearance of 5 Å from the protein in each direction (useful for binding site prediction). In this
case only the -f and -p options are mandatory.
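The manual does not spell out how the automatic box is derived. The following sketch shows one plausible way to obtain a box center and odd point counts from atomic coordinates, a grid spacing and a clearance; the function name enclosing_grid and the rounding rule are assumptions made for illustration, not the EASYMIFs implementation.

```python
import math


def enclosing_grid(coords, spacing=1.0, clearance=5.0):
    """Return (center, counts) for a box enclosing all coords.

    coords is a sequence of (x, y, z) tuples in Angstrom. Each count is
    rounded up to the next odd number so that a grid point sits on the center.
    """
    center, counts = [], []
    for axis in zip(*coords):  # iterate over the x, y and z columns
        low = min(axis) - clearance
        high = max(axis) + clearance
        center.append((low + high) / 2.0)
        n = math.ceil((high - low) / spacing) + 1
        if n % 2 == 0:
            n += 1
        counts.append(n)
    return tuple(center), tuple(counts)


if __name__ == "__main__":
    center, counts = enclosing_grid([(0.0, 0.0, 0.0), (10.0, 4.0, 3.0)])
    print("-c=%.2f,%.2f,%.2f" % center, "-n=%d,%d,%d" % counts)
```

The printed values have the form expected by the -c and -n options, so a box computed this way can be passed to EASYMIFs explicitly when a custom clearance is wanted.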
The probes that have been extensively tested are ’CMET’ (a methyl-carbon probe) and ’OP’ (oxygen of a phosphate group), but many more are available (such as hydroxyl oxygen, peptide nitrogen, metals, etc.). The complete list of probes can be found in the atom_types.txt file.
The following example contains a step-by-step description of an interaction energy map calculation
performed on the binding site of a D-allose binding protein (PDB code 1rpj). The .easymifs file can
be prepared by calling prepare_pdb.py, with the -k option to strip the PDB file of all the HETATM records:
prepare_pdb.py -f 1rpj.pdb -k
Afterwards, the actual interaction energy calculation step can be carried out as follows:
easyMIFs -f=1rpj.easymifs -p=OW -c=3.91,7.66,11.63 -n=30,30,30 -r=0.5
The command above will focus the calculations on the binding site (a cubic box of about 15 Å per side) and return an interaction energy map with a resolution of 0.5 Å.
This file contains the interaction energy map computed by EASYMIFs. The header of the file contains information
about the dimensions, the center and the resolution of the box, followed by the actual energy values, arranged in
the standard DX format (X slow, Y medium, Z fast):
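As an illustration of that ordering, the sketch below reads a DX map into a flat list and looks up the value at grid indices (i, j, k). It assumes the common OpenDX scalar layout written by programs such as APBS (a "gridpositions counts" line, an "origin" line, three "delta" lines and a "data follows" line); the exact header written by EASYMIFs may differ, and the function names are illustrative.

```python
def read_dx(lines):
    """Parse an OpenDX scalar map from an iterable of text lines."""
    counts = origin = None
    deltas, values = [], []
    reading = False
    for line in lines:
        tokens = line.split()
        if not tokens or tokens[0].startswith("#"):
            continue  # skip blank lines and comments
        if reading:
            try:
                values.extend([float(t) for t in tokens])
            except ValueError:
                reading = False  # reached the trailing attribute lines
            continue
        if tokens[0] == "object" and "gridpositions" in tokens:
            counts = tuple(int(t) for t in tokens[-3:])
        elif tokens[0] == "origin":
            origin = tuple(float(t) for t in tokens[1:4])
        elif tokens[0] == "delta":
            deltas.append(tuple(float(t) for t in tokens[1:4]))
        elif "data" in tokens and "follows" in tokens:
            reading = True
    return counts, origin, deltas, values


def grid_value(values, counts, i, j, k):
    """Value at indices (i, j, k): X is the slowest index, Z the fastest."""
    _, ny, nz = counts
    return values[(i * ny + j) * nz + k]
```

A file can be parsed with read_dx(open("1rpj_OW.dx")), where the file name is only an example. The Cartesian position of a point follows from the origin plus the index times the diagonal entry of the corresponding delta line.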
EASYMIFs produces Interaction Energy Maps in the ’dx’ format, which can be conveniently visualized in PyMOL, Chimera, VMD and other molecular graphics packages. The dx file is usually displayed as a contour plot, showing regions of space where the energy value is within a specified range. Figure 3.1 shows the example discussed above. EASYMIFs has been used to calculate an interaction energy map between the protein (in the binding site region) and a hydroxyl probe, shown in gold in the figure. The box around the binding site illustrates the boundaries of the box used in the calculations.
To load a .dx file in Chimera, go to Tools ⇒ Volume Data ⇒ Volume Viewer. A window with many options for manipulating .dx files will appear. A particularly convenient tool is a slide control that allows for easy contouring of the interaction energy map. For more information about displaying .dx files in Chimera, please consult the Chimera documentation.
While EASYMIFs and SITEHOUND are commonly used in combination for binding site identification through the
auto.py script, SITEHOUND can also be run separately to process the output from other MIF or Affinity Map
calculation programs, such as GRID and the Autogrid tool of the AutoDock software package. See
Appendix C for instructions on how to calculate Affinity Maps with Autogrid. This chapter describes how to
run SITEHOUND directly (i.e. independently of auto.py); and contains a more detailed description of the
SITEHOUND output files, and how to visualize them. Some details on the SITEHOUND methodology can be found
in Section 5.2.
SiteHound is provided as a binary file for a variety of platforms (Windows, Mac-Universal, Linux) and it runs
from the command-line. The complete description of the parameters is provided below:
sitehound -f=MAP.C.map -t=autogrid -e=-0.3 -l=average -s=7.8
Typical combinations of parameters (derived from repeated runs on a large set of different protein-ligand complexes) are:
-e=-0.3 -l=average -s=7.8 for small molecules
-e=-0.4 -l=single -s=1.1 for peptides or elongated small molecules
With maps computed with EASYMIFs, a typical cutoff value for the energy (-e option) is -8.9 for the CMET probe and -8.5 for the OP (phosphate) probe. Note that the PDB file used to produce the map should be present in the same directory as the map file, since it is used to determine which residues are in contact with the clusters.
The following example contains a step-by-step description of a SITEHOUND run on a dihydrofolate reductase
(PDB code 1s3v). The interaction energy map has been computed with EASYMIFs. Please refer to section 3.1.1
for an example. The following command carries out the actual cluster analysis on the interaction energy map:
sitehound -f=1s3v_CMET.dx -t=easymifs -l=average -e=-8.9 -s=7.8
and yields a set of files whose content is described below.
SiteHound generates different files that can be visually inspected or loaded into statistical packages (such as R)
for further analysis.
This file is used to store the points that have passed the energy filter and is arranged in the following way:
The first column contains a unique identifier for the point, the following three columns specify the cartesian coordinates of the point and the last column contains the interaction energy value at that particular point.
This file contains a summary for all the clusters computed by the program and is organized like this:
where the first column indicates the cluster index, the second column contains the TIE of that particular cluster, the third column specifies the total number of points that belong to the cluster and the last three columns contain the location of the Center of Energy of the cluster (which is the average of the coordinates of the points that belong to the cluster, weighted by interaction energy).
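The quantities in this summary can be reproduced from the points of a cluster. The sketch below computes the TIE, the number of points and the Center of Energy as defined above; the function name and the (x, y, z, energy) tuple layout are assumptions made for illustration.

```python
def summarize_cluster(points):
    """Return (TIE, number of points, center of energy) for one cluster.

    points is a list of (x, y, z, energy) tuples; the energies are the
    negative values that passed the energy filter.
    """
    tie = sum(p[3] for p in points)  # Total Interaction Energy
    # Energy-weighted average of each coordinate.
    center = tuple(sum(p[axis] * p[3] for p in points) / tie for axis in range(3))
    return tie, len(points), center


if __name__ == "__main__":
    cluster = [(0.0, 0.0, 0.0, -1.0), (2.0, 0.0, 0.0, -3.0)]
    print(summarize_cluster(cluster))
```

Because the weights are the energies themselves, the Center of Energy is pulled towards the most favorable points of the cluster rather than sitting at its geometric center.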
This file contains a detailed description of the points contained in all the clusters. An example is reported below:
where the first column refers to the index of the cluster the point belongs to, the second column reports the TIE of the cluster, the third column contains the energy of the point, the following three columns contain the cartesian coordinates of the point and the final column reports the unique index associated with the point.
This file lists the residues that are in contact with the clusters and that, therefore, have the potential to be
involved in binding. A residue is arbitrarily defined to be in contact with a cluster if it has at least one atom
within 4.0 Angstrom of a point of the cluster. Below is a typical example:
The first column specifies which cluster the residues are in contact with, followed by a list of residues, arranged by residue number and chain.
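The contact criterion described above can be sketched in a few lines. The atom tuple layout and the function name are hypothetical, and treating a distance of exactly 4.0 Å as a contact is an assumption, since the manual does not say how the boundary case is handled.

```python
import math


def residues_in_contact(atoms, cluster_points, cutoff=4.0):
    """List residues with at least one atom within cutoff of a cluster point.

    atoms is a sequence of (resname, resnum, chain, (x, y, z)) tuples and
    cluster_points a sequence of (x, y, z) tuples, all in Angstrom.
    """
    contacts = set()
    for resname, resnum, chain, xyz in atoms:
        if any(math.dist(xyz, point) <= cutoff for point in cluster_points):
            contacts.add((resnum, chain, resname))
    return sorted(contacts)  # ordered by residue number, then chain


if __name__ == "__main__":
    atoms = [
        ("ASP", 27, "A", (0.0, 0.0, 0.0)),
        ("ASP", 27, "A", (1.0, 0.0, 0.0)),
        ("LEU", 28, "A", (10.0, 0.0, 0.0)),
    ]
    print(residues_in_contact(atoms, [(3.0, 0.0, 0.0)]))
```

The double loop is adequate for a single cluster; for whole-protein scans a spatial index would reduce the number of distance evaluations.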
This file contains information about the clusters in the standard DX format (a format also used by the well-known program APBS, which computes electrostatic potentials). Most visualization programs are able to handle this format. Figure 4.1 shows a snapshot of the protein 1s3v displayed in Chimera together with a .dx file containing the information about the clusters. Please refer to section 4.1.1 to learn how this example was generated and to section 3.2.1 for more information about DX files.
Another option to visualize the results of the calculations carried out by SITEHOUND is to use the ’_clusters.pdb’
file, which can be loaded in any molecular viewer. The clusters have residue name ’C’ followed by their ranking
number (for example the first cluster has residue name ’C01’), and the chain identifier associated with the clusters is
the first available letter or number not already used by the structure on which the calculation was run. A few sample
lines are shown below:
SITEHOUND output can be displayed in most molecular modeling software packages, such as PyMol, Chimera and VMD.
Both PDB and DX files can be used. The example shown in figure 4.1 (taken from section 4.1.1) is rendered
using a DX file and the visualization tools that Chimera offers for handling this file type. Please refer to section
3.3 for more information about visualization.
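When the ’_clusters.pdb’ file is used and the viewer needs ATOM rather than HETATM records (for example to draw the clusters as surfaces in PyMOL), the records can be converted with a short script. The sketch below assumes the standard fixed-column PDB layout, with the residue name in columns 18 to 20, and only touches residues named C followed by digits; the function names are illustrative.

```python
def is_cluster_record(line):
    """True for HETATM lines whose residue name looks like C01, C02, ..."""
    name = line[17:20].strip()
    return line.startswith("HETATM") and name[:1] == "C" and name[1:].isdigit()


def hetatm_to_atom(pdb_lines):
    """Return the lines with cluster HETATM records rewritten as ATOM records."""
    converted = []
    for line in pdb_lines:
        if is_cluster_record(line):
            # Both record names are padded to six columns, so the
            # rest of the line keeps its column positions.
            line = "ATOM  " + line[6:]
        converted.append(line)
    return converted
```

Typical use would be to read the file with open(...).readlines(), pass the lines through hetatm_to_atom and write the result to a new file, leaving the original output untouched.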
EASYMIFs computes the potential energy between a chemical probe (represented by a particular atom type) and
the protein on a regularly spaced grid, using the following equation:
where the potential energy calculated for a probe at a point i in the grid is equal to the sum of a Lennard-Jones term and an electrostatics term over all the atoms of the protein. rij represents the distance between the probe at point i in the grid and an atom j of the protein. The Lennard-Jones and the electrostatics terms are expressed by the following two equations:
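The equations themselves are not reproduced in this copy of the manual. A form consistent with the description above and with the usual GROMOS conventions is sketched below as a reconstruction, not as the authors' exact typesetting. The symbols q_i (probe charge), q_j (atom charge), f (electrostatic prefactor) and ε(r) (distance-dependent dielectric) are notation introduced here.

```latex
% Total energy of the probe at grid point i, summed over protein atoms j
V_i = \sum_{j} \left[ V_{LJ}(r_{ij}) + V_{E}(r_{ij}) \right]

% Lennard-Jones term in the GROMOS C12/C6 form
V_{LJ}(r_{ij}) = \frac{C^{(12)}_{ij}}{r_{ij}^{12}} - \frac{C^{(6)}_{ij}}{r_{ij}^{6}}

% Electrostatic term with a distance-dependent dielectric
V_{E}(r_{ij}) = f \, \frac{q_i \, q_j}{\varepsilon(r_{ij}) \, r_{ij}}
```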
The C(12) and C(6) parameters in the Lennard-Jones term depend on the chosen probe and the particular atom type and are taken from a matrix of LJ-parameters distributed with the GROMACS package. The electrostatic prefactor has been set to 138.935485, which is the electric conversion factor 1/(4πε0) in the GROMACS unit system (kJ mol⁻¹ nm e⁻²). The distance-dependent dielectric sigmoidal function has been taken from Solmajer and Mehler and has the following form:
ε(r) = A + B / (1 + k exp(−λBr))
where A = 6.02944; B = ε0 − A; ε0 = 78.4; λ = 0.018733345; k = 213.5782. When the distance between the probe and an atom becomes less than 1.32 Å, a dielectric constant of 8 is used. The parameters reported above for the distance-dependent dielectric have been taken from Cui et al.
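The behaviour of the distance-dependent dielectric is easy to check numerically. The sketch below assumes the sigmoidal form ε(r) = A + B / (1 + k exp(−λBr)) with B = ε0 − A and r in Å; this reading is an assumption, supported by the fact that with the parameters quoted above the function evaluates to about 8 at r = 1.32 Å, where the constant value takes over. The function name is illustrative.

```python
import math

# Parameters quoted in the text (attributed there to Cui et al.).
A = 6.02944
EPS0 = 78.4
B = EPS0 - A  # assumed reading of the relation between A, B and eps0
LAMBDA = 0.018733345
K = 213.5782
SHORT_RANGE_CUTOFF = 1.32  # Angstrom
SHORT_RANGE_DIELECTRIC = 8.0


def sigmoidal_dielectric(r):
    """Distance-dependent dielectric for a probe-atom distance r in Angstrom."""
    if r < SHORT_RANGE_CUTOFF:
        return SHORT_RANGE_DIELECTRIC
    return A + B / (1.0 + K * math.exp(-LAMBDA * B * r))


if __name__ == "__main__":
    for r in (1.0, 1.32, 3.0, 6.0, 12.0):
        print("%5.2f  %6.2f" % (r, sigmoidal_dielectric(r)))
```

The function rises smoothly from about 8 at short range towards the bulk value of 78.4 at large separations, so close contacts are screened far less than distant ones.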
The main idea implemented in SITEHOUND is to group the points of the interaction energy map that have passed
the energy filter into clusters and to rank them by TIE. It is important to understand the options related to the
clustering step in order to effectively use the program. The principles of clustering algorithms and the relevant
parameters used by SITEHOUND are discussed here.
The fundamental goal of a clustering algorithm can be considered as finding a partition of a set of points,
defined in a multidimensional space, according to some optimality criterion (usually, one seeks to minimize
intra-cluster distances and maximize inter-cluster distances). It is worth pointing out that the general
problem is NP-hard, and that an exhaustive search would have to evaluate all the possible partitions of the points,
whose number grows faster than exponentially with the number of points. In practice, one
can resort to heuristics that make the problem amenable to computation and yield satisfactory results.
More formally, given:
as a set of m points belonging to an n dimensional space, we can define the following two quantities:
that represent the distance between two points x1 and x2 and the distance between two clusters R and S,
respectively. A natural choice for Dp in our problem is the simple Euclidean distance between the points.
One of the most widely used heuristics to approach the clustering problem is to proceed from the bottom
to the top by iteratively merging clusters until one cluster containing all the points is obtained. This is where the
Dc quantity plays a role, by defining the distance between clusters. The name linkage is commonly used to
indicate this quantity.
SITEHOUND incorporates two types of linkage, single and average, defined in the following way:
where the ∣∣ notation indicates the cardinality of the set (i.e. the number of points of the cluster).
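The definitions referred to in this section are not reproduced in this copy of the manual. The standard textbook forms, which match the verbal descriptions given here, are sketched below; the subscripted names D_single and D_average are notation introduced for this sketch.

```latex
% Set of m points in an n-dimensional space
X = \{ x_1, x_2, \ldots, x_m \}, \qquad x_i \in \mathbb{R}^n

% Distance between two points (Euclidean)
D_p(x_1, x_2) = \lVert x_1 - x_2 \rVert_2

% Single linkage: distance between the closest pair of points
D_{single}(R, S) = \min_{x_i \in R,\; x_j \in S} D_p(x_i, x_j)

% Average linkage: mean distance over all pairs of points
D_{average}(R, S) = \frac{1}{|R|\,|S|} \sum_{x_i \in R} \sum_{x_j \in S} D_p(x_i, x_j)
```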
An important property shared by these two linkages is that the distance between the clusters being merged increases monotonically at each step. Therefore, it is possible to cut the partition at a particular level, obtaining the corresponding clusters. In SITEHOUND this level is called the spatial cutoff. The type of linkage used affects (to some extent) the shape of the clusters obtained. In general, it can be shown that single linkage tends to yield more elongated clusters, whereas with average linkage the shape of the clusters is closer to a sphere. From a practical point of view, using single linkage can be more meaningful with peptide binding sites or elongated ligands, whereas average linkage performs better with small chemicals. These effects are illustrated in Figure 5.1. In general, it is desirable to run the calculations with both types of linkage, and compare the results. In some instances, with average linkage the binding site is split in two regions, whereas single linkage will tend to show one single site. This information could be valuable in the context of ligand design, since the two regions that show up with average linkage could both be exploited by connecting two fragments with a linker.
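The effect of the linkage type can be reproduced with a small, deliberately naive implementation of bottom-up clustering. The sketch below is not the SITEHOUND code: the function names are illustrative, the stopping rule (merge while the closest pair of clusters is within the spatial cutoff) is an assumption, and the search over all cluster pairs at every step is only practical for small point sets.

```python
import math
from itertools import combinations


def linkage_distance(a, b, linkage):
    """Single (closest pair) or average (mean over all pairs) linkage."""
    dists = [math.dist(p, q) for p in a for q in b]
    return min(dists) if linkage == "single" else sum(dists) / len(dists)


def cluster_points(points, cutoff, linkage="average"):
    """Merge the closest clusters until their distance exceeds the cutoff."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        dist, i, j = min(
            (linkage_distance(clusters[i], clusters[j], linkage), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if dist > cutoff:
            break
        clusters[i].extend(clusters.pop(j))  # i < j, so index i stays valid
    return clusters


def rank_by_tie(clusters, energy):
    """Sort clusters by Total Interaction Energy, most negative first."""
    return sorted(clusters, key=lambda c: sum(energy[p] for p in c))


if __name__ == "__main__":
    row = [(float(x), 0.0, 0.0) for x in (0, 1, 2, 3, 10, 11)]
    print(len(cluster_points(row, 1.1, "single")))   # elongated row kept whole
    print(len(cluster_points(row, 1.1, "average")))  # same row split up
```

On a row of equally spaced points, single linkage chains neighbours into one elongated cluster, while average linkage with the same cutoff stops earlier and splits the row, which mirrors the behaviour described above.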
G. M. Morris, D. S. Goodsell, R. S. Halliday, R. Huey, W. E. Hart, R. K. Belew, and A. J. Olson. Automated docking using a Lamarckian genetic algorithm and an empirical binding free energy function. Journal of Computational Chemistry, 19(14):1639–1662, 1998.
A streamlined web-based interface to carry out binding site detection using SITEHOUND is available at http://sitehound.sanchezlab.org. The interface (Figure B.1) can be used to upload a PDB structure, automatically perform the binding site detection and visualize the results of the calculations on a ribbon representation of the protein. The residues potentially involved in binding are also reported on a per-cluster basis, together with a summary of the main features of the clusters. From the results page (Figure B.1) the user can also download all the files that are produced by SITEHOUND and that are described in detail in Chapter 4. Furthermore, it is possible to download the ‘.map’ file produced by EASYMIFs or Autogrid, which can be used by SITEHOUND to carry out the binding site detection with combinations of parameters different from the default parameters used by the web server. SITEHOUND-web only allows for the processing of relatively small systems with default parameters. Larger systems, different parameters, and the processing of large numbers of files require the use of the standalone EASYMIFs and SITEHOUND programs described in this manual. For details on SITEHOUND-web see .
SITEHOUND can also use interaction energy maps produced by the Autogrid program, which can be downloaded from http://autodock.scripps.edu/resources/adt/index_html. An installation of Autodock Tools is also recommended, since the package comes with a script for generating PDBQT files (the format used by Autogrid) starting from a PDB.
The first step in binding site detection requires the calculation of an interaction energy map with a carbon probe
using the Autogrid program. It is recommended to remove from the PDB water molecules or other HETATM
records (heteroatoms such as ligands). Again, the Web Server interface to SITEHOUND carries out this step
automatically. Autogrid uses the PDBQT file format (an enhanced PDB format), that can be easily obtained
from a PDB file by using the prepare_receptor4.py script that comes with the Autodock Tools package. An
example of a typical usage is shown below:
prepare_receptor4.py -A hydrogens -r PDB
where PDB is the name of the PDB file that has to be processed. The ‘-A hydrogens’ option forces the addition of polar hydrogens to the protein. If the protein is already protonated, omit this option.
Once the pdbqt file has been successfully produced, it is necessary to create a gpf file that is used by the
Autogrid program to calculate the interaction energy map.
A convenient script to automate this step is available at ... and can be used in the following way:
create_gpf.py -r PDBQT -t TEMPLATE -c 5.0 -s 1.0
where PDBQT stands for the file generated in the previous step, TEMPLATE is a .gpf file that comes with the script and contains a set of standard parameters for autogrid, and the options ‘-c’ and ‘-s’ specify the clearance of the box and the resolution of the grid in Angstroms, respectively. The script calculates the center of the protein and uses it to center the grid. The size of the box that encloses the protein is determined on the basis of the protein dimensions and the clearance given with the ‘-c’ option.
The values reported above are typical, but the user can explore other values, bearing in mind that the higher the resolution the larger the computational requirements in terms of time and space.
This step yields the .C.map file that is the input to SITEHOUND. In order to obtain this interaction energy map,
just type the following:
autogrid -p GPF -l GLG
where GPF is the .gpf file produced in the previous step and GLG is the name of the file where autogrid stores a log of the calculation (any name will do). At the end, a .C.map file containing the interaction energy map will be generated.
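The three steps of this appendix can be collected in a small script. The sketch below only assembles the command lines given above; it assumes that the three programs are on the PATH, and every file name is passed in explicitly because the names written by prepare_receptor4.py and create_gpf.py are not stated in this manual. The function name autogrid_commands is illustrative.

```python
import subprocess


def autogrid_commands(pdb, pdbqt, template, gpf, glg,
                      clearance=5.0, spacing=1.0, add_hydrogens=True):
    """Return the three command lines of the Autogrid workflow, in order."""
    prepare = ["prepare_receptor4.py"]
    if add_hydrogens:
        prepare += ["-A", "hydrogens"]  # omit for already protonated proteins
    prepare += ["-r", pdb]
    create = ["create_gpf.py", "-r", pdbqt, "-t", template,
              "-c", str(clearance), "-s", str(spacing)]
    autogrid = ["autogrid", "-p", gpf, "-l", glg]
    return [prepare, create, autogrid]


def run_autogrid_workflow(*args, **kwargs):
    """Execute the steps in sequence; stops at the first failing command."""
    for command in autogrid_commands(*args, **kwargs):
        subprocess.run(command, check=True)
```

Keeping the command construction separate from the execution makes it possible to print and inspect the commands before anything is run.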