Selection of biologically relevant binding sites
The procedure below is summarised in the following
flow chart.
Datasets of known binding sites in proteins are generally constructed
from 3D structures by selecting the sets of protein residues which interact
with ligands present in the structure.
We start by selecting the subset of PDB (2) structures containing at least
one protein chain. Correct quaternary structures suggested by PISA () were
used to account for the fact that the quaternary arrangement of protein
chains can influence the definition of the binding sites. We originally used
PQS (4) for definition of quaternary structures but that service was
discontinued in August 2009.
A common problem when automatically constructing datasets of known
binding sites for small ligands is that it is not straightforward to
distinguish biologically relevant ligands from small molecules which appear
in the structure because they are present at high concentrations in the
solutions used to obtain the 3D structure.
The issue is complicated by the fact that these small molecules from
experimental solvents often bind in biologically important sites, mimicking
biologically important ligands. Because of this, simply ignoring any
molecular species which could be an experimental solvent molecule results in
missing huge quantities of potentially important data.
Here, we have developed a fully automated approach to filter out
biologically irrelevant ligands. Our approach is based on the assumption
that biologically important ligands should in general bind more specifically
to proteins, and therefore make more inter-atomic contacts with the
latter.
Clustering of nearby small molecules
In PDB files, the coordinates of small molecules are stored in the HETATM
records. These small molecules are commonly referred to as HETATM or HET
groups/molecules.
Different HET groups frequently cluster together in a binding site, either
because they are all important for the biological function of the protein
(e.g. magnesium ions and nucleotides in the nucleotide-binding sites of many
P-loop containing nucleotide hydrolases, as in [PDB:1jed]), or because
several small molecules located close to one another can mimic a
biologically important ligand (e.g. sulfate ions which mimic the phosphates
of nucleotides, as in [PDB:1odi]).
To account for this phenomenon, HET groups with heavy atoms (i.e.
non-hydrogen atoms) located within 4.0 Å of one another are clustered
together, in all PDB structures used. We refer to such a group of nearby HET
molecules as a "ligand".
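The clustering rule above amounts to single-linkage grouping: two HET groups end up in the same ligand if they are connected by a chain of groups whose heavy atoms lie within 4.0 Å of each other. The following is a minimal sketch of that idea, not the code used for the dataset. The function name `cluster_het_groups`, the input layout (a dictionary of heavy-atom coordinates per HET group) and the toy coordinates are all assumptions made for illustration.

```python
from itertools import combinations
from math import dist

CUTOFF = 4.0  # Å, heavy-atom distance threshold quoted in the text


def cluster_het_groups(het_groups, cutoff=CUTOFF):
    """Group HET groups whose heavy atoms come within `cutoff` Å of each other.

    `het_groups` maps a HET group id to a list of (x, y, z) heavy-atom
    coordinates (hypothetical layout). Returns a list of sets of ids,
    one set per "ligand".
    """
    parent = {gid: gid for gid in het_groups}

    def find(gid):
        # Follow parent links to the cluster representative (union-find).
        while parent[gid] != gid:
            parent[gid] = parent[parent[gid]]
            gid = parent[gid]
        return gid

    for a, b in combinations(het_groups, 2):
        # Two groups are linked if any pair of their heavy atoms is close enough.
        if any(dist(p, q) <= cutoff for p in het_groups[a] for q in het_groups[b]):
            parent[find(a)] = find(b)

    clusters = {}
    for gid in het_groups:
        clusters.setdefault(find(gid), set()).add(gid)
    return list(clusters.values())


# Toy coordinates, invented for the example (not taken from any PDB entry).
example = {
    "ADP": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    "MG": [(4.5, 0.0, 0.0)],    # 3.0 Å from the second ADP atom
    "SO4": [(30.0, 0.0, 0.0)],  # far from everything else
}
ligands = cluster_het_groups(example)
```

Here `ligands` contains two clusters, one holding ADP and MG and one holding SO4 alone. The all-pairs comparison is quadratic in the number of atoms, which is acceptable for a sketch; a spatial grid or k-d tree would be the usual choice for a whole-PDB run.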
Filtering out small ligands
We filter out all ligands (as defined in the previous paragraph) with
fewer than 10 heavy atoms. Most biologically irrelevant solvent molecules
have fewer than 10 heavy atoms, and excluding these helps to ensure that the
final dataset exclusively contains biologically relevant ligands.
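As an illustration of this size filter, the sketch below counts heavy atoms over a whole clustered ligand, so that several small HET groups clustered together are judged by their combined size. The data layout (a ligand as a list of HET groups, each a list of element symbols) and the function names are assumptions, as is the choice to treat deuterium ("D") like hydrogen.

```python
MIN_HEAVY_ATOMS = 10  # threshold quoted in the text

# Element symbols not counted as heavy atoms. Treating deuterium as
# hydrogen is an assumption of this sketch.
LIGHT_ELEMENTS = {"H", "D"}


def heavy_atom_count(ligand):
    """Count non-hydrogen atoms over all HET groups of a clustered ligand.

    `ligand` is a list of HET groups, each a list of element symbols
    (hypothetical layout).
    """
    return sum(
        1
        for group in ligand
        for element in group
        if element.upper() not in LIGHT_ELEMENTS
    )


def keep_ligand(ligand, min_heavy_atoms=MIN_HEAVY_ATOMS):
    """Return True if the ligand has at least `min_heavy_atoms` heavy atoms."""
    return heavy_atom_count(ligand) >= min_heavy_atoms


# A lone sulfate ion (5 heavy atoms) is discarded ...
sulfate = [["S", "O", "O", "O", "O"]]
# ... but two sulfates clustered together reach the threshold.
two_sulfates = [["S", "O", "O", "O", "O"], ["S", "O", "O", "O", "O"]]

print(keep_ligand(sulfate), keep_ligand(two_sulfates))
```

Because the count is taken per cluster rather than per HET group, small molecules that together mimic a larger ligand survive this step and are then judged on their contacts with the protein.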
Singling out biologically important interactions
We assume that biologically relevant ligands
make more inter-atomic contacts with proteins, and that irrelevant ligands
can be filtered out by imposing a cutoff number of inter-atomic
protein-ligand contacts.
The number of protein-ligand inter-atomic contacts was calculated using
the LPC software (5). In order to derive the cutoff number of inter-atomic
contacts to be used for that purpose, we manually verified the relevance of
the ligands in samples of binding sites for different ranges of numbers of
contacts. Due to the size of the dataset, manual investigation had to be
limited to random subsets of binding sites within each range of contact
numbers.
As a first step of this procedure, the dataset was divided into ranges of
50 inter-atomic contacts, as indicated in Table 1.
Manual analysis was then completed on 1% of the binding sites in each
subset, resulting in a total of 423 binding sites being manually checked at
this step. Table 1 shows the ranges of numbers of
contacts used to define the subsets, the number of binding sites in each
subset, the number of binding sites that were checked manually, and the
fraction of binding sites found to be biologically relevant within each
range of number of contacts. In the range 51 to 100 contacts, 93% of
binding sites are biologically relevant, almost double the fraction found in
the range 1 to 50 (54%). The fraction then slowly increases in higher
ranges.
Table 1.
Relation between fraction of biologically relevant interactions and number
of protein-ligand inter-atomic contacts (ranges of 50 inter-atomic contacts)
Number of contacts | Number of sites | Number checked | Fraction relevant
1 to 50            | 6825            | 68             | 0.54
51 to 100          | 15326           | 153            | 0.93
101 to 150         | 9786            | 98             | 0.94
151 to 200         | 6531            | 65             | 0.98
> 200              | 3932            | 39             | 0.98
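The binning and 1% sampling step can be sketched as a stratified random draw. This is only an illustration under stated assumptions: the function names, the input dictionary mapping a site id to its number of contacts, and the fixed random seed are all hypothetical, and the text does not say how the random subsets were actually drawn.

```python
import random
from collections import defaultdict

# Upper bounds of the closed contact ranges used in Table 1; anything above
# the last bound falls in the open "> 200" range.
UPPER_BOUNDS = [50, 100, 150, 200]


def contact_range(n_contacts):
    """Return the Table 1 range label for a number of contacts (>= 1)."""
    if n_contacts < 1:
        raise ValueError("a binding site needs at least one contact")
    low = 1
    for high in UPPER_BOUNDS:
        if n_contacts <= high:
            return f"{low} to {high}"
        low = high + 1
    return f"> {UPPER_BOUNDS[-1]}"


def sample_for_manual_check(contacts_by_site, fraction=0.01, seed=0):
    """Draw a random `fraction` of the sites in each contact range."""
    by_range = defaultdict(list)
    for site_id, n_contacts in contacts_by_site.items():
        by_range[contact_range(n_contacts)].append(site_id)

    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return {
        label: rng.sample(sorted(sites), round(len(sites) * fraction))
        for label, sites in by_range.items()
    }


# Toy input: 6825 made-up sites, all with 10 contacts.
toy_sites = {f"site{i}": 10 for i in range(6825)}
picked = sample_for_manual_check(toy_sites)
```

Rounding 1% of each subset to the nearest integer matches the "checked" counts reported in Table 1 (for example, 6825 sites give 68 sampled sites), which is why `round` is used here rather than truncation.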
We then investigated in more detail the subset of binding sites with
number of contacts between 31 and 80, dividing it into ranges of 10, to
identify a more precise cutoff value above which the fraction of relevant
interactions becomes larger than 90%. Table 2 shows the number of binding
sites in each range, the number of binding sites checked manually (again
1%), and the fraction of binding sites found to be relevant for each range.
The fraction of biologically relevant binding sites increases regularly with
the number of contacts between protein and ligand atoms, and reaches a
plateau value larger than 95% above 70 inter-atomic contacts.
Table 2.
Relation between fraction of biologically relevant interactions and number
of protein-ligand inter-atomic contacts (ranges of 10 inter-atomic contacts)
Number of contacts | Number of sites | Number checked | Fraction relevant
31 to 40           | 2145            | 21             | 0.10
41 to 50           | 2793            | 28             | 0.32
51 to 60           | 3287            | 33             | 0.64
61 to 70           | 3129            | 31             | 0.77
71 to 80           | 3293            | 33             | 0.97
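Reading a cutoff off Table 2 can be made explicit with a few lines of code. The sketch below takes the rows of Table 2 as given and looks for the lowest range from which every reported fraction exceeds a chosen threshold (90% here, the value mentioned in the text). The function name and the tuple layout are assumptions for illustration only.

```python
# Rows of Table 2: (low, high, number of sites, number checked, fraction relevant)
TABLE_2 = [
    (31, 40, 2145, 21, 0.10),
    (41, 50, 2793, 28, 0.32),
    (51, 60, 3287, 33, 0.64),
    (61, 70, 3129, 31, 0.77),
    (71, 80, 3293, 33, 0.97),
]


def lowest_reliable_bound(rows, threshold=0.90):
    """Return the lower bound of the first range from which all fractions
    exceed `threshold`, or None if no such range exists."""
    for i, row in enumerate(rows):
        if all(later[4] > threshold for later in rows[i:]):
            return row[0]
    return None


cutoff = lowest_reliable_bound(TABLE_2)
total_checked = sum(row[3] for row in TABLE_2)
print(cutoff, total_checked)  # prints "71 146"
```

With the Table 2 values this returns 71, consistent with the plateau observed above 70 inter-atomic contacts. Each fraction rests on only 21 to 33 manually checked sites, so the exact position of the cutoff carries some sampling uncertainty.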
Overall, about 600 binding sites were checked manually for their
biological significance. To our knowledge, no larger set of binding sites
has been manually checked to date.
Selecting pairs of unbound/bound structures
A dataset used to benchmark binding site prediction methods should
ideally consist of proteins with one unbound structure to apply the
prediction method, and at least one bound structure to derive the reference
definitions of known binding sites. This is necessary to account for the
fact that proteins can undergo structural changes upon binding, and that
consequently, applying a binding site prediction method to a bound structure
from which the ligand is deleted does not reproduce appropriately situations
where the binding site location is truly unknown.
We obtain the subset of unbound structures from the PDB by selecting
structures with at least one protein chain, no complex of different proteins
(i.e. no heteromers), no small ligands, and no nucleic acid chains.
We then remove redundancy from this subset using PISCES (6) with a sequence
identity cutoff value of 25%.
PDB entries consisting of Cα traces were excluded from this list, as well
as non-X-ray entries, and X-ray entries with a resolution larger than
2.4 Å or an R-value larger than 0.25. These quality criteria were imposed
on the unbound structures to ensure that the benchmark could be used for
validating binding site prediction methods which require high quality
structures.
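The quality criteria for unbound structures can be written as a single predicate. The sketch below is an illustration, not the filtering code used for the benchmark: the `Entry` record, its field names and the example entries are hypothetical, and rejecting entries with a missing resolution or R-value is an assumption the text does not state. The method string "X-RAY DIFFRACTION" follows the wording used in PDB experimental-method records.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    """Minimal per-structure metadata (hypothetical record layout)."""
    pdb_id: str
    method: str                 # e.g. "X-RAY DIFFRACTION", "SOLUTION NMR"
    resolution: Optional[float]  # Å
    r_value: Optional[float]
    ca_only: bool               # True if the entry is a Cα trace


def passes_quality(entry, max_resolution=2.4, max_r_value=0.25):
    """Apply the quality criteria quoted in the text to one entry."""
    if entry.ca_only or entry.method != "X-RAY DIFFRACTION":
        return False
    # Assumption: entries with missing values are rejected.
    if entry.resolution is None or entry.r_value is None:
        return False
    return entry.resolution <= max_resolution and entry.r_value <= max_r_value


# Made-up entries for illustration; the ids are not real PDB codes.
entries = [
    Entry("ent1", "X-RAY DIFFRACTION", 1.8, 0.19, False),
    Entry("ent2", "X-RAY DIFFRACTION", 2.8, 0.21, False),  # resolution too low
    Entry("ent3", "SOLUTION NMR", None, None, False),      # not X-ray
    Entry("ent4", "X-RAY DIFFRACTION", 2.0, 0.20, True),   # Cα trace only
]
kept = [e.pdb_id for e in entries if passes_quality(e)]
```

Only `ent1` passes in this toy list. Keeping the thresholds as parameters makes it easy to tighten or relax the criteria when the benchmark is regenerated.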
The final benchmark is then produced by matching each protein that has an
unbound structure to at least one bound structure containing a biologically
relevant ligand, derived as described above. Since the dataset relies on
simple numeric cutoffs, it can be updated automatically when new structures
are deposited in the PDB, and thus remain representative over time.