University Lille North of France: LigASite database of binding sites

Selection of biologically relevant binding sites

The procedure below is summarised in the following flow chart.

Datasets of known binding sites in proteins are generally constructed from 3D structures by selecting the sets of protein residues which interact with ligands present in the structure.

We start by selecting the subset of PDB (2) structures containing at least one protein chain. The quaternary structures suggested by PISA are used, because the quaternary arrangement of protein chains can influence the definition of the binding sites. We originally used PQS (4) to define quaternary structures, but that service was discontinued in August 2009.

A common problem when automatically constructing datasets of known binding sites for small ligands is that it is not straightforward to distinguish biologically relevant ligands from small molecules which appear in the structure because they are present at high concentrations in the solutions used to obtain the 3D structure.

The issue is complicated by the fact that these small molecules from experimental solvents often bind in biologically important sites, mimicking biologically important ligands. Because of this, simply ignoring any molecular species which could be an experimental solvent molecule results in missing huge quantities of potentially important data.

Here, we have developed a fully automated approach to filter out biologically irrelevant ligands. Our approach is based on the assumption that biologically important ligands should in general bind more specifically to proteins, and therefore make more inter-atomic contacts with the latter.

Clustering of nearby small molecules

In PDB files, the coordinates of small molecules are stored in the HETATM records. These small molecules are commonly referred to as HETATM or HET groups/molecules.

Different HET groups frequently cluster together in a binding site, either because they are all important for the biological function of the protein (e.g. magnesium ions and nucleotides in the nucleotide-binding sites of many P-loop containing nucleotide hydrolases, as in [PDB:1jed]), or because several small molecules located close to one another can together mimic a biologically important ligand (e.g. sulfate ions which mimic the phosphates of nucleotides, as in [PDB:1odi]).

To account for this phenomenon, HET groups with heavy atoms (i.e. non-hydrogen atoms) located within 4.0 Å of one another are clustered together, in all PDB structures used. We refer to such a group of nearby HET molecules as a "ligand".
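The clustering step can be sketched as single-linkage grouping with a union-find structure. This is only an illustration under stated assumptions: the input layout (a dictionary mapping a group name to its heavy-atom coordinates), the function name `cluster_het_groups` and the example coordinates are hypothetical and are not taken from the LigASite pipeline.

```python
from itertools import combinations
from math import dist

CUTOFF = 4.0  # Å, the distance threshold stated in the text


def cluster_het_groups(het_groups, cutoff=CUTOFF):
    """Group HET molecules that have heavy atoms within `cutoff` of each other.

    `het_groups` maps a (hypothetical) group identifier to a list of
    (x, y, z) heavy-atom coordinates. Returns a list of sets of identifiers;
    each set is one "ligand" in the sense defined above.
    """
    parent = {name: name for name in het_groups}

    def find(name):
        # Walk up to the cluster representative, shortening the path on the way.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(het_groups, 2):
        # Two groups are linked if any pair of their heavy atoms is close enough.
        if any(dist(p, q) <= cutoff for p in het_groups[a] for q in het_groups[b]):
            parent[find(a)] = find(b)

    clusters = {}
    for name in het_groups:
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())


# Made-up coordinates, for illustration only.
example = {
    "ATP": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    "MG": [(4.0, 0.0, 0.0)],      # 2.5 Å from the second ATP atom
    "SO4": [(30.0, 30.0, 30.0)],  # far from everything else
}
ligands = cluster_het_groups(example)
```

With these coordinates, ATP and MG end up in one ligand and SO4 forms a ligand on its own. The all-pairs comparison is quadratic in the number of atoms; a spatial index would be preferable for large structures.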

Filtering out small ligands

We filter out all ligands (as defined in the previous paragraph) with fewer than 10 heavy atoms. Most biologically irrelevant solvent molecules have fewer than 10 heavy atoms, and excluding these helps to ensure that the final dataset contains only biologically relevant ligands.
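A minimal sketch of this size filter follows. It assumes a ligand is given as a list of element symbols, one per atom; the function names are hypothetical, and treating deuterium ("D") as hydrogen is an assumption made here, not something the text specifies.

```python
MIN_HEAVY_ATOMS = 10  # threshold stated in the text


def count_heavy_atoms(elements):
    """Count non-hydrogen atoms; deuterium is treated as hydrogen (assumption)."""
    return sum(1 for e in elements if e.upper() not in {"H", "D"})


def keep_ligand(elements, min_heavy=MIN_HEAVY_ATOMS):
    """Return True if the ligand has at least `min_heavy` heavy atoms."""
    return count_heavy_atoms(elements) >= min_heavy


# Heavy-atom compositions of two familiar molecules.
glycerol = ["C"] * 3 + ["O"] * 3 + ["H"] * 8          # C3H8O3: 6 heavy atoms
atp = ["C"] * 10 + ["N"] * 5 + ["O"] * 13 + ["P"] * 3  # 31 heavy atoms

kept = [name for name, atoms in {"glycerol": glycerol, "ATP": atp}.items()
        if keep_ligand(atoms)]
```

Here glycerol, a common cryoprotectant, is discarded while ATP is kept. Because the count is taken over the clustered ligand, several small HET groups that sit together can still pass the threshold jointly.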

Singling out biologically important interactions

We assume that biologically relevant ligands make more inter-atomic contacts with proteins, and that irrelevant ligands can be filtered out by imposing a cutoff number of inter-atomic protein-ligand contacts.

The number of protein-ligand inter-atomic contacts was calculated using the LPC software (5). To derive the cutoff number of inter-atomic contacts, we manually verified the relevance of the ligands in samples of binding sites for different ranges of numbers of contacts. Due to the size of the dataset, manual investigation had to be limited to random subsets of binding sites within each range of contact numbers.

As a first step, the dataset was divided into ranges of 50 inter-atomic contacts, as indicated in Table 1. Manual analysis was then carried out on 1% of the binding sites in each subset, for a total of 423 binding sites checked manually at this step. Table 1 shows the ranges of numbers of contacts used to define the subsets, the number of binding sites in each subset, the number of binding sites checked manually, and the fraction of binding sites found to be biologically relevant within each range. In the range 51 to 100 contacts, 93% of binding sites are biologically relevant, almost double the fraction observed in the range 1 to 50 (54%). The fraction then increases slowly in higher ranges.

Table 1. Relation between fraction of biologically relevant interactions and number of protein-ligand inter-atomic contacts (ranges of 50 inter-atomic contacts)
Nb of contacts    Nb of sites    Nb checked    F_relevant
1 to 50                  6825            68          0.54
51 to 100               15326           153          0.93
101 to 150               9786            98          0.94
151 to 200               6531            65          0.98
> 200                    3932            39          0.98
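The sampling arithmetic behind Table 1 can be checked with a short sketch. The site counts are copied from the table; rounding 1% to the nearest integer is an assumption that happens to reproduce the "Nb checked" column, and `draw_sample` with its fixed seed is a hypothetical helper, since the text does not describe how the random subsets were drawn.

```python
import random

# Number of binding sites per range of contacts, copied from Table 1.
TABLE_1 = {
    "1 to 50": 6825,
    "51 to 100": 15326,
    "101 to 150": 9786,
    "151 to 200": 6531,
    "> 200": 3932,
}


def sample_size(n_sites, fraction=0.01):
    """Number of sites to check manually (rounding is an assumption)."""
    return round(n_sites * fraction)


def draw_sample(site_ids, fraction=0.01, seed=0):
    """Draw a random subset of site identifiers without replacement."""
    rng = random.Random(seed)
    return rng.sample(site_ids, sample_size(len(site_ids), fraction))


sizes = {label: sample_size(n) for label, n in TABLE_1.items()}
total_checked = sum(sizes.values())
```

The per-range sizes come out as 68, 153, 98, 65 and 39, which sum to the 423 sites quoted in the text.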

We then investigated in more detail the subset of binding sites with between 31 and 80 contacts, dividing it into ranges of 10, to identify a more precise cutoff value above which the fraction of relevant interactions becomes larger than 90%. Table 2 shows the number of binding sites in each range, the number of binding sites checked manually (again 1%), and the fraction of binding sites found to be relevant for each range. The fraction of biologically relevant binding sites increases steadily with the number of contacts between protein and ligand atoms, and reaches a plateau value larger than 95% above 70 inter-atomic contacts.

Table 2. Relation between fraction of biologically relevant interactions and number of protein-ligand inter-atomic contacts (ranges of 10 inter-atomic contacts)
Nb of contacts    Nb of sites    Nb checked    F_relevant
31 to 40                 2145            21          0.10
41 to 50                 2793            28          0.32
51 to 60                 3287            33          0.64
61 to 70                 3129            31          0.77
71 to 80                 3293            33          0.97
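The way a cutoff can be read off Table 2 is sketched below: scan the ranges in increasing order and report the first one whose fraction of relevant sites exceeds 90%. The rows are copied from the table. The helper names are hypothetical, and the `min_contacts` default of 71 is only the lower bound of that range, used here for illustration; this page does not state the final cutoff as a single number.

```python
# (lower bound, upper bound, sites, checked, fraction relevant), from Table 2.
TABLE_2 = [
    (31, 40, 2145, 21, 0.10),
    (41, 50, 2793, 28, 0.32),
    (51, 60, 3287, 33, 0.64),
    (61, 70, 3129, 31, 0.77),
    (71, 80, 3293, 33, 0.97),
]


def first_range_above(rows, threshold=0.90):
    """Return (low, high) of the first range whose fraction exceeds threshold."""
    for low, high, _sites, _checked, fraction in rows:
        if fraction > threshold:
            return (low, high)
    return None


def passes_contact_filter(n_contacts, min_contacts=71):
    """Keep a ligand only if it makes at least `min_contacts` contacts.

    The default is an illustrative assumption, not a documented setting.
    """
    return n_contacts >= min_contacts


cutoff_range = first_range_above(TABLE_2)
checked_in_table_2 = sum(row[3] for row in TABLE_2)
```

With the tabulated values, `cutoff_range` is the 71 to 80 range, and 146 sites were checked at this second step.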

Overall, about 600 binding sites were checked manually for their biological significance (423 in the first step and 146 in the second). To our knowledge, this is more than for any set of binding sites investigated to date.

Selecting pairs of unbound/bound structures

A dataset used to benchmark binding site prediction methods should ideally consist of proteins with one unbound structure to apply the prediction method, and at least one bound structure to derive the reference definitions of known binding sites. This is necessary to account for the fact that proteins can undergo structural changes upon binding, and that consequently, applying a binding site prediction method to a bound structure from which the ligand is deleted does not reproduce appropriately situations where the binding site location is truly unknown.

We obtain the subset of unbound structures from the PDB by selecting structures with at least one protein chain, no complex of different proteins (i.e. no heteromers), no small ligands, and no nucleic acid chains. We then remove redundancy from this subset using PISCES (6) with a sequence identity cutoff value of 25%. PDB entries consisting of Cα traces were excluded from this list, as well as non-X-ray entries, and X-ray entries with a resolution value larger than 2.4 Å or an R-value larger than 0.25. These quality criteria were imposed on the unbound structures to ensure that the benchmark could be used for validating binding site prediction methods which require high quality structures.
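The composition and quality criteria for unbound structures can be expressed as a simple predicate. This is a sketch under stated assumptions: the `Entry` record, its field names and the example entries (with placeholder identifiers) are hypothetical, and the PISCES redundancy removal at 25% sequence identity is not reproduced here. The two sets of criteria are also merged into one test for brevity, whereas the text applies them as separate steps.

```python
from dataclasses import dataclass
from typing import Optional

MAX_RESOLUTION = 2.4  # Å, from the text
MAX_R_VALUE = 0.25    # from the text


@dataclass
class Entry:
    """Hypothetical summary of one PDB entry (field names are assumptions)."""
    entry_id: str
    method: str
    resolution: Optional[float]
    r_value: Optional[float]
    ca_only: bool
    n_protein_chains: int
    is_heteromer: bool
    has_small_ligand: bool
    has_nucleic_acid: bool


def is_unbound_candidate(e: Entry) -> bool:
    """Apply the composition and quality criteria described above."""
    composition_ok = (
        e.n_protein_chains >= 1
        and not e.is_heteromer
        and not e.has_small_ligand
        and not e.has_nucleic_acid
    )
    quality_ok = (
        e.method == "X-RAY DIFFRACTION"
        and not e.ca_only
        and e.resolution is not None and e.resolution <= MAX_RESOLUTION
        and e.r_value is not None and e.r_value <= MAX_R_VALUE
    )
    return composition_ok and quality_ok


# Placeholder entries, not real PDB identifiers.
entries = [
    Entry("ex01", "X-RAY DIFFRACTION", 1.8, 0.19, False, 1, False, False, False),
    Entry("ex02", "X-RAY DIFFRACTION", 2.9, 0.22, False, 1, False, False, False),
    Entry("ex03", "SOLUTION NMR", None, None, False, 1, False, False, False),
    Entry("ex04", "X-RAY DIFFRACTION", 2.0, 0.20, False, 2, True, False, False),
]
kept = [e.entry_id for e in entries if is_unbound_candidate(e)]
```

Only the first placeholder entry passes: the second fails on resolution, the third is not an X-ray structure, and the fourth is a heteromer.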

The final benchmark is then produced by matching all proteins with an unbound structure with at least one bound structure with a biologically relevant ligand, derived as described above. Since the dataset relies on simple numeric cutoffs, it can be updated automatically when new structures are deposited in the PDB, and thus remain representative over time.

April 2012
Interdisciplinary Research Institute, Computational Biology, Villeneuve d'Ascq, France
University College London, Biomolecular Structure and Modelling Unit, London, UK
Hospital for Sick Children and University of Toronto, Structural Biology and Biochemistry Program, Toronto, Canada