Chapter 1.  Getting started

Starting Point

To specify the starting point of your analysis, use the input form on the STRING start page (depicted above). You can enter your protein of interest by supplying its name or identifier. Alternatively, by clicking on the other tabs, you can search by amino acid sequence (in any format), by multiple names or by multiple sequences. There are also 3 example inputs and a random input generator, which selects a random protein with at least 4 predicted links at medium confidence or better.

The organism can be selected by clicking on the arrow or by typing its name directly into the corresponding input field (an autocompletion list will appear to help you). General names that group more than one organism (e.g. "mammals") can also be used.

Prediction Summary

If predicted associations for your protein are found, they are displayed in a summary view, located just below the view of the network. At the top of the summary your input is shown. If your input gene is a fusion of two functions, both will be shown. Predicted associations are shown immediately below your input, sorted by score. Clicking on the score bullets gives you a breakdown of the individual prediction method scores. Clicking on a gene name gives you the protein sequence as well as a list of similar proteins in STRING. Initially, only the 10 best-scoring predictions with medium (or better) confidence are shown. These settings can be changed by using the parameter dialog box at the bottom of the page.
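The default selection described above (keep predictions at medium confidence or better, sort by score, show the top 10) can be sketched as follows. This is only an illustration: the function name, the data layout and the use of fractional scores are assumptions, and the 50% value for "medium" is taken from the parameter description later in this chapter.

```python
# Hypothetical sketch, not STRING's own code: names and data layout are assumptions.
MEDIUM_CONFIDENCE = 0.50  # "medium confidence" is given as 50% in this chapter


def summarize(predictions, min_score=MEDIUM_CONFIDENCE, limit=10):
    """Return the best-scoring predictions at or above min_score.

    predictions: iterable of (partner_name, combined_score) pairs,
    with scores expressed as fractions between 0 and 1.
    """
    kept = [p for p in predictions if p[1] >= min_score]
    kept.sort(key=lambda p: p[1], reverse=True)  # highest score first
    return kept[:limit]


# Made-up example data, for illustration only.
example = [("geneA", 0.92), ("geneB", 0.35), ("geneC", 0.71), ("geneD", 0.50)]
for name, score in summarize(example):
    print(name, score)
```

Changing min_score or limit here corresponds to changing the confidence and "top 10" settings in the parameter dialog.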

Navigation buttons

The navigation buttons (depicted above) take you to different 'views' of the data, allowing you to see the different types of evidence that support the predicted associations.

Network View

The network view summarizes the network of predicted associations for a particular group of proteins. The network nodes are proteins. Hovering over a node displays its annotation; clicking on a node gives several details about the protein. The edges represent the predicted functional associations. An edge may be drawn with up to 7 differently colored lines - these lines represent the existence of the seven types of evidence used in predicting the associations. A red line indicates the presence of fusion evidence; a green line - neighborhood evidence; a blue line - co-occurrence evidence; a purple line - experimental evidence; a yellow line - text mining evidence; a light blue line - database evidence; a black line - coexpression evidence. Hovering over an edge displays the combined association score; clicking on it gives you the detailed evidence breakdown.
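For readers who process STRING images or exports themselves, the color legend above can be written down as a small lookup table. The sketch below is illustrative only: the key names and the rule "draw a line when the evidence score is greater than zero" are assumptions, not a description of how STRING draws edges internally.

```python
# Color legend taken from the text above; the key names are hypothetical.
EVIDENCE_COLORS = {
    "fusion": "red",
    "neighborhood": "green",
    "cooccurrence": "blue",
    "experimental": "purple",
    "textmining": "yellow",
    "database": "light blue",
    "coexpression": "black",
}


def edge_lines(evidence_scores):
    """Return the line colors for one edge, in legend order.

    evidence_scores: dict mapping evidence type to a score.
    Assumption: a line is drawn whenever the score is greater than zero.
    """
    return [
        color
        for kind, color in EVIDENCE_COLORS.items()
        if evidence_scores.get(kind, 0) > 0
    ]


print(edge_lines({"fusion": 0.3, "textmining": 0.6}))
```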

Conserved Neighborhood View

The neighborhood view shows runs of genes that occur repeatedly in close neighborhood in (prokaryotic) genomes. Genes located together in a run are linked with a black line (maximum allowed intergenic distance is 300 base pairs). Note that if there are multiple runs for a given species, these are separated by white space. If there are other genes in the run that are below the current score threshold, they are drawn as small white triangles. Gene fusion occurrences are also drawn, but only if they are present in a run (see also the Fusion section below for more details).
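The distance criterion for a run can be illustrated with a short sketch that groups genes of one genome into runs whenever the gap to the previous gene is at most 300 base pairs. The function name and the (name, start, end) layout are assumptions made for this example; the sketch ignores strand and does not check whether a run occurs repeatedly across genomes, which the neighborhood evidence itself requires.

```python
MAX_INTERGENIC_GAP = 300  # base pairs, as stated in the text


def split_into_runs(genes, max_gap=MAX_INTERGENIC_GAP):
    """Group genes into runs of close neighbors.

    genes: list of (name, start, end) tuples on a single sequence.
    A gene joins the current run if it starts at most max_gap
    base pairs after the furthest end seen in that run.
    """
    runs = []
    run_end = None
    for gene in sorted(genes, key=lambda g: g[1]):  # order by start position
        _, start, end = gene
        if runs and start - run_end <= max_gap:
            runs[-1].append(gene)
            run_end = max(run_end, end)
        else:
            runs.append([gene])  # gap too large: start a new run
            run_end = end
    return runs


# Made-up coordinates, for illustration only.
genes = [("a", 100, 400), ("b", 450, 900), ("c", 1500, 2000), ("d", 2300, 2600)]
for run in split_into_runs(genes):
    print([name for name, _, _ in run])
```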

Co-occurrence

The co-occurrence view shows the presence or absence of linked proteins across species. Proteins are listed across the top of the page and a phylogenetic tree with species names is listed down the left hand side. In the subsequent grid, the presence of the protein in a species is marked with a red square and absence with a white space. The intensity of the color of the red square reflects the amount of conservation of the homologous protein in the species.

Fusion View

The fusion view shows the individual gene fusion events per species. The species in which fusion occurs are listed to the left. Genes are colored according to the table at the bottom of the page. White genes are those which are fused but not directly linked to the input at the selected confidence level. Hovering above a region in a gene gives the gene name; clicking on a gene gives more detailed information.

Co-expression View

The coexpression view shows the genes that are co-expressed in the same or in other species (transferred by homology). Co-expression is shown by a red square: a more intense color of the square represents a higher association score of the expression data.

Experiments View

The experiments view shows a list of significant protein interaction datasets, gathered from other protein-protein interaction databases. The name of the database is shown in the grey header of the table; you can get more information on the group by clicking on the "info" link. Below the header, the organism is reported together with the proteins of the network that are present in this group.

Databases View

This view shows a list of significant protein interaction groups, gathered from curated databases. You can get more information on a group by clicking on the "info" link on the grey rows. Clicking the bubbles next to the gene names gives information on the individual proteins.

Text Mining View

The text mining view shows a list of significant protein interaction groups, extracted from the abstracts of scientific literature. The title and the abstract of the publication are displayed together with a link to the publication.

Info/Parameter Dialogs

The dialog box at the bottom explains briefly what is being shown and allows you to change parameters that influence the output. Each 'view' has a designated set of parameters.

The first parameters are the same for all views: your input identifier, your requested minimum confidence and an option to limit the output to the 10 best-scoring hits. The confidence score is the approximate probability that a predicted link exists between two enzymes in the same metabolic map in the KEGG database. Confidence limits are as follows: low confidence - 20% (or better), medium confidence - 50%, high confidence - 75%, highest confidence - 95%. Please note that parameters are only changed when you press the 'Update Parameters' button.
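The confidence limits listed above can be expressed as a simple lookup. The sketch below maps a score to the name of the highest confidence level it reaches; the function name and the use of fractional scores (0.20 for 20%) are assumptions made for this example.

```python
# Thresholds taken from the confidence limits listed in the text,
# ordered from strictest to most lenient.
CONFIDENCE_LEVELS = [
    ("highest", 0.95),
    ("high", 0.75),
    ("medium", 0.50),
    ("low", 0.20),
]


def confidence_label(score):
    """Return the highest confidence level reached by score, or None.

    score: combined score as a fraction between 0 and 1.
    """
    for label, threshold in CONFIDENCE_LEVELS:
        if score >= threshold:
            return label
    return None  # below the "low confidence" limit


for s in (0.96, 0.60, 0.10):
    print(s, confidence_label(s))
```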

The dialog box shown above is the one for the Network View. Network-specific parameters are: 'edge scaling factor' - this reduces the length of high-scoring edges so that the images will be drawn more compactly, while low-scoring hits will be spread out further. Lower values mean more compact images, higher values will cause more spread. The second parameter is the 'network depth'. A value higher than 1 means that the search for interactions is iterative - after the first round, all nodes found are themselves used as input for the next round of searches. Nodes of a higher iteration will be colored white. Please note that this can result in fairly large images that may take a while to compute and download. This feature allows you to 'walk' through the network of functional associations. Note that you can click on any node, and the subsequent page offers a link to use that node as the input - effectively placing it in the center of the image. Repeated use of this mechanism allows you to explore large regions of the network.
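The effect of the 'network depth' parameter can be pictured as a breadth-first expansion: each round takes the nodes found in the previous round and searches for their partners. The sketch below is a minimal illustration under assumed names; it takes a plain dictionary of partners (already filtered by confidence) and is not STRING's own search code.

```python
def expand_network(start, partners, depth=1):
    """Return a dict mapping each reached node to the round in which it was found.

    start:    the input node (round 0).
    partners: dict mapping a node to a list of associated nodes
              (assumed to be filtered by confidence already).
    depth:    number of search rounds, like the 'network depth' parameter.
    """
    reached = {start: 0}
    frontier = [start]
    for iteration in range(1, depth + 1):
        next_frontier = []
        for node in frontier:
            for partner in partners.get(node, []):
                if partner not in reached:
                    reached[partner] = iteration
                    next_frontier.append(partner)
        frontier = next_frontier  # nodes found now are the input for the next round
    return reached


# Made-up associations, for illustration only.
links = {"A": ["B", "C"], "B": ["D"], "C": [], "D": ["E"]}
print(expand_network("A", links, depth=1))
print(expand_network("A", links, depth=2))
```

In this picture, nodes found in round 2 or later are the ones the network view colors white.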

A note on the network drawing algorithm

STRING uses a spring model to generate the network images. Nodes are modeled as masses and edges as springs; the final position of the nodes in the image is computed by minimizing the 'energy' of the system. We give high confidence edges a higher 'spring strength' so that they will reach an optimal position before lower confidence edges. The user can also optionally reduce the 'natural length' of a high confidence edge - this pulls the connected nodes closer together and sometimes results in a clearer picture of high confidence interactions. We set the high confidence edge length to 80% of the normal length by default.
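To make the idea of a spring model concrete, here is a minimal sketch that minimizes a simple spring energy, the sum over edges of 0.5 * strength * (distance - natural length)^2, by plain gradient descent. It is not the algorithm STRING actually runs: the energy function, the choice of spring strength equal to the score, the 0.75 cutoff for "high confidence" and all names are assumptions made for illustration. Only the 80% natural length for high confidence edges is taken from the text.

```python
import math


def make_edge(a, b, score, base_length=1.0, high=0.75, shrink=0.8):
    """Build (a, b, strength, natural_length) for one association.

    Assumptions: strength equals the score, and edges scoring at least
    `high` get a natural length of `shrink` times the base length.
    """
    length = base_length * shrink if score >= high else base_length
    return (a, b, score, length)


def spring_energy(pos, edges):
    """Total energy: sum of 0.5 * strength * (distance - length)^2."""
    total = 0.0
    for a, b, strength, length in edges:
        d = math.hypot(pos[a][0] - pos[b][0], pos[a][1] - pos[b][1])
        total += 0.5 * strength * (d - length) ** 2
    return total


def relax(pos, edges, step=0.05, iterations=500):
    """Move nodes downhill on the energy by fixed-step gradient descent."""
    pos = {n: list(p) for n, p in pos.items()}  # work on a copy
    for _ in range(iterations):
        grad = {n: [0.0, 0.0] for n in pos}
        for a, b, strength, length in edges:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            d = math.hypot(dx, dy) or 1e-9  # avoid division by zero
            g = strength * (d - length) / d
            grad[a][0] += g * dx
            grad[a][1] += g * dy
            grad[b][0] -= g * dx
            grad[b][1] -= g * dy
        for n in pos:
            pos[n][0] -= step * grad[n][0]
            pos[n][1] -= step * grad[n][1]
    return pos


# Made-up starting layout and scores, for illustration only.
start = {"A": (0.0, 0.0), "B": (3.0, 0.0), "C": (0.0, 2.0)}
edges = [make_edge("A", "B", 0.9), make_edge("A", "C", 0.4)]
final = relax(start, edges)
print(spring_energy(start, edges), spring_energy(final, edges))
```

In this toy example the stronger A-B spring settles faster than the weaker A-C spring, and it ends up at the shorter natural length.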

This modeling has some important consequences that the user should be aware of. Firstly, the physical distance between two nodes along an edge in a graph has no meaning; indeed, an attempt to set the edge length based on score would probably result in an unsolvable set of equations! We try to ensure high confidence links are drawn close together through the setting of the modeling parameters described above. Secondly, although the algorithm is deterministic - the same input will produce the same output - the addition of, say, new nodes to the network can result in node locations in the new image changing completely. Finally, although the input node is the 'center' of the network in an abstract sense, it may not be located centrally in the network image.