Chapter 4. FAQ

Frequently Asked Questions

Q: How are scores computed?
Q: How can I obtain the complete data set?
Q: I am interested in retrieving data for a few particular interactions for my script. How do I go about getting it?
Q: How can I save a certain network?
Q: For my latest manuscript, I would like to use a picture in svg-format produced by STRING. Must I ask for permission?
Q: How can I trace the origin of the different evidences for an interaction?
Q: How to cite STRING?
Q: Which databases does STRING extract experimental data from?
Q: From which databases does STRING extract curated data?
Q: How do I extract purely experimental data?
Q: I want to extract PPI for a given species, but only from experimental data and not data transferred from other species.
Q: I want to differentiate physical interactions from functional ones within STRING
Q: It is stated that STRING is locus-based and only a single translated protein per locus is stored. What does this mean?
Q: Does STRING contain any Gene Ontology information? We see that there is a table called funcats. What type of information does this contain?
Q: Is there any phenotype information contained in STRING? More specifically, is there any field that specifies a phenotype or disease and links it to protein networks?
Q: Does the database give a PubMed Reference ID for each interaction?
Q: Are there different types of sets besides protein networks and pathways? What is the difference between a "set" and a "collection"?
Q: Can I access STRING using GI numbers? If so, could you use the 90 kD heat shock protein (GI:306891) as an example to show what I should type as the protein name when using an NCBI GI number?
Q: Is there a legend or key for the different colored lines? (Is there a specific difference for each color?) (Is there a key for the colored lines in the evidence view?)
Q: I assume the arrows mean activation and the red perpendicular lines mean repression, but what do the circles at the end of the line represent?
Q: At each node, there are icons inside the protein spheres. Is there a key for these icons? Do the icons represent the different protein functions (DNA binding, enzyme, etc.)
Q: I want to download the data for a particular network that I have found while browsing the STRING web-interface
Q: I need all the interactions for a particular organism.
Q: How to extract high confidence (>0.7) interactions from information on "combined score" in "protein.links.v8.3.txt.gz"
Q: How to retrieve only the direct evidence in human, not transferred.
Q: In the file: "protein.links.v8.3.txt" are the scores multiplied by 1000?
Q: Are the colors assigned to nodes significant?
Q: Why are some nodes smaller and some nodes bigger?
Q: How do I map my proteins to STRING identifiers?
Q: Is there an automatic way of mapping proteins to STRING? I need mappings for more than three thousand proteins.
Q: I retrieve protein interactions from the STRING website via web API calls. What do the score columns mean (for example, nscore, fscore, tscore, etc.)?
Q: How do I select a reasonable score cut-off value for my analysis?
Q: How do I import several interactions from STRING into Cytoscape?
Q:

How are scores computed?

A:

The combined score is computed by combining the probabilities from the different evidence channels, correcting for the probability of randomly observing an interaction. For a more detailed description, please see von Mering et al., Nucleic Acids Res. 2005.
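
The combination step can be sketched in Python as follows. This is only an illustration, not STRING's actual code: the function name `combine_scores` is hypothetical, the prior of 0.041 is an assumption borrowed from later STRING documentation rather than a value stated here, and any additional per-channel corrections are left out.

```python
from typing import Iterable


def combine_scores(channel_scores: Iterable[float], prior: float = 0.041) -> float:
    """Combine per-channel probabilities into one score (illustrative sketch).

    `prior` is the assumed probability of observing an interaction at random.
    The default of 0.041 is an assumption, not a value given in this FAQ.
    """
    p_none = 1.0  # probability that no channel supports the interaction
    for score in channel_scores:
        # Remove the prior from each channel so it is not counted repeatedly.
        corrected = max(0.0, (score - prior) / (1.0 - prior))
        p_none *= 1.0 - corrected
    combined = 1.0 - p_none
    # Add the prior back once at the end.
    return combined + prior * (1.0 - combined)


# Two channels at 0.5 each give a combined score higher than either alone.
print(round(combine_scores([0.5, 0.5]), 3))
```

With a single channel the function returns that channel's score unchanged, which is a quick sanity check on the prior correction.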

Q:

How can I obtain the complete data set?

A:

STRING is available under license to both commercial and academic institutions. Academic users should sign and send the academic license agreement, wait for approval, and then download the SQL database.

Q:

I am interested in retrieving data for a few particular interactions for my script. How do I go about getting it?

A:

Use the API.

Q:

How can I save a certain network?

A:

In the Network Summary view, click "save". You can then save your network in a variety of formats: Bitmap Image, Scalable Vector Graphics, XML Summary (Proteomics Standards Initiative), Graph Layout, Protein sequences in FASTA format, and Text Summary of interaction scores.

Q:

For my latest manuscript, I would like to use a picture in svg-format produced by STRING. Must I ask for permission?

A:

Nope. But we would appreciate it if you cite us: Jensen et al. Nucleic Acids Res. 2009, 37(Database issue):D412-6

Q:

How can I trace the origin of the different evidences for an interaction?

A:

This information is available if you click on an edge of the graph in the network view.

Q:

How to cite STRING?

A:

Jensen et al. Nucleic Acids Res. 2009, 37(Database issue):D412-6

Q:

Which databases does STRING extract experimental data from?

A:

BIND, DIP, GRID, HPRD, IntAct, MINT, and PID.

Q:

From which databases does STRING extract curated data?

A:

Biocarta, BioCyc, GO, KEGG, and Reactome.

Q:

How do I extract purely experimental data?

A:

Uncheck all boxes except "Experiments" in the Info & Parameters box.

Q:

I want to extract PPI for a given species, but only from experimental data and not data transferred from other species.

A:

You need to sign the license agreement to download the file 'protein.links.full.v8.3.txt.gz'. Use this file to get the direct experimental evidence, for example by printing the columns for protein1, protein2, and experiments (i.e., columns 1, 2, and 10) and grepping for the 'species_id' (e.g., 9606 for human).

 
          zgrep ^"9606\." protein.links.full.v8.3.txt.gz | awk '($10 != 0) { print $1, $2, $10 }' > ~/direct_experimental_data_human.txt
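
For use inside a script, the same filter can be sketched in Python. The function name `direct_experimental` is hypothetical; the column positions (1, 2, and 10) and the species prefix follow the command above, and the file is assumed to be whitespace-separated.

```python
import gzip
from typing import Iterator, Tuple


def direct_experimental(path: str, species_id: str = "9606") -> Iterator[Tuple[str, str, int]]:
    """Yield (protein1, protein2, experiments score) for one species."""
    prefix = species_id + "."
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.split()
            # Skip short or header lines and proteins from other species.
            if len(fields) < 10 or not fields[0].startswith(prefix):
                continue
            experiments = int(fields[9])  # column 10 in awk's 1-based numbering
            if experiments != 0:
                yield fields[0], fields[1], experiments
```

Calling `list(direct_experimental("protein.links.full.v8.3.txt.gz"))` would collect the same rows that the shell command writes to the output file.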
            

Q:

I want to differentiate physical interactions from functional ones within STRING

A:

To get this data you have to use the database dump. You can use the network.actions table to identify direct physical interactions from the "mode" column. If the mode is "binding", you can be sure there is a physical interaction. Any other mode may be either physical or functional.

 
          SELECT * FROM network.actions WHERE mode='binding';
            

Q:

It is stated that STRING is locus-based and only a single translated protein per locus is stored. What does this mean?

A:

STRING uses one protein per gene. If there is more than one isoform per gene, we usually select the longest isoform, unless we have information to suggest that another isoform is better annotated (e.g., proteins in the CCDS database).

Q:

Does STRING contain any Gene Ontology information? We see that there is a table called funcats. What type of information does this contain?

A:

The "funcats" table contains the functional categories as defined for the COG database. We import the GO complexes and use these for inferring interactions. GO terms themselves are planned for a future version.

Q:

Is there any phenotype information contained in STRING? More specifically, is there any field that specifies a phenotype or disease and links it to protein networks?

A:

Not directly, but searching for "wing" in Drosophila will return genes that have been annotated or described as such, each of which is associated with a network.

Q:

Does the database give a PubMed Reference ID for each interaction?

A:

Interactions that have only predicted evidence do not have a PMID. Text-mining evidence may also stem from other sources, such as OMIM. Apart from these cases, interactions come with at least one PubMed reference ID. Some interactions have several different PMIDs and others share the same PMID (e.g., for external repositories, the interactions carry the PMID of the publication describing the database).

Q:

Are there different types of sets besides protein networks and pathways? What is the difference between a "set" and a "collection"?

A:

The different types of sets are networks, pathways, complexes, and PDB structures with more than one protein. The "sets_items" are the members of the evidence sets. An interaction exists if two lines have the same set_id. The "sets" contain information about the set_ids, for example, which "collection" they originate from. The "collections" are the different data resources from which STRING imports data (for the channels 'experiments' and 'databases').
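
The rule that two lines sharing a set_id imply an interaction can be sketched in Python as below. The input layout, (set_id, protein_id) tuples, and the function name `pairs_from_set_items` are assumptions made for illustration; the actual column layout of the "sets_items" table may differ.

```python
from collections import defaultdict
from itertools import combinations


def pairs_from_set_items(rows):
    """Derive interacting pairs from (set_id, protein_id) rows (assumed layout)."""
    members = defaultdict(set)
    for set_id, protein_id in rows:
        members[set_id].add(protein_id)
    pairs = set()
    for proteins in members.values():
        # Every pair of members within one set counts as an interaction.
        for a, b in combinations(sorted(proteins), 2):
            pairs.add((a, b))
    return pairs
```

Sorting the members before pairing means each pair is reported once, regardless of the order in which the rows appear.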

Q:

Can I access STRING using GI numbers? If so, could you use the 90 kD heat shock protein (GI:306891) as an example to show what I should type as the protein name when using an NCBI GI number?

A:

The GI accession numbers are used to track sequence histories in GenBank. STRING does not use these numbers, nor does it keep track of them, mainly because STRING is locus-based. Also, STRING imports its sequences from Ensembl and RefSeq. If you need to cross-reference a particular entry in STRING from a GenBank record, use the accession ID of the GenBank nucleotide record. For example, for the 90 kDa heat shock protein in human, this is M16660, which gives you the following network:

 
          http://string-db.org/version_8_3/newstring_cgi/show_network_section.pl?identifier=9606.ENSP00000329390
          

Q:

Is there a legend or key for the different colored lines? (Is there a specific difference for each color?) (Is there a key for the colored lines in the evidence view?)

A:

Yes, there is a legend for the colors of the lines. It is shown next to the table of "predicted functional partners". Green is for activation, red for inhibition, etc.

Q:

I assume the arrows mean activation and the red perpendicular lines mean repression, but what do the circles at the end of the line represent?

A:

If we know the directionality of the action, it is indicated by the symbol at the end of the edge next to the protein that is acted upon. Down-regulation is a red bar and up-regulation is a green arrow, as you say. A yellow circle means that we know the directionality of the interaction (e.g., "A" acts upon "B"), but we do not know the result of the interaction (i.e., whether it is up- or down-regulation).

Q:

At each node, there are icons inside the protein spheres. Is there a key for these icons? Do the icons represent the different protein functions (DNA binding, enzyme, etc.)

A:

The icons do not have any particular meaning other than that there is a structure associated with them. This can be either a PDB entry for the protein itself or a close homolog. If no PDB entry exists, we check whether a structure is available by homology modeling from SWISS-MODEL. A small bubble (without icon) means that there is no structural information available.

Q:

I want to download the data for a particular network that I have found while browsing the STRING web-interface

A:

You can save this data by clicking "save" in the navigation bar directly under the network. This will take you to a page where you can choose to download your data in a number of formats. The simplest to use is probably "Text Summary (TXT - simple tab delimited flatfile)".

Q:

I need all the interactions for a particular organism.

A:

You can download all data from the download section (http://string-db.org/newstring_cgi/show_download_page.pl). There you can download the file "protein.links.v8.2.txt.gz", which you can parse using the taxon_id of your organism of interest (it can be found at http://www.uniprot.org/taxonomy/). Note that you have to use a strain that exists in STRING (e.g., 184922). You can then simply grep for your organism of interest (assuming you are using {un,lin}ux or mac).

 
          zgrep ^"184922\." protein.links.v8.2.txt.gz
          

The first two columns are the identifiers of the two interactors and the third is the confidence score multiplied by 1000.

Q:

How to extract high confidence (>0.7) interactions from information on "combined score" in "protein.links.v8.3.txt.gz"

A:

Here you can simply use awk to condition on the third column that contains the combined_score. Note that the scores are multiplied by 1000 to make them integers. I also assume that you only want evidence from human. Try the following:

 
          zgrep ^"9606\." protein.links.v8.3.txt.gz | awk '($3 > 700) {print}'
          

Q:

How to retrieve only the direct evidence in human, not transferred.

A:

You need the file "protein.links.full.v8.3.txt.gz", from which you can retrieve the relevant columns as above and write them to a file.

 
          zgrep ^"9606\." protein.links.full.v8.3.txt.gz  | awk '($16 > 700) { print $1, $2, $3, $5, $6, $7, $8, $10, $12, $14, $16 }' > PPI_700_human.txt
          

The first and second columns contain the STRING external identifiers. The last column contains the integrated score, including the homology-transferred evidence.

Q:

In the file: "protein.links.v8.3.txt" are the scores multiplied by 1000?

A:

Yes, the scores are multiplied by a factor of 1000 (and truncated). A value of 872 in the file means a STRING score of 0.872.

Q:

Are the colors assigned to nodes significant?

A:

The node colors have no particular meaning. They are used as a visual aid to identify which node goes with which description in the list of input proteins and interactors below the network.

Q:

Why are some nodes smaller and some nodes bigger?

A:

The different size of a node only reflects that there is structural information associated with the protein (i.e., the node is larger to fit the thumbnail picture).

Q:

How do I map my proteins to STRING identifiers?

A:

You can use the file of protein aliases available from the download page, protein.aliases.v8.3.txt.gz. This file has four columns: species_ncbi_taxon_id, protein_id, alias, source. To figure out the STRING identifier for trpB in E. coli K12, you can do something like this in your terminal:

 
          zgrep ^83333 protein.aliases.v8.3.txt.gz | grep trpB
          

which would return:

 
          83333	b1261	trpB	BLAST_UniProt_GN RefSeq
          

From this you can get the STRING name by concatenating the first two columns with a period (83333.b1261).
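
The lookup and the concatenation can also be sketched in Python. The function name `string_ids_for_alias` is hypothetical, and the file is assumed to be tab-separated with the four columns listed above. Unlike the grep command, this sketch matches the alias exactly rather than as a substring.

```python
import gzip


def string_ids_for_alias(path, taxon_id, alias):
    """Return STRING identifiers (taxon.protein_id) whose alias matches exactly."""
    matches = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            species, protein_id, name = fields[0], fields[1], fields[2]
            if species == str(taxon_id) and name == alias:
                # Join taxon id and protein id with a period.
                matches.append(f"{species}.{protein_id}")
    return matches
```

For the example above, `string_ids_for_alias("protein.aliases.v8.3.txt.gz", 83333, "trpB")` would be expected to include 83333.b1261.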

Q:

Is there an automatic way of mapping proteins to STRING? I need mappings for more than three thousand proteins.

A:

A convenient way of mapping your proteins to STRING entries is to use the STRING API. As an example, for a single protein, the alias can be retrieved by:

 http://string-db.org/api/tsv/resolve?identifier=trpA&species=83333

Alternatively, instead of making one call per protein, you can resolve all the identifiers for a list of proteins in a single call (separated by '%0D'):

 http://string-db.org/api/tsv/resolveList?identifiers=trpA%0DtrpB&species=83333

In such cases you may run into the length limit of the URL, but this can be circumvented by sending the request as an HTTP POST request. For example, using cURL:

 
          curl -d "identifiers=trpA%0DtrpC%0DtrpB%0DtrpD&species=83333" string-db.org/api/tsv/resolveList
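
The same POST request can be sketched in Python using only the standard library. The helper names `build_resolve_request` and `resolve` are hypothetical; the endpoint and parameter names are taken from the cURL example, and the `http://` scheme is assumed from the other URLs in this FAQ.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint taken from the cURL example; the scheme is an assumption.
API_URL = "http://string-db.org/api/tsv/resolveList"


def build_resolve_request(identifiers, species):
    """Build a POST request that resolves a list of protein names."""
    # Joining with a carriage return makes urlencode emit '%0D' between names.
    body = urlencode({"identifiers": "\r".join(identifiers), "species": species})
    # Supplying a request body makes urllib send the request as POST.
    return Request(API_URL, data=body.encode("ascii"))


def resolve(identifiers, species):
    """Send the request and return the TSV response as text (needs network access)."""
    with urlopen(build_resolve_request(identifiers, species)) as response:
        return response.read().decode("utf-8")
```

Because the identifiers travel in the request body, the list can be much longer than a URL would allow. For several thousand proteins it may still be sensible to send them in batches.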
          

Q:

I retrieve protein interactions from the STRING website via web API calls. What do the score columns mean (for example, nscore, fscore, tscore, etc.)?

A:

Here is a summary.

nscore - neighborhood score (computed from the inter-gene nucleotide count).

fscore - fusion score (derived from fused proteins in other species).

pscore - co-occurrence score of the phyletic profile (derived from similar absence/presence patterns of genes).

hscore - homology score, the degree of homology of the interactors (trivial and normally not reported in STRING).

ascore - coexpression score (derived from similar pattern of mRNA expression measured by DNA arrays and similar technologies).

escore - experimental score (derived from experimental data, such as, affinity chromatography).

dscore - database score (derived from curated data of various databases).

tscore - textmining score (derived from the co-occurrence of gene/protein names in abstracts).

Q:

How do I select a reasonable score cut-off value for my analysis?

A:

You can use the score cut-off to limit the interactions to those that have higher confidence and are more likely to be true positives. Setting the cut-off lower will increase coverage but also the fraction of false positives. In the end you have to choose a somewhat arbitrary value based on the number of interactions you need for your analysis.
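
One way to see the trade-off is to count how many interactions survive at a few candidate cut-offs. The sketch below assumes the scores are the integers stored in the links file (score multiplied by 1000). The function name `count_above_cutoffs` and the default candidate cut-offs are illustrative assumptions, not recommended values.

```python
def count_above_cutoffs(scores, cutoffs=(0.4, 0.7, 0.9)):
    """Count interactions whose score exceeds each cut-off.

    `scores` are integers as stored in the links file (score * 1000).
    The default cut-offs are illustrative assumptions only.
    """
    counts = {}
    for cutoff in cutoffs:
        threshold = round(cutoff * 1000)  # compare as integers to avoid float issues
        counts[cutoff] = sum(1 for score in scores if score > threshold)
    return counts
```

A sharp drop in the count between two neighboring cut-offs indicates that the choice strongly affects network size in that range.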

Q:

How do I import several interactions from STRING into Cytoscape?

A:

Cytoscape supports the "tab separated values" file format. Download the "protein.links" file (from the STRING download page), extract the interactions you want (use grep or copy-paste), and load the processed file into Cytoscape.
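
The conversion to a tab-separated file can be sketched in Python as below. The function name `links_to_tsv` and the header names are assumptions; the input is taken to be already extracted, whitespace-separated lines of the form "protein1 protein2 score".

```python
import csv


def links_to_tsv(lines, out_path):
    """Write 'protein1 protein2 score' lines as a tab-separated file with a header."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        # Header names are illustrative; pick whatever suits your import.
        writer.writerow(["protein1", "protein2", "combined_score"])
        for line in lines:
            fields = line.split()
            if len(fields) >= 3:  # ignore blank or incomplete lines
                writer.writerow(fields[:3])
```

The resulting file can then be loaded through Cytoscape's table or network import, mapping the first two columns to source and target nodes.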