SimGenex

From Transsyswiki
Jump to navigationJump to search

Introduction

Gene expression measurements are measurements that determine the amount of product of one or more genes in a biological sample. The amount or concentration of a gene product is called the expression level of the gene that encodes the product. Samples for gene expression measurement are typically cultivated at controlled conditions. While the exact conditions depend on the object of research and the specific research question, the properties that are subject to control can generally be classified into genetic properties and environmental conditions. The set of expression levels of a given gene, measured in different samples, is called the expression profile (or profile, for short) of that gene. The set of expression levels of all genes in all samples is called an expression set, or, in recognition of the “genes × conditions” format of the set, an expression matrix. Genetic properties pertain to the genetic makeup of the subjects. Specifically, genes may be knocked out (loss of function mutations), or they may be overexpressed (gain of function mutations). There is a wide range of environmental conditions that biological subjects may be exposed to. A frequent condition is treatment with some agent, such as a hormone, drug, or other effector. Gene expression levels that have been measured are subjected to various mathematical operations. It is common practice to work in the logarithmic domain (i.e. to take the logarithm of the raw expression levels), because upand down-regulation can be directly compared with such “logarithmised” values. Gene expression measurement can sometimes produce negative values as an artifact. This must be addressed before values are transformed to the logarithmic domain. Adding a small offset is a simple remedy of this problem. Once gene expression levels are adequately conditioned, expression profiles can be compared. Quantitatively, comparison takes place by defining a distance measure that quantifies how dissimilar two profiles are. Two straightforward distance measures are the Euclidean distance and the correlation distance (which is a semi-metric distance), defined as 1 − r(g1, g2), where r(g1, g2) denotes correlation coefficient between the expression profiles of genes g1 and g2. The sum of distances of expression profiles is a distance between two expression sets. The transsys framework provides a basis for simulating regulatory networks with different genetic properties, and for deriving loss or gain of function variants of a given regulatory network by removing or adding genes, respectively. Different environmental conditions can be simulated by designating factors that are subject to external alteration, and using different settings of the the expression levels of these factors to simulate different conditions. The language defined here is designed to enable succinct and flexible specification of such biological processes and experimental procedures in silico that result in a simulated expressionmatrix, and also to specify a distancemeasure to compare the simulated matrix to an target matrix comprised of expression data that is externally provided (i.e. not generated by way of simulation). The target matrix is also called the empirical matrix. In addition to this specification, a transsys program, called the candidate program, that models the regulatory network is required to carry out the simulation. Candidate programs must satisfy certain criteria in order to be suitable for simulation according to a specification. Specifically, the transsys program needs to have factors and genes that are specified by name in the simulation protocol specification. Within these requirements, candidate programs can be freely chosen.

Objective Function Specification

Language Structure

simgenex_def ::= procedure_defs simexpression_defs measurementmatrix_def discriminationsettings_def

The core of a SimGenex program describes how to use a transsys GRN model to produce a simulated gene expression matrix from the definition of a set simexpressions of primary operations that are sufficiently general to simulate most standard experimental procedures. The measurementmatrix block describes how to transform the primary simulated matrix into a measurement matrix by e.g.\ computing log-ratios. Finally, the discriminationsettings block configures computation of the distance of the measurement matrix to a target matrix.

Procedures

procedure_defs ::= (procedure_def)+

procedure_def ::= procedure_header procedure_body procedure_footer NEWLINE

procedure_header ::= "procedure" identifier NEWLINE
procedure_body ::= (instruction)+
instruction ::= procedure_identifier | primary_instruction

primary_instruction ::= "knockout:" identifier NEWLINE
| "runtimesteps:" integer NEWLINE
| "treatment:" identifier "=" realnumber NEWLINE
| "overexpress:" identifier "=" realnumber NEWLINE
procedure_footer ::= "endprocedure" NEWLINE
procedure_identifier ::= identifier

A procedure specifies a sequence of operations to be performed on a transsys instance. Operations are specified either by primary instructions or by other procedures. Primary instructions specify elementary operations that the simulator knows to perform. These are: • runtimesteps runs the specified number of time steps to create a new transsys instance. • knockout removes the specified gene from the transsys program. The identifiermust be the name of a gene in the candidate program. The knockout affects gene expression simulation (via the runtimesteps instruction) issued subsequently to the knockout instruction. Notice that the knockout operation modifies the candidate program. • treatment takes the name of a factor and a value that the expression level of the factor is to be set to. This operation is applied to the current transsys instance, overwriting the previous expression level of the factor. Subsequently, the expression dynamics of the factor will be determined by the candidate program. • overexpress inserts a new gene into the candidate program. The identifier is the name of an existing gene in the candidate program. The new gene encodes the same product as the specified existing gene, and has a promoter comprised of one constitutive element, expressing the gene at the specified rate. An identifier in a procedure body identifies another procedure to be invoked. Invoking another procedure results in execution of the instruction in the other procedure’s body. By recursively applying this rule, a procedure ultimately reduces to a sequence of primary instructions. It is an error for a procedure to refer to itself, or to any procedure that eventually invokes itself, as infinite recursion would occur in this case.

It is an error if an identifier does not reference an existing procedure. Procedures may be listed in any order, so it is legal to reference a procedure before it is defined.

Simulating Gene Expression

simexpression_defs ::= (simexpression_def)+

simexpression_def ::= simexpression_header simexpression_body simexpression_footer NEWLINE

simexpression_header ::= "simexpression" identifier NEWLINE
simexpression_body ::= (identifier NEWLINE)+
simexpression_footer ::= "endsimexpression" NEWLINE

Simulations of gene expression, or ``simexpressions for short, describe a simulation procedure to produce a transsys instance. The idea is that the simulation procedure models the genetic makeup and the relevant conditions and (possibly) experimental manipulations experienced by a biological object. If the candidate program is a good model of the gene regulatory network in the biological object, the expression levels in the transsys instance are expected to be similar to those measured in the biological object.

Like procedures, simexpressions may be composed of primary instructions and procedure invocations. In addition, they also may contain foreach instructions. Such simexpressions define multiple columns in the simulated matrix. The foreach instruction enables very compact specifications of setups in which a number of strains are subjected to the same set of experimental conditions. For example, the declaration:

 simexpression s
 {
   foreach: wildtype komutant;
   equilibration;
   foreach mock real;
   onehour;
 }

specifies four columns in which the genotypes wildtype and komutant are subjected to mock and the real treatment. The procedures komutant, mock and rea} have to be defined in order for the above code fragment to work.

The sequence of identifiers in the body of a simexpression is resolved to a sequence of primary instructions, as described for procedures in section section procedures. The sequence of primary operations is applied to a transsys instance of the unmodified candidate program, with all expression levels starting at 0. (Note: Future extensions may provide mechanisms for specifying the initial state of the instance.)

Identifiers in a simexpression body must identify procedures. Invocation of other simexpressions is an error.


Computing the Simulated Matrix

 measurementmatrix_def ::= "measurementmatrix" 
   "{" measurementprocess_def measurementcolums_def "}"

As in the wet lab scenario, the columns of a matrix simulated by SimGenex need to be transformed following the same protocols that were applied to compute the target matrix of empirical data. SimGenex uses the following blocks within the measurementmatrix section to specify such procedures: • measurementprocess: specifies how individual gene expression values are normalised. offset how expression values are transformed to simulate a column in a gene expression matrix transformation • measurementcolumns: specifies the columns in the simulated expression matrix. Columns are computed by subjecting the expression levels in one or more simexpressions to mathematical operations, resulting in a column containing one value for each mapped factor of the candidate program. The idea is that the mathematical operations should be the same as those applied to the raw empirical data that have resulted in the empirical expression matrix.

Discrimination Settings and Gene mapping

 discriminationsettings_def ::= "discriminationdettings"
 "{" genemapping distance_def whitelist_def "}"

 genemapping_def ::= genemapping_header genemapping_body genemapping_footer
 genemapping_header ::= "genemapping"
 genemapping_body ::= (factor_def )+
 genemapping_footer ::= "endgenemapping"
 factor_def ::= "factor" identifier "=" (gene_manufacturer_identifier)+

genemapping specifies names of genes in the computational model that can be mapped to names in the target matrix, which may e.g. be IDs designated by the microarray provider. If there are multiple profiles in the empirical matrix that correspond to one factor, these may be specified as a whitespace separated list. In this case, the average profile is used for comparison. The identifiers of the right hand side must be names of factors in the candidate program. No factor may be mapped more than once.

 distancemeasure_def ::= "distance:" distance_type
 distance_type ::= "correlation" | "euclidean" | "sum_squares"

SimGenex allows the specification of a distance measure to compare the simulated matrix to a target matrix which can e.g. be used to discriminate the best GRN model from among a number of candidates. In addition, as in the real scenario, SimGenex allows the specification of a mapping scheme.

 whitelist_def ::= whitelist_header whitelist_body whitelist_footer
 whitelist_header ::= "whitelistdefs"
 whitelist_body ::= whitelist_factor_def whitelist_gene_def
 whitelist_footer ::= "endwhitelistdefs"
 whitelist_factor_def ::= "factor" ":" (identifier)+
 whitelist_gene_def ::= "gene: " (identifier)+
 whitelist specifies which factors or genes a discriminator may adjust. This feature is useful where parts of

the GRN model are unknown, and the discriminator should therefore explore various alternatives for the unknown parts. As an example, where numerical parameters are unknown, these can be set by numerical optimisation. WhiteList terms are those factors or genes, in a transsys program, whose initial conditions must not be changed by the optimiser when performing model selection. There are a number of reasons why this option is available,

one of them is the use of mathematical expressions rather than single values to specify under which conditions a biochemical reaction might take place.

Terminal Tokens (Lexical Structure)

SimGenex ::= "SimGenex-0.1" NEWLINE

NEWLINE

gene_manufacturer_identifier ::= doublequote character_but_not_doublequote+ doublequote
realnumber ::= digit_sequence "." unsigned_digit_sequence+ scale_factor+
| digit_sequence scale_factor
digit_sequence ::= sign+ unsigned_digit_sequence
unsigned_digit_sequence ::= digit+
scale_factor ::= ("E" | "e") digit_sequence
sign ::= "+" | "-"
digit ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
identifier ::= (letter | "_") (letter | digit | "_")+
letter ::= "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K"| "L"
| "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W"| "X" | "Y"
| "Z"| "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k"| "l"
| "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
doublequote ::= the doublequote character (chr(34))

References

[1] Helen Parkinson, Misha Kapushesky, Nikolay Kolesnikov, Gabriella Rustici, Mohammad Shojatalab, Niran Abeygunawardena, Hugo Berube, Miroslaw Dylag, Ibrahim Emam, Anna Farne, Ele Holloway, Margus Lukk, James Malone, Roby Mani, Ekaterina Pilicheva, Tim F. Rayner, Faisal Rezwan, Anjan Sharma, Eleanor Williams, Xiangqun Zheng Bradley, Tomasz Adamusiak, Marco Brandizi, Tony Burdett, Richard Coulson, Maria Krestyaninova, Pavel Kurnosov, Eamonn Maguire, Sudeshna Guha Neogi, Philippe Rocca- Serra, Susanna-Assunta Sansone, Nataliya Sklyar, Mengyao Zhao, Ugis Sarkans, and Alvis Brazma. Arrayexpress update: from an archive of functional genomics experiments to the atlas of gene expression. Nucl. Acids Res., 37:868–872, 2009.

Future Perspectives

Optimisation by Reusing Prefix Instruction Sequences

Simexpressions reduce to sequences of primary instructions. If two simexpressions share a prefix (i.e. they start with the same sequence of instructions) and the candidate program is deterministic (i.e. it does not use any of the random number generation functions provided by transsys), the two simexpressions that share a prefix sequence can be computed by first computing the prefix sequence and then using that as a starting point for computing both the first and the second simexpression. Depending on the execution time of the prefix, this can be a significant optimisation (e.g. where many time steps are used for equilibration of a transsys instance). As a special case, one simexpression may be a prefix of another. When a set of simexpressions simulates a time series, they will form nested prefixes. To fully exploit the optimisation potential especially for time series as described above, an optimiser must be able to break up a runtimesteps t instruction into commands runtimesteps n1 and runtimesteps n2, where n1 + n2 = n, as appropriate. It is important to notice that the validity of this optimisation depends on the assumption that the candidate program is deterministic. This may not hold, therefore simulating all simexpressions independently must be retained. If the transsys program object provided a method to detect whether the program is deterministic, simulators could use that method to automatically activate the prefix optimisation as appropriate.

Toy problem

Toy transsys program

transsys b
{
 factor f 
 { 
   decay: 0.1; 
   diffusibility: 0.1; 
 }
 
 gene g
 {
   promoter
   {
     f: activate(1, 10);
   }
   product
   {
     default: f;
   }
 } 
}

Toy Simgenex spec

ObjectiveSpecification-0.1


globalsettingdefs
transformation: log
distance: correlation
offset: 1e-10
endglobalsettingdefs

whitelistdefs
factor: "f"
gene: "g"
endwhitelistdefs

genemapping
factor f = "257221_at" "257222_at" "257223_at"
endgenemapping

procedure equilibration
runtimesteps: 100
endprocedure

procedure ko_g
knockout: g
endprocedure

simexpression wt
equilibration
endsimexpression

simexpression g
ko_g
equilibration
endsimexpression

arraymapping
array gko : g / wt
endarraymapping

Toy empirical expression data

	wt	g
f	3.314894e-56	6.537199e-01

SimGenex Web

simGenex