SimGenex

From Transsyswiki

Introduction

Gene expression measurements determine the amount of product of one or more genes in a biological sample. The amount or concentration of a gene product is called the expression level of the gene that encodes the product. The set of expression levels of a given gene, measured in different samples, is called the expression profile (or profile, for short) of that gene. The set of expression levels of all genes in all samples is called an expression set, or, in recognition of the “genes × conditions” format of the set, an expression matrix.

Samples for gene expression measurement are typically cultivated under controlled conditions. While the exact conditions depend on the object of research and the specific research question, the properties that are subject to control can generally be classified into genetic properties and environmental conditions. Genetic properties pertain to the genetic makeup of the subjects. Specifically, genes may be knocked out (loss of function mutations), or they may be overexpressed (gain of function mutations). There is a wide range of environmental conditions that biological subjects may be exposed to. A frequent condition is treatment with some agent, such as a hormone, drug, or other effector.

Measured gene expression levels are subjected to various mathematical operations. It is common practice to work in the logarithmic domain (i.e. to take the logarithm of the raw expression levels), because up- and down-regulation can be directly compared with such “logarithmised” values. Gene expression measurement can sometimes produce negative values as an artifact. This must be addressed before values are transformed to the logarithmic domain. Adding a small offset is a simple remedy for this problem.

Once gene expression levels are adequately conditioned, expression profiles can be compared. Quantitatively, comparison takes place by defining a distance measure that quantifies how dissimilar two profiles are. Two straightforward distance measures are the Euclidean distance and the correlation distance (which is a semi-metric distance), defined as 1 − r(g1, g2), where r(g1, g2) denotes the correlation coefficient between the expression profiles of genes g1 and g2. The sum of the distances between corresponding expression profiles is a distance between two expression sets.

The transsys framework provides a basis for simulating regulatory networks with different genetic properties, and for deriving loss or gain of function variants of a given regulatory network by removing or adding genes, respectively. Different environmental conditions can be simulated by designating factors that are subject to external alteration, and using different settings of the expression levels of these factors to simulate different conditions.

The language defined here is designed to enable succinct and flexible specification of such biological processes and experimental procedures in silico that result in a simulated expression matrix, and also to specify a distance measure to compare the simulated matrix to a target matrix comprised of expression data that is externally provided (i.e. not generated by way of simulation). The target matrix is also called the empirical matrix. In addition to this specification, a transsys program, called the candidate program, that models the regulatory network is required to carry out the simulation. Candidate programs must satisfy certain criteria in order to be suitable for simulation according to a specification. Specifically, the transsys program needs to have the factors and genes that are specified by name in the simulation protocol specification. Within these requirements, candidate programs can be freely chosen.
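The distance measures described in the introduction can be illustrated with a short Python sketch. It is not taken from the transsys code base: the function names are hypothetical, profiles are plain lists of numbers, and r is assumed to be the Pearson correlation coefficient, which the text does not state explicitly.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient (assumed meaning of r in the text).

    Raises ZeroDivisionError if either profile is constant.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def correlation_distance(x, y):
    """Correlation distance 1 - r between two expression profiles."""
    return 1.0 - pearson(x, y)

def euclidean_distance(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def expression_set_distance(simulated, target, profile_distance):
    """Sum of per-gene profile distances between two expression sets.

    Both arguments map a gene name to its profile (hypothetical layout).
    """
    return sum(profile_distance(simulated[g], target[g]) for g in target)
```

With this layout, comparing a simulated matrix to an empirical one reduces to a call such as expression_set_distance(simulated, empirical, correlation_distance).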

Objective Function Specification

Language Structure

The Top Level

objectivespec ::= knockout_treatment_magic knockout_treatment_objectivespec

The knockout_treatment_magic is a string that identifies the file as an objective function specification. The knockout_treatment_objectivespec contains the objective function specification.

knockout_treatment_objectivespec ::= globalsettings genemapping_def procedure_defs simexpression_defs arraymapping_def

An objective function specification is comprised of a set of global settings, a gene mapping, procedures, definitions of simulated gene expression samples, and finally specifications of how simulations of the empirical data, called arrays, are computed from the simulated gene expression samples.

Global Settings

globalsettings ::= globalsettings_header globalsettings_body globalsettings_footer NEWLINE

globalsettings_header ::= "globalsettingdefs" NEWLINE
globalsettings_body ::= transformation_def distancemeasure_def offset_def
globalsettings_footer ::= "endglobalsettingdefs" NEWLINE

transformation_def ::= "transformation:" transformation_type NEWLINE
transformation_type ::= "log" | "none"
distancemeasure_def ::= "distance:" distance_type NEWLINE
distance_type ::= "correlation" | "euclidean" | "sum_squares"
offset_def ::= "offset:" realnumber | "offset:" realnumber "data_sd"

The global settings contain the following parameters:

• transformation sets the transformation of the expression levels. The log transformation applies the log2 function to expression values.
• distance selects the method for computing distances between expression profiles.
• offset specifies an offset to be added to expression levels. The offset can either be specified as an absolute value, or as a multiple of the standard deviation of the set of all expression levels.
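The effect of the offset and transformation settings can be sketched in Python as follows. The function names are hypothetical, and two details are assumptions rather than statements of the specification: the population standard deviation is used for the data_sd variant, and the offset is added even when the transformation is none.

```python
import math
import statistics

def resolve_offset(value, levels, relative_to_sd=False):
    """Return the absolute offset to add to expression levels.

    With relative_to_sd=True the value is interpreted as a multiple of the
    standard deviation of all expression levels (the "data_sd" form).
    Assumption: population standard deviation.
    """
    if relative_to_sd:
        return value * statistics.pstdev(levels)
    return value

def transform(levels, offset, transformation="log"):
    """Add the offset, then apply the selected transformation."""
    shifted = [x + offset for x in levels]
    if transformation == "log":
        # log2, as stated for the "log" transformation
        return [math.log2(x) for x in shifted]
    if transformation == "none":
        # assumption: the offset still applies without log transformation
        return shifted
    raise ValueError("unknown transformation: " + transformation)
```

For example, transform([1, 3, 7], offset=1.0) shifts the levels to 2, 4 and 8 before taking log2.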

Gene mapping

genemapping_def ::= genemapping_header genemapping_body genemapping_footer NEWLINE

genemapping_header ::= "genemapping" NEWLINE
genemapping_body ::= (factor_def NEWLINE)+
genemapping_footer ::= "endgenemapping" NEWLINE

factor_def ::= "factor" identifier "=" (gene_manufacturer_identifier)+ NEWLINE

The gene mapping maps transsys factor names (notice: not gene names) to the identifiers that label the profiles in the empirical matrix (the so-called gene manufacturer identifiers). The identifiers on the left hand side must be names of factors in the candidate program. No factor may be mapped more than once. If there are multiple profiles in the empirical matrix that correspond to one factor, these may be specified as a whitespace separated list. In this case, the average profile is used for comparison.
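A minimal Python sketch of the averaging step follows. The data layout (dictionaries keyed by identifier) and the function names are hypothetical, and the average is assumed to be the arithmetic mean taken sample by sample; the text does not say in which domain the averaging takes place.

```python
def average_profile(empirical, identifiers):
    """Element-wise arithmetic mean of the empirical profiles listed for one factor."""
    profiles = [empirical[i] for i in identifiers]
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]

def map_factors(gene_mapping, empirical):
    """Build one target profile per mapped factor.

    gene_mapping maps a factor name to a list of gene manufacturer identifiers;
    empirical maps each identifier to its profile (hypothetical layout).
    """
    return {factor: average_profile(empirical, identifiers)
            for factor, identifiers in gene_mapping.items()}
```

A factor mapped to a single identifier simply receives that identifier's profile unchanged.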

Procedures

procedure_defs ::= (procedure_def)+

procedure_def ::= procedure_header procedure_body procedure_footer NEWLINE

procedure_header ::= "procedure" identifier NEWLINE
procedure_body ::= (instruction)+
instruction ::= procedure_identifier | primary_instruction

primary_instruction ::= "knockout:" identifier NEWLINE
| "runtimesteps:" integer NEWLINE
| "treatment:" identifier "=" realnumber NEWLINE
| "overexpress:" identifier "=" realnumber NEWLINE
procedure_footer ::= "endprocedure" NEWLINE
procedure_identifier ::= identifier

A procedure specifies a sequence of operations to be performed on a transsys instance. Operations are specified either by primary instructions or by other procedures. Primary instructions specify elementary operations that the simulator knows how to perform. These are:

• runtimesteps runs the specified number of time steps to create a new transsys instance.
• knockout removes the specified gene from the transsys program. The identifier must be the name of a gene in the candidate program. The knockout affects gene expression simulation (via the runtimesteps instruction) issued subsequently to the knockout instruction. Notice that the knockout operation modifies the candidate program.
• treatment takes the name of a factor and a value that the expression level of the factor is to be set to. This operation is applied to the current transsys instance, overwriting the previous expression level of the factor. Subsequently, the expression dynamics of the factor will be determined by the candidate program.
• overexpress inserts a new gene into the candidate program. The identifier is the name of an existing gene in the candidate program. The new gene encodes the same product as the specified existing gene, and has a promoter comprised of one constitutive element, expressing the gene at the specified rate.

An identifier in a procedure body identifies another procedure to be invoked. Invoking another procedure results in execution of the instructions in the other procedure’s body. By recursively applying this rule, a procedure ultimately reduces to a sequence of primary instructions. It is an error for a procedure to refer to itself, or to any procedure that eventually invokes itself, as infinite recursion would occur in this case.

It is an error if an identifier does not reference an existing procedure. Procedures may be listed in any order, so it is legal to reference a procedure before it is defined.
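The reduction of a procedure to primary instructions, including the two error conditions, can be sketched in Python. The representation is hypothetical: a primary instruction is a tuple such as ("runtimesteps", 100), a reference to another procedure is a plain string, and procedures are held in a dictionary so that definition order does not matter.

```python
def resolve(name, procedures, active=()):
    """Expand the named procedure into a flat list of primary instructions.

    procedures maps a procedure name to its body. `active` holds the chain of
    procedures currently being expanded and is used to detect recursion.
    """
    if name not in procedures:
        raise KeyError("unknown procedure: " + name)
    if name in active:
        raise ValueError("recursive procedure: " + name)
    resolved = []
    for instruction in procedures[name]:
        if isinstance(instruction, tuple):
            # primary instruction, kept as is
            resolved.append(instruction)
        else:
            # reference to another procedure, expanded in place
            resolved.extend(resolve(instruction, procedures, active + (name,)))
    return resolved
```

Because the lookup happens only at expansion time, a procedure body may name a procedure that appears later in the specification.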

Gene Expression Sample Simulation

simexpression_defs ::= (simexpression_def)+

simexpression_def ::= simexpression_header simexpression_body simexpression_footer NEWLINE

simexpression_header ::= "simexpression" identifier NEWLINE
simexpression_body ::= (identifier NEWLINE)+
simexpression_footer ::= "endsimexpression" NEWLINE

Simulations of gene expression, or “simexpressions” for short, describe a simulation procedure to produce a transsys instance. The idea is that the simulation procedure models the genetic makeup and the relevant conditions and (possibly) experimental manipulations experienced by a biological object. If the candidate program is a good model of the gene regulatory network in the biological object, the expression levels in the transsys instance are expected to be similar to those measured in the biological object.

The sequence of identifiers in the body of a simexpression is resolved to a sequence of primary instructions, as described for procedures in the Procedures section above. The sequence of primary instructions is applied to a transsys instance of the unmodified candidate program, with all expression levels starting at 0. (Note: Future extensions may provide mechanisms for specifying the initial state of the instance.) Identifiers in a simexpression body must identify procedures. Invocation of other simexpressions is an error.

Arrays

arraymapping_def ::= arraymapping_header arraymapping_body arraymapping_footer

arraymapping_header ::= "arraymapping" NEWLINE
arraymapping_body ::= (arraymapping_spec)+
# identifier following "array" identifies an array in the empirical data set
# identifier following ":" identifies a simexpression
# identifier following "/" identifies a simexpression to be used as a ratio reference
arraymapping_spec ::= "array" identifier ":" identifier ( "/" identifier)? NEWLINE
arraymapping_footer ::= "endarraymapping" NEWLINE

Arrays are columns in the simulated expression matrix. They are computed by subjecting the expression levels in one or more simexpressions to mathematical operations, resulting in a column containing one value for each mapped factor of the candidate program. The idea is that the mathematical operations should be the same as those applied to the raw empirical data that have resulted in the empirical expression matrix. At this time, the only operations are copying a simexpression or computing the element-wise ratio of two simexpressions. The latter serves to realise logratios. Array mapping operations are applied to values that have already been subjected to offset addition and log transformation [1].
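The two array operations can be sketched in Python as shown here. The function name and the dictionary layout are hypothetical. The sketch follows the text literally and assumes that the values passed in have already been offset and transformed, so it performs no transformation of its own.

```python
def compute_array(simexpressions, sample, reference=None):
    """Compute one column of the simulated expression matrix.

    simexpressions maps a simexpression name to a dict of factor name ->
    (already conditioned) expression level. Without a reference the sample is
    copied; with a reference the element-wise ratio sample / reference is
    returned. A reference value of zero raises ZeroDivisionError.
    """
    values = simexpressions[sample]
    if reference is None:
        return dict(values)
    ref = simexpressions[reference]
    return {factor: values[factor] / ref[factor] for factor in values}
```

In terms of the grammar, the sample corresponds to the identifier following ":" and the reference to the optional identifier following "/".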

WhiteList

whitelist ::= whitelist_header whitelist_body whitelist_footer NEWLINE

whitelist_header ::= "whitelistdefs" NEWLINE
whitelist_body ::= (whitelist_instruction)+
whitelist_footer ::= "endwhitelistdefs" NEWLINE
whitelist_instruction ::= "factor: " (identifier)+ | "gene: " (identifier)+

WhiteList terms are those factors or genes in a transsys program whose initial conditions must not be changed by the optimiser when performing model selection. There are a number of reasons why this option is available. One of them is the use of mathematical expressions rather than single values to specify the conditions under which a biochemical reaction might take place.

Terminal Tokens (Lexical Structure)

knockout_treatment_magic ::= "KnockoutTreatmentObjectiveSpecification-0.1" NEWLINE NEWLINE

gene_manufacturer_identifier ::= doublequote character_but_not_doublequote+ doublequote
realnumber ::= digit_sequence "." unsigned_digit_sequence? scale_factor?
| digit_sequence scale_factor
digit_sequence ::= sign? unsigned_digit_sequence
unsigned_digit_sequence ::= digit+
scale_factor ::= ("E" | "e") digit_sequence
sign ::= "+" | "-"
digit ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
identifier ::= (letter | "_") (letter | digit | "_")*
letter ::= "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K"| "L"
| "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W"| "X" | "Y"
| "Z"| "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k"| "l"
| "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
doublequote ::= the doublequote character (chr(34))

References

[1] Helen Parkinson, Misha Kapushesky, Nikolay Kolesnikov, Gabriella Rustici, Mohammad Shojatalab, Niran Abeygunawardena, Hugo Berube, Miroslaw Dylag, Ibrahim Emam, Anna Farne, Ele Holloway, Margus Lukk, James Malone, Roby Mani, Ekaterina Pilicheva, Tim F. Rayner, Faisal Rezwan, Anjan Sharma, Eleanor Williams, Xiangqun Zheng Bradley, Tomasz Adamusiak, Marco Brandizi, Tony Burdett, Richard Coulson, Maria Krestyaninova, Pavel Kurnosov, Eamonn Maguire, Sudeshna Guha Neogi, Philippe Rocca-Serra, Susanna-Assunta Sansone, Nataliya Sklyar, Mengyao Zhao, Ugis Sarkans, and Alvis Brazma. ArrayExpress update: from an archive of functional genomics experiments to the atlas of gene expression. Nucl. Acids Res., 37:868–872, 2009.

Future Perspectives

Optimisation by Reusing Prefix Instruction Sequences

Simexpressions reduce to sequences of primary instructions. If two simexpressions share a prefix (i.e. they start with the same sequence of instructions) and the candidate program is deterministic (i.e. it does not use any of the random number generation functions provided by transsys), the two simexpressions can be computed by first computing the prefix sequence and then using the result as a starting point for computing both the first and the second simexpression. Depending on the execution time of the prefix, this can be a significant optimisation (e.g. where many time steps are used for equilibration of a transsys instance).

As a special case, one simexpression may be a prefix of another. When a set of simexpressions simulates a time series, they will form nested prefixes. To fully exploit the optimisation potential, especially for time series as described above, an optimiser must be able to break up a runtimesteps n instruction into the instructions runtimesteps n1 and runtimesteps n2, where n1 + n2 = n, as appropriate.

It is important to notice that the validity of this optimisation depends on the assumption that the candidate program is deterministic. This assumption may not hold, so the option of simulating all simexpressions independently must be retained. If the transsys program object provided a method to detect whether the program is deterministic, simulators could use that method to automatically activate the prefix optimisation as appropriate.
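Finding the shared prefix of two resolved simexpressions, including the splitting of runtimesteps instructions, can be sketched in Python. This is an illustration of the idea only and not an existing transsys feature. Instructions are represented as hypothetical tuples such as ("runtimesteps", 100), and the candidate program is assumed to be deterministic.

```python
def merge_runs(instructions):
    """Merge adjacent runtimesteps instructions into a single one."""
    merged = []
    for op in instructions:
        if op[0] == "runtimesteps" and merged and merged[-1][0] == "runtimesteps":
            merged[-1] = ("runtimesteps", merged[-1][1] + op[1])
        else:
            merged.append(op)
    return merged

def shared_prefix(a, b):
    """Longest instruction sequence that both simexpressions start with.

    If the sequences diverge at two runtimesteps instructions with different
    step counts, the smaller count is still shared: the longer run is
    conceptually split into runtimesteps n1 and runtimesteps n2.
    """
    prefix = []
    for x, y in zip(merge_runs(a), merge_runs(b)):
        if x == y:
            prefix.append(x)
        elif x[0] == y[0] == "runtimesteps":
            prefix.append(("runtimesteps", min(x[1], y[1])))
            break
        else:
            break
    return prefix
```

For a time series sampled after 10 and after 25 time steps, the shared prefix is the first 10 steps, so the second simexpression only needs 15 further steps from the cached state.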

Toy problem

Toy transsys program

transsys b
{
 factor f 
 { 
   decay: 0.1; 
   diffusibility: 0.1; 
 }
 
 gene g
 {
   promoter
   {
     f: activate(1, 10);
   }
   product
   {
     default: f;
   }
 } 
}

Toy Simgenex spec

ObjectiveSpecification-0.1


globalsettingdefs
transformation: log
distance: correlation
offset: 1e-10
endglobalsettingdefs

whitelistdefs
factor: "f"
gene: "g"
endwhitelistdefs

genemapping
factor f = "257221_at" "257222_at" "257223_at"
endgenemapping

procedure equilibration
runtimesteps: 100
endprocedure

procedure ko_g
knockout: g
endprocedure

simexpression wt
equilibration
endsimexpression

simexpression g
ko_g
equilibration
endsimexpression

arraymapping
array gko : g / wt
endarraymapping

Toy empirical expression data

	wt	g
f	3.314894e-56	6.537199e-01