SimGenex
Introduction
Gene expression measurements determine the amount of product of one or more genes in a biological sample. The amount or concentration of a gene product is called the expression level of the gene that encodes the product. Samples for gene expression measurement are typically cultivated under controlled conditions. While the exact conditions depend on the object of research and the specific research question, the properties that are subject to control can generally be classified into genetic properties and environmental conditions.
The set of expression levels of a given gene, measured in different samples, is called the expression profile (or profile, for short) of that gene. The set of expression levels of all genes in all samples is called an expression set, or, in recognition of the “genes × conditions” format of the set, an expression matrix.
Genetic properties pertain to the genetic makeup of the subjects. Specifically, genes may be knocked out (loss of function mutations), or they may be overexpressed (gain of function mutations). There is a wide range of environmental conditions that biological subjects may be exposed to. A frequent condition is treatment with some agent, such as a hormone, drug, or other effector.
Measured gene expression levels are subjected to various mathematical operations. It is common practice to work in the logarithmic domain (i.e. to take the logarithm of the raw expression levels), because up- and down-regulation can be directly compared with such “logarithmised” values. Gene expression measurement can sometimes produce negative values as an artifact. This must be addressed before values are transformed to the logarithmic domain. Adding a small offset is a simple remedy for this problem. Once gene expression levels are adequately conditioned, expression profiles can be compared. Quantitatively, comparison takes place by defining a distance measure that quantifies how dissimilar two profiles are.
Two straightforward distance measures are the Euclidean distance and the correlation distance (which is a semi-metric distance), defined as 1 − r(g1, g2), where r(g1, g2) denotes the correlation coefficient between the expression profiles of genes g1 and g2. The sum of the distances between corresponding expression profiles is a distance between two expression sets.
The transsys framework provides a basis for simulating regulatory networks with different genetic properties, and for deriving loss or gain of function variants of a given regulatory network by removing or adding genes, respectively. Different environmental conditions can be simulated by designating factors that are subject to external alteration, and using different settings of the expression levels of these factors to simulate different conditions.
The language defined here is designed to enable succinct and flexible specification of such biological processes and experimental procedures in silico that result in a simulated expression matrix, and also to specify a distance measure to compare the simulated matrix to a target matrix comprised of expression data that is externally provided (i.e. not generated by way of simulation). The target matrix is also called the empirical matrix.
In addition to this specification, a transsys program, called the candidate program, that models the regulatory network is required to carry out the simulation. Candidate programs must satisfy certain criteria in order to be suitable for simulation according to a specification. Specifically, the transsys program needs to have the factors and genes that are specified by name in the simulation protocol specification. Within these requirements, candidate programs can be freely chosen.
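The distance measures described above can be sketched in a few lines of Python. This is only an illustration and not part of SimGenex or transsys: the function names and the dictionary layout (gene name mapped to a list of expression levels) are assumptions made for the example, and r is taken to be the Pearson correlation coefficient.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles.
    Undefined (division by zero) if either profile is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def correlation_distance(x, y):
    """Correlation distance 1 - r(g1, g2)."""
    return 1.0 - pearson(x, y)

def euclidean_distance(x, y):
    """Euclidean distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def set_distance(simulated, target, profile_distance):
    """Distance between two expression sets: the sum of the
    profile distances over all genes of the simulated set.
    Both sets are dicts: gene name -> list of expression levels
    (an assumed layout, chosen for this sketch)."""
    return sum(profile_distance(simulated[g], target[g]) for g in simulated)

# Hypothetical data: two genes measured in three samples.
simulated = {"g1": [1.0, 2.0, 3.0], "g2": [0.0, 0.0, 4.0]}
target = {"g1": [2.0, 4.0, 6.0], "g2": [3.0, 0.0, 0.0]}
d_corr = correlation_distance(simulated["g1"], target["g1"])
d_eucl = set_distance(simulated, target, euclidean_distance)
```

The correlation distance ignores scale, so the two g1 profiles above are at distance (approximately) zero, whereas the Euclidean distance is sensitive to absolute levels.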
Objective Function Specification
Language Structure
simgenex_def ::= procedure_defs simexpression_defs measurementmatrix_def discriminationsettings_def
The core of a SimGenex program describes how to use a transsys GRN model to produce a simulated gene expression matrix. It does so by defining a set of simexpressions, built from primary operations that are sufficiently general to simulate most standard experimental procedures. The measurementmatrix block describes how to transform the primary simulated matrix into a measurement matrix, e.g. by computing log-ratios. Finally, the discriminationsettings block configures computation of the distance of the measurement matrix to a target matrix.
Procedures
procedure_defs ::= (procedure_def)+
procedure_def ::= procedure_header procedure_body procedure_footer NEWLINE
procedure_header ::= "procedure" identifier NEWLINE
procedure_body ::= (instruction)+
instruction ::= procedure_identifier | primary_instruction
primary_instruction ::= "knockout:" identifier NEWLINE
    | "runtimesteps:" integer NEWLINE
    | "treatment:" identifier "=" realnumber NEWLINE
    | "overexpress:" identifier "=" realnumber NEWLINE
procedure_footer ::= "endprocedure" NEWLINE
procedure_identifier ::= identifier
A procedure specifies a sequence of operations to be performed on a transsys instance. Operations are specified either by primary instructions or by other procedures. Primary instructions specify elementary operations that the simulator knows how to perform. These are:
• runtimesteps runs the specified number of time steps to create a new transsys instance.
• knockout removes the specified gene from the transsys program. The identifier must be the name of a gene in the candidate program. The knockout affects gene expression simulation (via the runtimesteps instruction) issued subsequently to the knockout instruction. Notice that the knockout operation modifies the candidate program.
• treatment takes the name of a factor and a value that the expression level of the factor is to be set to. This operation is applied to the current transsys instance, overwriting the previous expression level of the factor. Subsequently, the expression dynamics of the factor will be determined by the candidate program.
• overexpress inserts a new gene into the candidate program. The identifier is the name of an existing gene in the candidate program. The new gene encodes the same product as the specified existing gene, and has a promoter comprised of one constitutive element, expressing the gene at the specified rate.
An identifier in a procedure body identifies another procedure to be invoked. Invoking another procedure results in execution of the instructions in the other procedure’s body. By recursively applying this rule, a procedure ultimately reduces to a sequence of primary instructions. It is an error for a procedure to refer to itself, or to any procedure that eventually invokes itself, as infinite recursion would occur in this case.
It is an error if an identifier does not reference an existing procedure. Procedures may be listed in any order, so it is legal to reference a procedure before it is defined.
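The reduction of a procedure to primary instructions, including the two error conditions named above, can be sketched as follows. This is a minimal illustration under assumed data structures, not the actual SimGenex implementation: procedures are held in a dict mapping each name to its body, a primary instruction is represented as a tuple such as ("runtimesteps", 100), and a procedure invocation as a plain string. The function name resolve_procedure is hypothetical.

```python
def resolve_procedure(name, procedures, active=()):
    """Reduce the named procedure to a flat list of primary instructions.

    procedures: dict mapping procedure name -> list of body items, where
    a tuple is a primary instruction and a string invokes another
    procedure (assumed representation for this sketch).
    active: names of the procedures currently being resolved, used to
    detect direct or indirect self-reference.
    """
    if name not in procedures:
        raise ValueError("reference to undefined procedure: " + name)
    if name in active:
        raise ValueError("procedure invokes itself: " + name)
    resolved = []
    for item in procedures[name]:
        if isinstance(item, tuple):
            # Primary instruction: keep as is.
            resolved.append(item)
        else:
            # Procedure invocation: splice in the resolved body.
            resolved.extend(resolve_procedure(item, procedures, active + (name,)))
    return resolved

# Definition order does not matter: ko_equilibrated refers to
# procedures that are listed after it.
procedures = {
    "ko_equilibrated": ["ko_g", "equilibration"],
    "ko_g": [("knockout", "g")],
    "equilibration": [("runtimesteps", 100)],
}
instructions = resolve_procedure("ko_equilibrated", procedures)
```

Because lookup happens only when a procedure is resolved, and not when it is defined, forward references work without any extra bookkeeping.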
Simulating Gene Expression
simexpression_defs ::= (simexpression_def)+
simexpression_def ::= simexpression_header simexpression_body simexpression_footer NEWLINE
simexpression_header ::= "simexpression" identifier NEWLINE
simexpression_body ::= (identifier NEWLINE)+
simexpression_footer ::= "endsimexpression" NEWLINE
Simulations of gene expression, or “simexpressions” for short, describe a simulation procedure to produce a transsys instance. The idea is that the simulation procedure models the genetic makeup and the relevant conditions and (possibly) experimental manipulations experienced by a biological object. If the candidate program is a good model of the gene regulatory network in the biological object, the expression levels in the transsys instance are expected to be similar to those measured in the biological object.
Like procedures, simexpressions may be composed of primary instructions and procedure invocations. In addition, they also may contain foreach instructions. Such simexpressions define multiple columns in the simulated matrix. The foreach instruction enables very compact specifications of setups in which a number of strains are subjected to the same set of experimental conditions. For example, the declaration:
simexpression s { foreach: wildtype komutant; equilibration; foreach: mock real; onehour; }
specifies four columns in which the genotypes wildtype and komutant are subjected to the mock and the real treatment. The procedures komutant, mock and real have to be defined in order for the above code fragment to work.
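How a simexpression with foreach instructions expands into several columns can be sketched as a Cartesian product over the alternatives. The representation below is an assumption made for illustration: a simexpression body is a list in which a string is a single procedure invocation and a list of strings is a foreach instruction. The column labels (alternatives joined by an underscore) are likewise hypothetical, as the text does not specify how such columns are named.

```python
from itertools import product

def expand_foreach(body):
    """Expand a simexpression body into one instruction sequence per column.

    body: list of items; a str is one procedure invocation, a list of
    str is a foreach over alternative procedures (assumed layout).
    Returns a dict mapping a column label to its list of procedures.
    """
    # Treat a plain invocation as a foreach with one alternative.
    choices = [item if isinstance(item, list) else [item] for item in body]
    columns = {}
    for combination in product(*choices):
        # Label the column by the alternatives chosen in each foreach.
        label = "_".join(
            name for name, item in zip(combination, body) if isinstance(item, list)
        )
        columns[label] = list(combination)
    return columns

# The simexpression s from the example above.
body = [["wildtype", "komutant"], "equilibration", ["mock", "real"], "onehour"]
columns = expand_foreach(body)
```

For the example body this yields the four combinations of genotype and treatment, each of which is then resolved to primary instructions like any other sequence of procedure invocations.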
The sequence of identifiers in the body of a simexpression is resolved to a sequence of primary instructions, as described for procedures in the Procedures section. The sequence of primary operations is applied to a transsys instance of the unmodified candidate program, with all expression levels starting at 0. (Note: Future extensions may provide mechanisms for specifying the initial state of the instance.)
Identifiers in a simexpression body must identify procedures. Invocation of other simexpressions is an error.
Computing the Simulated Matrix
measurementmatrix_def ::= "measurementmatrix" "{" measurementprocess_def measurementcolumns_def "}"
As in the wet lab scenario, the columns of a matrix simulated by SimGenex need to be transformed following the same protocols that were applied to compute the target matrix of empirical data. SimGenex uses the following blocks within the measurementmatrix section to specify such procedures:
• measurementprocess: specifies how individual gene expression values are normalised (offset) and how expression values are transformed to simulate a column in a gene expression matrix (transformation).
• measurementcolumns: specifies the columns in the simulated expression matrix. Columns are computed by subjecting the expression levels in one or more simexpressions to mathematical operations, resulting in a column containing one value for each mapped factor of the candidate program. The idea is that the mathematical operations should be the same as those applied to the raw empirical data that have resulted in the empirical expression matrix.
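As an illustration of such a column computation, the sketch below derives a log-ratio column from two simexpression results. The details are assumptions rather than the defined SimGenex semantics: the offset is added to both values before the ratio is formed, the natural logarithm is used, and each simexpression result is represented as a dict mapping factor names to expression levels. The function name measurement_column is hypothetical.

```python
import math

def measurement_column(numerator, denominator, offset=1e-10):
    """Compute a log-ratio column from two simexpression results.

    numerator, denominator: dicts mapping factor name -> simulated
    expression level (assumed layout).
    offset: small positive constant added to each level so that the
    logarithm is defined when a level is zero (assumed usage).
    """
    column = {}
    for factor, level in numerator.items():
        ratio = (level + offset) / (denominator[factor] + offset)
        column[factor] = math.log(ratio)
    return column

# Hypothetical simulated levels of factor f in a knockout
# and in a wild type simexpression.
knockout_levels = {"f": 0.5}
wildtype_levels = {"f": 2.0}
column = measurement_column(knockout_levels, wildtype_levels)
```

With an offset that is small relative to the expression levels, the result is close to the plain log-ratio, while factors that are not expressed in either simexpression yield a value of zero instead of an error.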
Discrimination Settings and Gene mapping
discriminationsettings_def ::= "discriminationsettings" "{" genemapping_def distancemeasure_def whitelist_def "}"
genemapping_def ::= genemapping_header genemapping_body genemapping_footer
genemapping_header ::= "genemapping"
genemapping_body ::= (factor_def)+
genemapping_footer ::= "endgenemapping"
factor_def ::= "factor" identifier "=" (gene_manufacturer_identifier)+
genemapping specifies how names of genes in the computational model are mapped to names in the target matrix, which may e.g. be IDs designated by the microarray provider. If there are multiple profiles in the empirical matrix that correspond to one factor, these may be specified as a whitespace-separated list. In this case, the average profile is used for comparison. The identifiers on the left-hand side must be names of factors in the candidate program. No factor may be mapped more than once.
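The averaging of multiple empirical profiles that are mapped to one factor can be sketched as follows. The data layout is an assumption made for this example: the gene mapping is a dict from factor name to a list of manufacturer IDs, and the empirical matrix is a dict from manufacturer ID to a list of expression values, one per column. The function name and the expression values are hypothetical; the IDs are those used in the toy specification.

```python
def mapped_target_profiles(genemapping, empirical):
    """Build one target profile per factor from the empirical matrix.

    genemapping: dict mapping factor name -> list of manufacturer IDs.
    empirical: dict mapping manufacturer ID -> profile (list of values,
    one per column). Both layouts are assumed for this sketch.
    Where several IDs map to one factor, the profiles are averaged
    column by column.
    """
    profiles = {}
    for factor, ids in genemapping.items():
        rows = [empirical[i] for i in ids]
        profiles[factor] = [sum(values) / len(values) for values in zip(*rows)]
    return profiles

# Hypothetical empirical values for two IDs mapped to factor f.
genemapping = {"f": ["257221_at", "257222_at"]}
empirical = {
    "257221_at": [1.0, 3.0],
    "257222_at": [3.0, 5.0],
}
profiles = mapped_target_profiles(genemapping, empirical)
```

Using a dict keyed by factor name also reflects the rule that no factor may be mapped more than once.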
distancemeasure_def ::= "distance:" distance_type
distance_type ::= "correlation" | "euclidean" | "sum_squares"
SimGenex allows the specification of a distance measure to compare the simulated matrix to a target matrix which can e.g. be used to discriminate the best GRN model from among a number of candidates. In addition, as in the real scenario, SimGenex allows the specification of a mapping scheme.
whitelist_def ::= whitelist_header whitelist_body whitelist_footer
whitelist_header ::= "whitelistdefs"
whitelist_body ::= whitelist_factor_def whitelist_gene_def
whitelist_footer ::= "endwhitelistdefs"
whitelist_factor_def ::= "factor" ":" (identifier)+
whitelist_gene_def ::= "gene" ":" (identifier)+
whitelist specifies which factors or genes a discriminator may adjust. This feature is useful where parts of the GRN model are unknown, and the discriminator should therefore explore various alternatives for the unknown parts. As an example, where numerical parameters are unknown, these can be set by numerical optimisation. Whitelist terms are those factors or genes in a transsys program whose initial conditions must not be changed by the optimiser when performing model selection. There are a number of reasons why this option is available; one of them is the use of mathematical expressions rather than single values to specify under which conditions a biochemical reaction might take place.
Terminal Tokens (Lexical Structure)
SimGenex ::= "SimGenex-0.1" NEWLINE
NEWLINE
gene_manufacturer_identifier ::= doublequote character_but_not_doublequote+ doublequote
realnumber ::= digit_sequence "." unsigned_digit_sequence+ scale_factor+ | digit_sequence scale_factor
digit_sequence ::= sign+ unsigned_digit_sequence
unsigned_digit_sequence ::= digit+
scale_factor ::= ("E" | "e") digit_sequence
sign ::= "+" | "-"
digit ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
identifier ::= (letter | "_") (letter | digit | "_")+
letter ::= "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
doublequote ::= the doublequote character (chr(34))
Future Perspectives
Optimisation by Reusing Prefix Instruction Sequences
Simexpressions reduce to sequences of primary instructions. If two simexpressions share a prefix (i.e. they start with the same sequence of instructions) and the candidate program is deterministic (i.e. it does not use any of the random number generation functions provided by transsys), the two simexpressions can be computed by first computing the prefix sequence and then using the result as a starting point for computing both the first and the second simexpression. Depending on the execution time of the prefix, this can be a significant optimisation (e.g. where many time steps are used for equilibration of a transsys instance). As a special case, one simexpression may be a prefix of another. When a set of simexpressions simulates a time series, they will form nested prefixes. To fully exploit the optimisation potential, especially for time series as described above, an optimiser must be able to break up a runtimesteps n instruction into the commands runtimesteps n1 and runtimesteps n2, where n1 + n2 = n, as appropriate. It is important to notice that the validity of this optimisation depends on the assumption that the candidate program is deterministic. Because this assumption may not hold, the option of simulating all simexpressions independently must be retained. If the transsys program object provided a method to detect whether the program is deterministic, simulators could use that method to automatically activate the prefix optimisation as appropriate.
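The prefix detection described above, including the splitting of a runtimesteps instruction, can be sketched as follows. This is an illustrative sketch of the proposed optimisation, not existing transsys functionality. Instruction sequences are assumed to be lists of tuples such as ("runtimesteps", 100), and the function name shared_prefix is hypothetical. The sketch does not merge consecutive runtimesteps instructions within one sequence, and it is valid only for deterministic candidate programs.

```python
def shared_prefix(a, b):
    """Split two primary instruction sequences into a common prefix
    and the two remainders.

    a, b: lists of tuples, e.g. ("knockout", "g") or ("runtimesteps", 100)
    (assumed representation). Where the sequences first differ in two
    runtimesteps instructions, the longer one is split so that the
    shorter number of steps becomes part of the prefix.
    Returns (prefix, rest_a, rest_b).
    """
    prefix = []
    i = 0
    while i < len(a) and i < len(b):
        if a[i] == b[i]:
            prefix.append(a[i])
            i += 1
            continue
        if a[i][0] == "runtimesteps" and b[i][0] == "runtimesteps":
            # Split runtimesteps n into n1 (shared) and n2 (remainder).
            n1 = min(a[i][1], b[i][1])
            prefix.append(("runtimesteps", n1))
            rest_a = [("runtimesteps", a[i][1] - n1)] if a[i][1] > n1 else []
            rest_b = [("runtimesteps", b[i][1] - n1)] if b[i][1] > n1 else []
            return prefix, rest_a + a[i + 1:], rest_b + b[i + 1:]
        break
    return prefix, a[i:], b[i:]

# Two time points of a hypothetical time series after a knockout.
early = [("knockout", "g"), ("runtimesteps", 100)]
late = [("knockout", "g"), ("runtimesteps", 150), ("treatment", "f", 1.0)]
prefix, rest_early, rest_late = shared_prefix(early, late)
```

A simulator could run the prefix once, copy the resulting transsys instance, and continue from that copy with each remainder.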
Toy problem
Toy transsys program
transsys b
{
  factor f
  {
    decay: 0.1;
    diffusibility: 0.1;
  }
  gene g
  {
    promoter
    {
      f: activate(1, 10);
    }
    product
    {
      default: f;
    }
  }
}
Toy Simgenex spec
ObjectiveSpecification-0.1
globalsettingdefs
  transformation: log
  distance: correlation
  offset: 1e-10
endglobalsettingdefs
whitelistdefs
  factor: "f"
  gene: "g"
endwhitelistdefs
genemapping
  factor f = "257221_at" "257222_at" "257223_at"
endgenemapping
procedure equilibration
  runtimesteps: 100
endprocedure
procedure ko_g
  knockout: g
endprocedure
simexpression wt
  equilibration
endsimexpression
simexpression g
  ko_g
  equilibration
endsimexpression
arraymapping
  array gko : g / wt
endarraymapping
Toy empirical expression data
     wt            g
f    3.314894e-56  6.537199e-01