Tools for validating data in EDC

5th September 2008 No Comments


I have had the fortune to work with a number of different EDC products over the years.  In each case, they implemented features that allowed a study developer to create validate rules associated with the data captured.  Some were good and some were bad. In almost all cases, it was difficult to appreciate the tools shortcomings until a solid amount of work was carried out utilizing them.  They say the devil is in the detail – with EDC edit checking tools, this certainly proves to be the case.

I would like to discuss (or rather ramble on) about the history of validation rule tools in EDC – at least from the mid 90’s.


1st Generation Tools – SQL Base

The early tools, primarily before EDC, used either SQL, or an pre-compiled version of SQL together with a scripting language as a syntax for edit checking.  This approach had the advantage that the data access was standardized to a degree with SQL.  The disadvantage was that the SQL worked directly against an underlying database. The developer had to understand and operate against the underlying database in order to correctly write edit checks.

PL/SQL was a common language to leverage. Tools such as Clintrial and DLB Recorder (now eResearch Technologies – eXpert Data Management) relied heavily on the logic and data constructs provided.

The downside to (PL/)SQL based edit check syntaxes were that they often assumed that the underlying database was a relational database that matched the structure of screens that the logic was often associated with.  The product had to therefore be a relational database building tool – good on the surface, but not good when it came to meeting the needs and flexibility of EDC.

2nd Generation Tools – Expression Builders

In the early to mid 1990’s, a new set of tools arrived that generally attempted to take away much of the earlier complexity of 1st Generation tools, and, that took advantage of the fact that the underlying data structures were not relational.

The first set of 2nd generation tools tackled the issue of logical data checking through the provision of expression building tools. Initially, these were restrictive with the only means to build the expressions being through a thick client front-end with no free-format expression entry possible.  This made the tool development somewhat easier, and, the corresponding expression parsing simple.  The downside to the approach though was the  it was not possible to define all the required rules in the provided expression builders.

3rd Generation Tools – Hybrid Logic Builders

Expression builders alone were seen as being too restrictive in the development of a complete set of edit checks for a study.  Also, Power users felt constrained. The fallback position for implementations were that edit checking had to be performed at the back-end with SAS or SQL queries. 

To work around these limitations, a 3rd generation of tools were produced that provided a combination of expression building as well as direct syntax entry.  The direct syntax entry was either provided by allowing the developer to edit and extend the code that was derived through the expression builder, or, it was provided as an alternative to expression built code.

The added flexibility of the direct syntax provided a mechanism allowing studies to tackle close to 100% of all edit checking associated with a protocol.  Back end data checking was limited to rules that could not be determined prior to study rollout.

One limitation of the syntax approach is the issue of testing.  With a point and click configuration approach, the scope of testing can be controlled, and, to a degree even automated.  With a open language based syntax, the potential combinations that need to be tested for are higher.  In fact, the testing that may be applied here is equivalent to the sort of testing that is required when carrying out full system validation.  I will be discussing the methods of, and challenges in, testing studies in a later post.

Key Factors in good Validation Rule Syntax

Absolute Data References

This topic primarily applies to script based systems rather than expression builders.  Expression Builders typically present the metadata that exist in drop down lists (Visits, Forms or Field Names). Referencing data fields in a free format expression can be somewhat more challenging.  Take the following example;


This is an example of an absolute reference to a data field. The 1st Inclusion question on the Inclusion/Exclusion Form in Visit 1. But why the brackets?  Well, the metadata has spaces and a / in the names. To ensure the interpreter doesn’t think the space or / separates this operand from an operator the brackets bound them.   A way around this of course is to use names without spaces. CDISC offers this with Object Identifiers or OID’s. They don’t have spaces, so, the issue does not arise.  However, the OIDS can be less than friendly when it comes to making an expression human readable. Anyway, OID’s or equivalents are standard practice except where the number of elements in an expression are always fixed.  Even with OID’s though, the length of these logical expressions can be horrific.

Programming languages have simplified the issue of dealing with long qualifications by providing variables (or aliases). You define the alias at the start, and then refer to the simple alias name through the expression.  So for the above – you could say;


Then, to compare values, it might be

if INC1= ‘No’ then Raise Query "XXXXXX"

Wildcard Data References

It is common for the same eCRF pages to be dropped into multiple visits.  In this circumstance, the visit references that may exist in attached edit checks need to change.  The way this is usually achieved is through wildcarding.

If the above Inclusion / Exclusion check appeared in say Visit 1 and Visit 2 (hypothetically) then the Visit reference would need to be wildcarded to ensure it does not refer to Visit 1 when the form is dropped into Visit 2. The tool would replace the wildcard with the appropriate reference based on where the form appears. You can infact have 3 types of reference – Absolute as above, Relative or Any.

‘Any’ or ‘Current’ style references are often presented with a ‘*’ a ‘$’ or some other special symbol. This designates that the element is replaced

Relative references are usually derived based on the original source of the edit check. So, if the edit check fired in Visit 2, and the relative reference stated -1 – or ‘previous’, then this might indicate the current visit -1.

Wildcarding causes some difficulties though when it comes to reuse.  Testing is only predictable when the logic is applied within a study.  If you take the form out of one study and drop it into another, it is possible that with a different visit structure, you may obtain different – potentially undesirable – results.


Data References do have an impact on a number of areas. Well designed data referencing ensures that the maximum amount of re-use can be achieved from study to study as well as within a study. Also, the readability of validation rules is important.  If the Protocol Designer or Data Manager cannot understand the rule that is presented from an EDC system, how can it be assured that the rule is correct?


Other Considerations – Actions

The boolean true/false results of an edit check expression is only one side of the expression.  The other side is the action that place either as the result of a true, or a false result.  Systems designs seem to fall evenly on one of two approaches.  Either the syntax allows one or more actions some of which are to create a Discrepancy (or Query). The second type is where the Discrepancy is the only, and therefore the assumed action.   Clinical Data Management systems often went with just the Logic –> Query approach as the need for actions outside of Discrepancies was limited.

With most modern EDC systems, the edit check language provides the means to carry out one or more actions as the result of a boolean. Additional actions that might be support are things like status changing, assigning values, or even activating or inactivating metadata elements such as forms and fields. Some systems separate the query creation from other actions. The reason behind this is normally to help support protocol updates.  If a tool mixes queries in with other activities, it can be very difficult to deal with the situation where a new protocol definition needs to be applied to existing data.  For queries, it is easy – re-run the logic and if a query is created that was not previously created – then add it.  For other actions – a bit more tricky – for example, if you had set up your system to send out emails if an Serious Adverse Event occurred, then, you wouldn’t want the emails to be re-submitted when you applied a protocol update.


Other Considerations – Batch Processing

This is an interesting one. Anyone that has sold EDC systems will have been asked this questions.

Does your system support Batch execution of validation rules?

With very limited exceptions, the answer was always no.  I could argue that in some cases, due to pressure from sales, batch execution was added to the detriment of the EDC products. EDC systems are designed along the principle that bad data is corrected immediately by presenting errors as soon as possible to the data entry person.

The only argument for batch processing that I have seen applied for a positive reason is in the area of performance.  An EDC system that suffers poor performance may resort to batch execution to improve page response times. However, this is often unsatisfactory – CDM systems run across data typically using the efficiencies that single SQL Select statements can bring.  EDC systems often operate on a datapoint by datapoint basis with only limited cache optimisation possible. Batch running EDC edit checks can be tortuously slow. Presenting queries after the user has left the page is also sub-optimal.


The Future?


A gap exist right now in the development of standard, (i.e. CDISC) where they pertain to rules applied to data.   

Why do we need standards for Rules? I hear you say.  

Well, from an EDC study build perspective, the associated edit checks is often the largest single work effort in preparing the study. In fact, in comparison to preparing forms and folders, the edit check together with testing can often be 3-4 times more work. So, when attempting to leverage standards such as ODM, the relative savings that can be achieved related to automating the study build are limited.  

The second reason behind the need for standards around rules is the potential knowledge that might be associated with them.   Imagine you have access to a warehouse of clinical data.  In that warehouse you have 100’s of instance of a particular set of data – lets say vital signs.  Can you use all the data? What if some of the data had restrictions that other data did not have?

Rules were originally applied to data in order to determine cleanliness within a study. These rules may also have determined whether the data reached the warehouse.  Inappropriate data may have been filtered out.   By taking away the rules in the warehouse, you take away a proportion of the context behind the data.  If you take the rules – that form part of the metadata – can you really utilize the data in an unbiased way?    Maybe Statisticians will say this doesn’t happen, or, the impact is negligible… I am happy to receive comments.


As mentioned in a recent posting on eClinical Opinion there are many input requirements for validation logic.  If you are thinking proprietary, then you will want a syntax that is as close to the sort of language that is used in protocol definitions as possible, will at the same time as concise as is necessary to assure re-use and non ambiguity.  The point and click builders will not go away – they can be useful.  At the same time though, for power users, you need to have high end editor features.  I believe the strongest vendors will create editors that are clinical trial business object aware. When building syntax, they will know for instance that a field may be qualified by a sequence no. or form.  They will understand that given two dates, an Age can be derived.

Device independency may become significant once again. In the past, provided you ran on a browser – things were fine. However, who wants to enter patient diary data on a PDA Browser.  The iPhone is a perfect example. The iPhone Apps are not browser apps. They make use of the internet, but the leverage another UI. By taking the validation rules away from the front end, the actual device used to offer up the questions will not mater. The same rules apply regardless of the capture medium.

Leave a Comment