1 Module 1: CDISC Standards Overview
Clinical Data Science for Pharma: CDISC From Scratch
1.1 What is CDISC and Why Does It Exist?
Imagine 50 hospitals across 10 countries collecting data for the same clinical trial. Each hospital uses a different database, different variable names, different date formats, different units. At the end of the trial, you have 50 different data structures that all mean the same thing.
The FDA receives hundreds of drug submissions every year. Without a common format, reviewing the data would require learning a new data structure for every single submission. That is not scalable.
CDISC (Clinical Data Interchange Standards Consortium) solves this. It is a non-profit organisation that defines exactly how clinical trial data must be structured before submission to regulators.
Since 2016, the FDA has required CDISC-compliant data (SDTM + ADaM) in new drug and biologics submissions; the requirement applies to studies that started after 17 December 2016. The EMA (Europe) follows similar guidance.
This means: if you want to submit a drug in the US or Europe, you must speak CDISC. And if you work at a pharma company, you speak CDISC every day.
1.2 The CDISC Family of Standards
CDISC is not one standard; it is a family. Here are the ones you will encounter most:
| Standard | Full Name | What it covers |
|---|---|---|
| CDASH | Clinical Data Acquisition Standards Harmonization | How to design CRFs (data collection forms) |
| SDTM | Study Data Tabulation Model | How to structure collected trial data |
| ADaM | Analysis Data Model | How to structure analysis-ready data |
| Define-XML | Define-XML (not an acronym) | Machine-readable data dictionary for submissions |
| SEND | Standard for Exchange of Nonclinical Data | Pre-clinical (animal) studies; SENDIG is its implementation guide |
In this course we focus on SDTM and ADaM. These are the two standards that statistical programmers and biostatisticians work with daily.
1.3 SDTM: Study Data Tabulation Model
1.3.1 What SDTM is
SDTM defines how collected clinical trial data is organised for submission. Think of it as a standardised filing system.
All data is organised into domains; each domain is a dataset covering one type of data. Every domain has a 2-letter code.
1.3.2 The most important SDTM domains
| Domain | Code | What it contains |
|---|---|---|
| Demographics | DM | Age, sex, race, country, treatment arm |
| Adverse Events | AE | Side effects reported during the trial |
| Laboratory Tests | LB | Blood tests, urine tests, lab results |
| Vital Signs | VS | Blood pressure, heart rate, weight, height |
| Concomitant Medications | CM | Other drugs taken alongside trial drug |
| Medical History | MH | Pre-existing conditions |
| Exposure | EX | What drug was given, when, at what dose |
| Disposition | DS | Who completed or discontinued and why |
| Findings About | FA | Qualitative results linked to other domains |
For our GLPX-001 trial, the most important domains will be DM, AE, LB, VS, EX, and DS.
1.3.3 SDTM structure rules
Every SDTM domain must follow strict rules. The key ones:
1. Every dataset has required variables
For example, nearly every SDTM domain must have:
| Variable | Meaning |
|---|---|
| STUDYID | Study identifier (e.g., “GLPX-001”) |
| DOMAIN | 2-letter domain code (e.g., “DM”, “AE”) |
| USUBJID | Unique subject identifier: the key that links all domains |
| --SEQ | Sequence number (e.g., AESEQ, LBSEQ); not used in DM, which has one row per subject |
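A rule like this is easy to check mechanically. The sketch below uses plain Python (standard library only) to report which identifier variables are missing from a record. The function name `missing_required` and the one-dict-per-row representation are assumptions made for illustration; they are not part of any CDISC tool.

```python
# Identifier variables from the table above. "--SEQ" stands for the
# domain-prefixed sequence variable (AESEQ, LBSEQ, ...).
REQUIRED = ("STUDYID", "DOMAIN", "USUBJID")


def missing_required(row: dict, domain: str) -> list[str]:
    """Return the required variable names that are absent or empty in one row."""
    needed = list(REQUIRED)
    if domain != "DM":  # DM has one row per subject and no DMSEQ
        needed.append(f"{domain}SEQ")
    return [name for name in needed if row.get(name) in (None, "")]


ae_row = {"STUDYID": "GLPX-001", "DOMAIN": "AE", "USUBJID": "GLPX-001-001-0001"}
print(missing_required(ae_row, "AE"))  # ['AESEQ']
```

In practice this kind of check is done by dedicated validation software; the point here is only that the rule is precise enough to automate.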
2. Variable names follow naming conventions
SDTM variable names are constructed from a prefix (the domain code) and a suffix that describes the content. For example in the AE domain:
| Variable | Meaning |
|---|---|
| AEDECOD | Adverse event decoded term (standardised MedDRA term) |
| AESTDTC | Adverse event start date (ISO 8601 format) |
| AEENDTC | Adverse event end date |
| AESEV | Severity (MILD, MODERATE, SEVERE) |
| AESER | Serious adverse event flag (Y/N) |
| AEREL | Relationship to study drug |
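The prefix-plus-suffix pattern can be written down in a couple of lines. This is a minimal sketch; the helper name `sdtm_variable` is hypothetical, and it only concatenates strings without checking that the result is a variable the standard actually defines.

```python
def sdtm_variable(domain: str, suffix: str) -> str:
    """Build a domain-specific variable name, e.g. ("AE", "STDTC") -> "AESTDTC"."""
    if len(domain) != 2:
        raise ValueError("SDTM domain codes are two characters long")
    return f"{domain.upper()}{suffix.upper()}"


# The same suffix carries the same meaning in every domain that uses it.
for domain in ("AE", "CM", "EX"):
    print(sdtm_variable(domain, "STDTC"))
```

This is also what the `--` notation means: `--SEQ` is shorthand for "the domain code followed by SEQ".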
3. Dates are always ISO 8601
All dates in SDTM use the ISO 8601 format YYYY-MM-DD, optionally followed by a time (for example 2024-01-15T08:30). Partial dates are allowed: 2024-03 means March 2024 when the exact day is unknown.
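To see why partial dates need care, here is a small sketch that splits a date string into year, month, and day and leaves the unknown parts empty. The function name `parse_partial_date` is an assumption for this example; it handles date-only values and does not validate calendar ranges.

```python
import re

# YYYY, YYYY-MM or YYYY-MM-DD; time parts are out of scope for this sketch.
_ISO_DATE = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?$")


def parse_partial_date(dtc: str):
    """Split an ISO 8601 date string into (year, month, day); unknown parts are None."""
    match = _ISO_DATE.match(dtc)
    if match is None:
        raise ValueError(f"not an ISO 8601 date: {dtc!r}")
    year, month, day = (int(part) if part else None for part in match.groups())
    return year, month, day


print(parse_partial_date("2024-01-15"))  # (2024, 1, 15)
print(parse_partial_date("2024-03"))     # (2024, 3, None)
```

Note that the missing day stays missing. Filling it in (for example with the first of the month) is an imputation decision that belongs in the analysis datasets, not in SDTM.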
4. Controlled terminology
Many variables have a fixed list of allowed values. You cannot invent your own. For example:
- SEX: M, F, U, UNDIFFERENTIATED
- RACE: WHITE, BLACK OR AFRICAN AMERICAN, ASIAN, etc.
- AESEV: MILD, MODERATE, SEVERE
These come from the CDISC Controlled Terminology list, published and updated regularly at evs.nci.nih.gov/ftp1/CDISC/
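A controlled terminology check boils down to a set-membership test. The sketch below hard-codes a tiny excerpt of the code lists mentioned above; the names `CODELISTS` and `ct_violations` are hypothetical, and a real check would load the full, versioned terminology files instead.

```python
# A tiny excerpt of code lists, copied from the examples above.
CODELISTS = {
    "SEX": {"M", "F", "U", "UNDIFFERENTIATED"},
    "AESEV": {"MILD", "MODERATE", "SEVERE"},
}


def ct_violations(rows, variable):
    """Return (row index, value) pairs whose value is not in the code list."""
    allowed = CODELISTS[variable]
    return [
        (i, row[variable])
        for i, row in enumerate(rows)
        if row[variable] not in allowed
    ]


dm = [{"SEX": "M"}, {"SEX": "Female"}, {"SEX": "F"}]
print(ct_violations(dm, "SEX"))  # [(1, 'Female')]
```

The comparison is deliberately exact: "Female" and "f" are both violations, because the standard fixes the spelling and the case.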
1.3.4 What SDTM looks like: DM domain example
Here is what a Demographics (DM) domain looks like for 5 subjects from our GLPX-001 trial:
| STUDYID | DOMAIN | USUBJID | AGE | SEX | RACE | ARMCD | ARM |
|---|---|---|---|---|---|---|---|
| GLPX-001 | DM | GLPX-001-001-0001 | 54 | M | WHITE | DRUG1 | Drug 1mg |
| GLPX-001 | DM | GLPX-001-001-0002 | 62 | F | WHITE | DRUG2 | Drug 2mg |
| GLPX-001 | DM | GLPX-001-001-0003 | 48 | F | ASIAN | PLACEBO | Placebo |
| GLPX-001 | DM | GLPX-001-001-0004 | 71 | M | BLACK OR AFRICAN AMERICAN | DRUG1 | Drug 1mg |
| GLPX-001 | DM | GLPX-001-001-0005 | 39 | F | WHITE | DRUG2 | Drug 2mg |
Notice:
- USUBJID is globally unique; it encodes study + site + subject
- ARMCD is the short code, ARM is the full label
- Every row is one subject (DM is a one-row-per-subject domain)
1.3.5 What SDTM looks like: LB domain example
Laboratory data is different; one subject has many lab results over time. So LB has one row per test per timepoint per subject:
| STUDYID | DOMAIN | USUBJID | LBSEQ | LBTESTCD | LBTEST | LBORRES | LBORRESU | LBDTC |
|---|---|---|---|---|---|---|---|---|
| GLPX-001 | LB | GLPX-001-001-0001 | 1 | HBA1C | Hemoglobin A1C | 8.2 | % | 2024-01-15 |
| GLPX-001 | LB | GLPX-001-001-0001 | 2 | GLUC | Glucose | 9.1 | mmol/L | 2024-01-15 |
| GLPX-001 | LB | GLPX-001-001-0001 | 3 | HBA1C | Hemoglobin A1C | 7.4 | % | 2024-04-15 |
| GLPX-001 | LB | GLPX-001-001-0002 | 1 | HBA1C | Hemoglobin A1C | 9.0 | % | 2024-01-15 |
Notice:
- LBSEQ is the sequence number: unique per subject
- LBTESTCD is the short code, LBTEST is the full name
- LBORRES / LBORRESU = original result and unit as reported
- LBDTC = date of the lab test in ISO 8601 format
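Because LBSEQ only has to be unique within a subject, the natural key of an LB record is the pair (USUBJID, LBSEQ). The following sketch checks that property on the four example rows; `duplicate_keys` is a hypothetical helper name, and the rows are reduced to the two variables that matter here.

```python
from collections import Counter

lb = [
    {"USUBJID": "GLPX-001-001-0001", "LBSEQ": 1},
    {"USUBJID": "GLPX-001-001-0001", "LBSEQ": 2},
    {"USUBJID": "GLPX-001-001-0001", "LBSEQ": 3},
    {"USUBJID": "GLPX-001-001-0002", "LBSEQ": 1},
]


def duplicate_keys(rows, seq_var):
    """Return (USUBJID, sequence) pairs that occur more than once."""
    counts = Counter((row["USUBJID"], row[seq_var]) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


print(duplicate_keys(lb, "LBSEQ"))  # [] because LBSEQ restarts at 1 for each subject
```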
1.4 ADaM: Analysis Data Model
1.4.1 What ADaM is
SDTM is for tabulating what was collected. ADaM is for analysing it.
ADaM datasets are derived from SDTM and are designed specifically to support statistical analyses and the production of TLFs (tables, listings, and figures). They contain derived variables that do not exist in SDTM, such as:
- Baseline values
- Change from baseline
- Analysis flags (which visits/records to include in analysis)
- Imputed values
1.4.2 The traceability principle
The most important concept in ADaM is traceability: every derived value in ADaM must be traceable back to SDTM. You must always be able to answer: where did this number come from?
This is why regulators trust the analysis: they can follow the chain from raw data → SDTM → ADaM → TLF.
1.4.3 The most important ADaM datasets
| Dataset | What it contains |
|---|---|
| ADSL | Subject-Level Analysis Dataset: one row per subject, all key baseline and treatment variables |
| ADAE | Adverse Events: analysis-ready AE data |
| ADLB | Laboratory Analysis Dataset: with baseline, change from baseline |
| ADEFF | Efficacy dataset (trial-specific) |
| ADTTE | Time-to-Event dataset (survival analyses) |
ADSL is the most important. Almost every other ADaM dataset merges with ADSL to pick up subject-level variables. You always build ADSL first.
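The merge with ADSL is a lookup by USUBJID. Here is a minimal sketch, assuming each dataset is a list of dicts; the helper name `add_adsl_vars` and the cut-down example rows are illustrative only.

```python
adsl = [
    {"USUBJID": "GLPX-001-001-0001", "TRT01P": "Drug 1mg", "SAFFL": "Y"},
    {"USUBJID": "GLPX-001-001-0002", "TRT01P": "Drug 2mg", "SAFFL": "Y"},
]
records = [
    {"USUBJID": "GLPX-001-001-0001", "PARAMCD": "HBA1C", "AVAL": 8.2},
    {"USUBJID": "GLPX-001-001-0001", "PARAMCD": "HBA1C", "AVAL": 7.1},
]


def add_adsl_vars(rows, adsl_rows, keep):
    """Copy selected ADSL variables onto each row, matched by USUBJID."""
    by_subject = {row["USUBJID"]: row for row in adsl_rows}
    merged = []
    for row in rows:
        subject = by_subject[row["USUBJID"]]  # KeyError if the subject is not in ADSL
        merged.append({**row, **{var: subject[var] for var in keep}})
    return merged


for row in add_adsl_vars(records, adsl, keep=("TRT01P", "SAFFL")):
    print(row)
```

Failing loudly on a subject that is missing from ADSL is a deliberate choice in this sketch: a record without subject-level variables would otherwise slip silently into an analysis.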
1.4.4 What ADaM looks like: ADSL example
Here is a partial ADSL for 3 subjects:
| USUBJID | AGE | SEX | RACE | BMI | TRT01P | RANDDT | HBA1CBL | SAFFL | ITTFL |
|---|---|---|---|---|---|---|---|---|---|
| GLPX-001-001-0001 | 54 | M | WHITE | 31.2 | Drug 1mg | 2024-01-15 | 8.2 | Y | Y |
| GLPX-001-001-0002 | 62 | F | WHITE | 28.9 | Drug 2mg | 2024-01-15 | 9.0 | Y | Y |
| GLPX-001-001-0003 | 48 | F | ASIAN | 33.5 | Placebo | 2024-01-16 | 7.8 | Y | Y |
New variables that did not exist in SDTM:
| Variable | Meaning |
|---|---|
| TRT01P | Planned treatment for Period 1 |
| RANDDT | Randomisation date |
| HBA1CBL | HbA1c at baseline (derived from LB) |
| SAFFL | Safety population flag (Y/N) |
| ITTFL | Intent-to-treat population flag (Y/N) |
1.4.5 What ADaM looks like: ADLB example
| USUBJID | PARAMCD | PARAM | AVISIT | ADT | AVAL | BASE | CHG | ANL01FL |
|---|---|---|---|---|---|---|---|---|
| GLPX-001-001-0001 | HBA1C | HbA1c (%) | Baseline | 2024-01-15 | 8.2 | 8.2 | 0 | Y |
| GLPX-001-001-0001 | HBA1C | HbA1c (%) | Week 26 | 2024-07-15 | 7.1 | 8.2 | -1.1 | Y |
| GLPX-001-001-0001 | HBA1C | HbA1c (%) | Week 52 | 2025-01-15 | 6.8 | 8.2 | -1.4 | Y |
Key ADaM variables:
| Variable | Meaning |
|---|---|
| PARAMCD | Parameter code (what is being measured) |
| PARAM | Parameter label |
| AVISIT | Analysis visit label |
| ADT | Analysis date |
| AVAL | Analysis value (the actual number used in analysis) |
| BASE | Baseline value |
| CHG | Change from baseline (AVAL − BASE) |
| ANL01FL | Analysis flag: Y means include this record in the primary analysis |
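The derivation of BASE and CHG for the three HbA1c rows above can be sketched in a few lines. This assumes the simplest possible baseline rule (the single row labelled "Baseline"); real baseline definitions come from the SAP and are usually more involved. The function name `add_base_and_chg` is hypothetical.

```python
def add_base_and_chg(rows):
    """Add BASE and CHG to the rows of one subject and one parameter.

    Assumes exactly one row has AVISIT == "Baseline".
    """
    base = next(row["AVAL"] for row in rows if row["AVISIT"] == "Baseline")
    # CHG is rounded to one decimal only to hide floating-point noise.
    return [{**row, "BASE": base, "CHG": round(row["AVAL"] - base, 1)} for row in rows]


hba1c = [
    {"AVISIT": "Baseline", "AVAL": 8.2},
    {"AVISIT": "Week 26", "AVAL": 7.1},
    {"AVISIT": "Week 52", "AVAL": 6.8},
]
for row in add_base_and_chg(hba1c):
    print(row["AVISIT"], row["BASE"], row["CHG"])
```

The CHG values come out as 0.0, -1.1 and -1.4, matching the ADLB example.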
1.5 TLFs: Tables, Listings, and Figures
TLFs are the outputs: what the statistician and medical writer use to write the Clinical Study Report. They are produced from ADaM datasets.
1.5.1 Tables
Structured summaries. Examples:
- Table 14.1.1: Summary of Demographics and Baseline Characteristics
- Table 14.2.1: Primary Efficacy: Change from Baseline in HbA1c at Week 52
- Table 14.3.1: Overview of Adverse Events
A demographics table looks like this:
Table 14.1.1 Summary of Demographics (Safety Population)
| | Drug 1mg (N=300) | Drug 2mg (N=300) | Placebo (N=300) | Total (N=900) |
|---|---|---|---|---|
| Age (years) | | | | |
| Mean (SD) | 54.2 (9.1) | 53.8 (8.7) | 54.5 (9.3) | 54.2 (9.0) |
| Median | 54.0 | 53.0 | 55.0 | 54.0 |
| Min, Max | 28, 75 | 31, 74 | 29, 75 | 28, 75 |
| Sex, n (%) | | | | |
| Male | 152 (50.7) | 148 (49.3) | 155 (51.7) | 455 (50.6) |
| Female | 148 (49.3) | 152 (50.7) | 145 (48.3) | 445 (49.4) |
| HbA1c at Baseline (%) | | | | |
| Mean (SD) | 8.4 (0.7) | 8.3 (0.8) | 8.4 (0.7) | 8.4 (0.7) |
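The cells of a table like this are simple summary statistics formatted as text. The sketch below shows the two most common cell types, using Python's `statistics` module. The helper names `mean_sd` and `n_pct` are assumptions, and the ages are the five DM example subjects from earlier, not the 900 subjects summarised in the table.

```python
from statistics import mean, stdev


def mean_sd(values):
    """Format a continuous variable as 'Mean (SD)' with one decimal place."""
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    return f"{mean(values):.1f} ({stdev(values):.1f})"


def n_pct(count, total):
    """Format a categorical count as 'n (%)' with one decimal place."""
    return f"{count} ({100 * count / total:.1f})"


ages = [54, 62, 48, 71, 39]  # the five DM subjects shown earlier
print(mean_sd(ages))    # 54.8 (12.4)
print(n_pct(152, 300))  # 152 (50.7)
```

The second call reproduces the "Male" cell of the Drug 1mg column: 152 of 300 subjects is 50.7%.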
1.5.2 Listings
Raw data printed in a structured format; used for data verification. Example: Listing of all serious adverse events, one row per event.
1.5.3 Figures
Plots. Examples:
- Mean HbA1c over time by treatment arm (line plot)
- Forest plot of subgroup analyses
- Kaplan-Meier survival curves
1.6 How SDTM, ADaM and TLFs Connect
Let us trace our primary endpoint, change from baseline in HbA1c at Week 52, through the entire pipeline:
EDC (raw)
└── Subject 001 had HbA1c = 8.2% on 2024-01-15
└── Subject 001 had HbA1c = 6.8% on 2025-01-15
│
▼
SDTM: LB domain
└── USUBJID=GLPX-001-001-0001, LBTESTCD=HBA1C, LBORRES=8.2, LBDTC=2024-01-15
└── USUBJID=GLPX-001-001-0001, LBTESTCD=HBA1C, LBORRES=6.8, LBDTC=2025-01-15
│
▼
ADaM: ADLB
└── USUBJID=GLPX-001-001-0001, PARAMCD=HBA1C, AVISIT=Baseline, AVAL=8.2, BASE=8.2, CHG=0
└── USUBJID=GLPX-001-001-0001, PARAMCD=HBA1C, AVISIT=Week 52, AVAL=6.8, BASE=8.2, CHG=-1.4, ANL01FL=Y
│
▼
TLF: Table 14.2.1
└── Drug 1mg: Mean change from baseline in HbA1c at Week 52 = -1.4% (SD 0.6)
└── Placebo: Mean change from baseline in HbA1c at Week 52 = -0.3% (SD 0.5)
└── Treatment difference: -1.1% (95% CI: -1.3, -0.9), p < 0.001
This chain, from a single lab value entered by a nurse at a hospital to a p-value in a regulatory submission, is what you are learning to build.
1.7 Quick Reference: SDTM vs ADaM
| Feature | SDTM | ADaM |
|---|---|---|
| Purpose | Tabulate collected data | Support analysis |
| Source | Raw EDC data | SDTM datasets |
| Structure | One domain per data type | One dataset per analysis need |
| Derived variables | Minimal | Many (BASE, CHG, flags) |
| Required by FDA | Yes | Yes |
| Key datasets | DM, AE, LB, VS, EX | ADSL, ADAE, ADLB |
| Row structure | One row per observation (one row per subject in DM) | One row per subject in ADSL; one row per subject per parameter per visit in datasets such as ADLB |
1.8 Module 1 Summary
1.9 Your Tasks Before Module 2
Answers to the checklist questions (read them only after you have thought about the questions yourself):
LBORRES is the original result exactly as reported (a character string like “8.2”). AVAL is the numeric analysis value after cleaning, standardising units, and applying any imputation rules.
BASE does not exist in SDTM because baseline is an analytical concept: it depends on how you define it in the SAP. Different analyses can have different baselines.
For GLPX-001 you would need at minimum: DM, AE, LB, VS, EX, DS, CM, MH.
1.10 What’s Next
In Module 2 we get hands-on with SDTM. We will:
- Look at the simulated raw GLPX-001 data
- Map it to SDTM domains following the rules
- Write the actual SAS and R code to produce a DM domain and an LB domain
- Understand what a Study Data Reviewer’s Guide (SDRG) is
This course is open source and free forever. Found an error or want to contribute? Open an issue or pull request on GitHub.