1 Module 1: CDISC Standards Overview

Clinical Data Science for Pharma: CDISC From Scratch

Author

Sufyan Suleman

Published

May 17, 2026

1.1 What is CDISC and Why Does It Exist?

Imagine 50 hospitals across 10 countries collecting data for the same clinical trial. Each hospital uses a different database, different variable names, different date formats, different units. At the end of the trial, you have 50 different data structures that all mean the same thing.

The FDA receives hundreds of drug submissions every year. Without a common format, reviewing the data would require learning a new data structure for every single submission. That is not scalable.

CDISC (Clinical Data Interchange Standards Consortium) solves this. It is a non-profit organisation that defines exactly how clinical trial data must be structured before submission to regulators.

Since December 2016, the FDA has required CDISC-compliant data (SDTM + ADaM) in new drug and biologics submissions, for studies that started after that date. The EMA (Europe) follows similar guidance.

This means: if you want to submit a drug in the US or Europe, you must speak CDISC. And if you work at a pharma company, you speak CDISC every day.

1.2 The CDISC Family of Standards

CDISC is not one standard; it is a family. Here are the ones you will encounter most:

Standard	Full Name	What it covers
CDASH	Clinical Data Acquisition Standards Harmonization	How to design CRFs (data collection forms)
SDTM	Study Data Tabulation Model	How to structure collected trial data
ADaM	Analysis Data Model	How to structure analysis-ready data
Define-XML		Machine-readable data dictionary for submissions
SEND	Standard for Exchange of Nonclinical Data	Pre-clinical (animal) studies; SENDIG is its implementation guide

In this course we focus on SDTM and ADaM. These are the two standards that statistical programmers and biostatisticians work with daily.

1.3 SDTM: Study Data Tabulation Model

1.3.1 What SDTM is

SDTM defines how collected clinical trial data is organised for submission. Think of it as a standardised filing system.

All data is organised into domains; each domain is a dataset covering one type of data. Every domain has a 2-letter code.

1.3.2 The most important SDTM domains

Domain	Code	What it contains
Demographics	DM	Age, sex, race, country, treatment arm
Adverse Events	AE	Side effects reported during the trial
Laboratory Tests	LB	Blood tests, urine tests, lab results
Vital Signs	VS	Blood pressure, heart rate, weight, height
Concomitant Medications	CM	Other drugs taken alongside trial drug
Medical History	MH	Pre-existing conditions
Exposure	EX	What drug was given, when, at what dose
Disposition	DS	Who completed or discontinued and why
Findings About	FA	Findings about an event or intervention recorded in another domain

For our GLPX-001 trial, the most important domains will be: DM, AE, LB, VS, EX, DS

1.3.3 SDTM structure rules

Every SDTM domain must follow strict rules. The key ones:

1. Every dataset has required variables

For example, almost every subject-level SDTM domain carries these identifier variables:

Variable	Meaning
`STUDYID`	Study identifier (e.g., “GLPX-001”)
`DOMAIN`	2-letter domain code (e.g., “DM”, “AE”)
`USUBJID`	Unique subject identifier: the key that links all domains
`--SEQ`	Sequence number (e.g., `AESEQ`, `LBSEQ`); not used in DM, which has one row per subject
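
The identifier rule can be checked mechanically. The sketch below is a minimal illustration, not a validation tool: the function name `missing_identifiers` and the `needs_seq` switch are hypothetical, and records are assumed to be plain dictionaries keyed by variable name.

```python
# Identifier variables from the table above. "--SEQ" expands to the domain
# code plus "SEQ" (AESEQ, LBSEQ, ...). The function name is hypothetical.
CORE_IDENTIFIERS = ("STUDYID", "DOMAIN", "USUBJID")


def missing_identifiers(record, needs_seq=True):
    """Return the identifier variables that are absent or blank in one record."""
    expected = list(CORE_IDENTIFIERS)
    domain = record.get("DOMAIN")
    if needs_seq and domain:
        expected.append(f"{domain}SEQ")
    return [name for name in expected if record.get(name) in (None, "")]


ae_record = {
    "STUDYID": "GLPX-001",
    "DOMAIN": "AE",
    "USUBJID": "GLPX-001-001-0001",
    "AESEQ": 1,
}
lb_record = {"STUDYID": "GLPX-001", "DOMAIN": "LB", "USUBJID": ""}

print(missing_identifiers(ae_record))  # []
print(missing_identifiers(lb_record))  # ['USUBJID', 'LBSEQ']
```

For a one-row-per-subject domain such as DM, the sequence check can be switched off with `needs_seq=False`.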

2. Variable names follow naming conventions

SDTM variable names are constructed from a prefix (the domain code) and a suffix that describes the content. For example in the AE domain:

Variable	Meaning
`AEDECOD`	Adverse event decoded term (standardised MedDRA term)
`AESTDTC`	Adverse event start date (ISO 8601 format)
`AEENDTC`	Adverse event end date
`AESEV`	Severity (MILD, MODERATE, SEVERE)
`AESER`	Serious adverse event flag (Y/N)
`AEREL`	Relationship to study drug

3. Dates are always ISO 8601

Dates in SDTM use the ISO 8601 format YYYY-MM-DD, extended to YYYY-MM-DDThh:mm:ss when a time was collected. Partial dates are allowed: 2024-03 means March 2024 when the exact day is unknown.
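
Partial dates mean a date string cannot simply be handed to a date parser. As a rough sketch, the hypothetical helper `date_precision` below classifies a date-only string by how much of it is known. Time components are deliberately left out to keep the example short.

```python
import datetime
import re

# Accepts YYYY, YYYY-MM, or YYYY-MM-DD. Time parts are out of scope here.
ISO_DATE = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?$")


def date_precision(dtc):
    """Classify a date-only ISO 8601 string as 'day', 'month', 'year', or 'invalid'."""
    match = ISO_DATE.match(dtc)
    if not match:
        return "invalid"
    year, month, day = match.groups()
    if month is not None and not 1 <= int(month) <= 12:
        return "invalid"
    if day is not None:
        try:
            # Let the standard library reject impossible days such as 2024-02-30.
            datetime.date(int(year), int(month), int(day))
        except ValueError:
            return "invalid"
        return "day"
    return "month" if month is not None else "year"


print(date_precision("2024-03-15"))  # day
print(date_precision("2024-03"))     # month
print(date_precision("15/03/2024"))  # invalid
```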

4. Controlled terminology

Many variables have a fixed list of allowed values. You cannot invent your own. For example:

SEX: M, F, U, UNDIFFERENTIATED
RACE: WHITE, BLACK OR AFRICAN AMERICAN, ASIAN, etc.
AESEV: MILD, MODERATE, SEVERE

These come from the CDISC Controlled Terminology list, published and updated regularly at evs.nci.nih.gov/ftp1/CDISC/
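
A controlled terminology check amounts to a set lookup. The sketch below hard-codes only the two complete codelists quoted above; in practice the lists would be loaded from the published CDISC Controlled Terminology files. The names `CODELISTS` and `ct_violations` are hypothetical.

```python
# Codelists copied from the examples above. A real check would load the
# published terminology rather than hard-code it.
CODELISTS = {
    "SEX": {"M", "F", "U", "UNDIFFERENTIATED"},
    "AESEV": {"MILD", "MODERATE", "SEVERE"},
}


def ct_violations(records, variable):
    """Return (row index, value) pairs whose value is not in the codelist."""
    allowed = CODELISTS[variable]
    return [
        (index, record.get(variable))
        for index, record in enumerate(records)
        if record.get(variable) not in allowed
    ]


dm_rows = [{"SEX": "M"}, {"SEX": "Male"}, {"SEX": "f"}]
print(ct_violations(dm_rows, "SEX"))  # [(1, 'Male'), (2, 'f')]
```

The comparison is case-sensitive on purpose: "f" is not an allowed value even though "F" is.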

1.3.4 What SDTM looks like: DM domain example

Here is what a Demographics (DM) domain looks like for 5 subjects from our GLPX-001 trial:

STUDYID	DOMAIN	USUBJID	AGE	SEX	RACE	ARMCD	ARM
GLPX-001	DM	GLPX-001-001-0001	54	M	WHITE	DRUG1	Drug 1mg
GLPX-001	DM	GLPX-001-001-0002	62	F	WHITE	DRUG2	Drug 2mg
GLPX-001	DM	GLPX-001-001-0003	48	F	ASIAN	PLACEBO	Placebo
GLPX-001	DM	GLPX-001-001-0004	71	M	BLACK OR AFRICAN AMERICAN	DRUG1	Drug 1mg
GLPX-001	DM	GLPX-001-001-0005	39	F	WHITE	DRUG2	Drug 2mg

Notice:

USUBJID is globally unique; it encodes study + site + subject
ARMCD is the short code, ARM is the full label
Every row is one subject (DM is a one-row-per-subject domain)
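
Both properties are easy to test in code. The sketch below assumes the study-site-subject pattern seen in the example rows (the exact USUBJID format is a sponsor decision), and the helper names are hypothetical.

```python
from collections import Counter


def make_usubjid(studyid, siteid, subjid):
    """Build a USUBJID in the study-site-subject pattern used in the example."""
    return f"{studyid}-{siteid}-{subjid}"


def duplicated_subjects(dm_rows):
    """Return USUBJIDs that appear more than once; DM should have none."""
    counts = Counter(row["USUBJID"] for row in dm_rows)
    return sorted(usubjid for usubjid, n in counts.items() if n > 1)


dm = [
    {"USUBJID": make_usubjid("GLPX-001", "001", "0001"), "AGE": 54},
    {"USUBJID": make_usubjid("GLPX-001", "001", "0002"), "AGE": 62},
    {"USUBJID": make_usubjid("GLPX-001", "001", "0001"), "AGE": 54},  # deliberate duplicate
]

print(dm[0]["USUBJID"])         # GLPX-001-001-0001
print(duplicated_subjects(dm))  # ['GLPX-001-001-0001']
```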

1.3.5 What SDTM looks like: LB domain example

Laboratory data is different; one subject has many lab results over time. So LB has one row per test per timepoint per subject:

STUDYID	DOMAIN	USUBJID	LBSEQ	LBTESTCD	LBTEST	LBORRES	LBORRESU	LBDTC
GLPX-001	LB	GLPX-001-001-0001	1	HBA1C	Hemoglobin A1C	8.2	%	2024-01-15
GLPX-001	LB	GLPX-001-001-0001	2	GLUC	Glucose	9.1	mmol/L	2024-01-15
GLPX-001	LB	GLPX-001-001-0001	3	HBA1C	Hemoglobin A1C	7.4	%	2024-04-15
GLPX-001	LB	GLPX-001-001-0002	1	HBA1C	Hemoglobin A1C	9.0	%	2024-01-15

Notice:

LBSEQ is the sequence number: unique within each subject's LB records
LBTESTCD is the short code, LBTEST is the full name
LBORRES / LBORRESU = original result and unit as reported
LBDTC = date of the lab test in ISO 8601 format
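
One way to produce a sequence number like LBSEQ is to count records within each subject. The sketch below numbers rows in the order they arrive, which reproduces the example table; the sort order that should precede numbering is a study-level decision and is not handled here. The function name `assign_seq` is hypothetical.

```python
from collections import defaultdict


def assign_seq(rows, seq_var="LBSEQ"):
    """Number rows 1..n within each USUBJID, keeping the incoming row order."""
    counters = defaultdict(int)
    numbered = []
    for row in rows:
        counters[row["USUBJID"]] += 1
        numbered.append({**row, seq_var: counters[row["USUBJID"]]})
    return numbered


lb = [
    {"USUBJID": "GLPX-001-001-0001", "LBTESTCD": "HBA1C", "LBDTC": "2024-01-15"},
    {"USUBJID": "GLPX-001-001-0001", "LBTESTCD": "GLUC", "LBDTC": "2024-01-15"},
    {"USUBJID": "GLPX-001-001-0001", "LBTESTCD": "HBA1C", "LBDTC": "2024-04-15"},
    {"USUBJID": "GLPX-001-001-0002", "LBTESTCD": "HBA1C", "LBDTC": "2024-01-15"},
]

print([row["LBSEQ"] for row in assign_seq(lb)])  # [1, 2, 3, 1]
```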

1.4 ADaM: Analysis Data Model

1.4.1 What ADaM is

SDTM is for tabulating what was collected. ADaM is for analysing it.

ADaM datasets are derived from SDTM and are designed specifically to support statistical analyses and the production of tables, listings, and figures (TLFs). They contain derived variables that do not exist in SDTM, such as:

Baseline values
Change from baseline
Analysis flags (which visits/records to include in analysis)
Imputed values

1.4.2 The traceability principle

The most important concept in ADaM is traceability: every derived value in ADaM must be traceable back to SDTM. You must always be able to answer: where did this number come from?

This is why regulators trust the analysis: they can follow the chain from raw data → SDTM → ADaM → TLF.
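
A small sketch of what traceability can look like in practice: if an analysis record keeps the sequence number of the SDTM record it came from, the source can be looked up directly. Carrying LBSEQ into the analysis dataset is an assumption made for this illustration, and `source_record` is a hypothetical helper. The values come from the LB example earlier in this module.

```python
def source_record(analysis_row, lb_rows):
    """Find the SDTM LB record an analysis row was derived from, or None."""
    for lb_row in lb_rows:
        same_subject = lb_row["USUBJID"] == analysis_row["USUBJID"]
        same_record = lb_row["LBSEQ"] == analysis_row["LBSEQ"]
        if same_subject and same_record:
            return lb_row
    return None


lb_rows = [
    {"USUBJID": "GLPX-001-001-0001", "LBSEQ": 1, "LBTESTCD": "HBA1C", "LBORRES": "8.2"},
    {"USUBJID": "GLPX-001-001-0001", "LBSEQ": 2, "LBTESTCD": "GLUC", "LBORRES": "9.1"},
]

# Analysis row that kept LBSEQ as its link back to SDTM (assumed for this sketch).
analysis_row = {"USUBJID": "GLPX-001-001-0001", "LBSEQ": 1, "AVAL": 8.2}

print(source_record(analysis_row, lb_rows)["LBORRES"])  # 8.2
```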

1.4.3 The most important ADaM datasets

Dataset	What it contains
ADSL	Subject-Level Analysis Dataset: one row per subject, all key baseline and treatment variables
ADAE	Adverse Events: analysis-ready AE data
ADLB	Laboratory Analysis Dataset: with baseline, change from baseline
ADEFF	Efficacy dataset (trial-specific)
ADTTE	Time-to-Event dataset (survival analyses)

ADSL is the most important. Almost every other ADaM dataset merges with ADSL to pick up subject-level variables. You always build ADSL first.

1.4.4 What ADaM looks like: ADSL example

Here is a partial ADSL for 3 subjects:

USUBJID	AGE	SEX	RACE	BMI	TRT01P	RANDDT	HBA1CBL	SAFFL	ITTFL
GLPX-001-001-0001	54	M	WHITE	31.2	Drug 1mg	2024-01-15	8.2	Y	Y
GLPX-001-001-0002	62	F	WHITE	28.9	Drug 2mg	2024-01-15	9.0	Y	Y
GLPX-001-001-0003	48	F	ASIAN	33.5	Placebo	2024-01-16	7.8	Y	Y

New variables that did not exist in SDTM:

Variable	Meaning
`TRT01P`	Planned treatment for Period 1
`RANDDT`	Randomisation date
`HBA1CBL`	HbA1c at baseline (derived from LB)
`SAFFL`	Safety population flag (Y/N)
`ITTFL`	Intent-to-treat population flag (Y/N)
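
Because other ADaM datasets pick up subject-level variables from ADSL, the merge on USUBJID is worth seeing once. The following is a minimal sketch using plain dictionaries; the helper name `merge_adsl` and the choice of variables to carry over are illustrative assumptions.

```python
def merge_adsl(rows, adsl_rows, keep=("TRT01P", "SAFFL")):
    """Copy selected ADSL variables onto each row, matching on USUBJID."""
    adsl_by_subject = {row["USUBJID"]: row for row in adsl_rows}
    merged = []
    for row in rows:
        # A KeyError here means the subject is missing from ADSL.
        subject = adsl_by_subject[row["USUBJID"]]
        merged.append({**row, **{name: subject[name] for name in keep}})
    return merged


adsl = [
    {"USUBJID": "GLPX-001-001-0001", "TRT01P": "Drug 1mg", "SAFFL": "Y"},
    {"USUBJID": "GLPX-001-001-0002", "TRT01P": "Drug 2mg", "SAFFL": "Y"},
]
lab_rows = [
    {"USUBJID": "GLPX-001-001-0001", "PARAMCD": "HBA1C", "AVAL": 8.2},
    {"USUBJID": "GLPX-001-001-0002", "PARAMCD": "HBA1C", "AVAL": 9.0},
]

for row in merge_adsl(lab_rows, adsl):
    print(row["USUBJID"], row["TRT01P"], row["AVAL"])
```

Failing loudly on a subject that is absent from ADSL is a design choice for this sketch; a silent mismatch would be harder to spot later.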

1.4.5 What ADaM looks like: ADLB example

USUBJID	PARAMCD	PARAM	AVISIT	ADT	AVAL	BASE	CHG	ANL01FL
GLPX-001-001-0001	HBA1C	HbA1c (%)	Baseline	2024-01-15	8.2	8.2	0	Y
GLPX-001-001-0001	HBA1C	HbA1c (%)	Week 26	2024-07-15	7.1	8.2	-1.1	Y
GLPX-001-001-0001	HBA1C	HbA1c (%)	Week 52	2025-01-15	6.8	8.2	-1.4	Y

Key ADaM variables:

Variable	Meaning
`PARAMCD`	Parameter code (what is being measured)
`PARAM`	Parameter label
`AVISIT`	Analysis visit label
`ADT`	Analysis date
`AVAL`	Analysis value (the actual number used in analysis)
`BASE`	Baseline value
`CHG`	Change from baseline (AVAL − BASE)
`ANL01FL`	Analysis flag: Y means include this record in the primary analysis
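
The relationship between AVAL, BASE, and CHG can be sketched in a few lines. This example assumes the baseline record is the one labelled "Baseline" in AVISIT; how baseline is actually defined depends on the study's analysis plan. The function name `add_base_chg` is hypothetical.

```python
def add_base_chg(rows):
    """Add BASE and CHG to rows holding USUBJID, PARAMCD, AVISIT, and AVAL."""
    # Assumption: one record per subject and parameter is labelled "Baseline".
    baseline = {
        (row["USUBJID"], row["PARAMCD"]): row["AVAL"]
        for row in rows
        if row["AVISIT"] == "Baseline"
    }
    derived = []
    for row in rows:
        base = baseline.get((row["USUBJID"], row["PARAMCD"]))
        # Rounding only removes floating-point noise such as -1.0999999999999996.
        chg = None if base is None else round(row["AVAL"] - base, 10)
        derived.append({**row, "BASE": base, "CHG": chg})
    return derived


adlb_input = [
    {"USUBJID": "GLPX-001-001-0001", "PARAMCD": "HBA1C", "AVISIT": "Baseline", "AVAL": 8.2},
    {"USUBJID": "GLPX-001-001-0001", "PARAMCD": "HBA1C", "AVISIT": "Week 26", "AVAL": 7.1},
    {"USUBJID": "GLPX-001-001-0001", "PARAMCD": "HBA1C", "AVISIT": "Week 52", "AVAL": 6.8},
]

print([row["CHG"] for row in add_base_chg(adlb_input)])  # [0.0, -1.1, -1.4]
```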

1.5 TLFs: Tables, Listings, and Figures

TLFs are the outputs: what the statistician and medical writer use to write the Clinical Study Report. They are produced from ADaM datasets.

1.5.1 Tables

Structured summaries. Examples:

Table 14.1.1: Summary of Demographics and Baseline Characteristics
Table 14.2.1: Primary Efficacy: Change from Baseline in HbA1c at Week 52
Table 14.3.1: Overview of Adverse Events

A demographics table looks like this:

Table 14.1.1 Summary of Demographics (Safety Population)

                          Drug 1mg     Drug 2mg     Placebo      Total
                          (N=300)      (N=300)      (N=300)      (N=900)
─────────────────────────────────────────────────────────────────────────
Age (years)
  Mean (SD)               54.2 (9.1)   53.8 (8.7)   54.5 (9.3)   54.2 (9.0)
  Median                  54.0         53.0         55.0         54.0
  Min, Max                28, 75       31, 74       29, 75       28, 75

Sex, n (%)
  Male                    152 (50.7)   148 (49.3)   155 (51.7)   455 (50.6)
  Female                  148 (49.3)   152 (50.7)   145 (48.3)   445 (49.4)

HbA1c at Baseline (%)
  Mean (SD)               8.4 (0.7)    8.3 (0.8)    8.4 (0.7)    8.4 (0.7)
─────────────────────────────────────────────────────────────────────────
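
The cells in a table like this are simple summary statistics formatted as text. As a rough sketch using only the standard library, the hypothetical helpers below produce the "Mean (SD)" and "n (%)" strings. The ages are the five subjects from the DM example, not the full trial.

```python
import statistics


def summarise_continuous(values):
    """Format mean (SD), median, and min/max the way the table above shows them."""
    return {
        "n": len(values),
        "mean_sd": f"{statistics.mean(values):.1f} ({statistics.stdev(values):.1f})",
        "median": f"{statistics.median(values):.1f}",
        "min_max": f"{min(values)}, {max(values)}",
    }


def n_pct(count, total):
    """Format a count with its percentage of the column total."""
    return f"{count} ({100 * count / total:.1f})"


ages = [54, 62, 48, 71, 39]  # AGE values from the five-subject DM example
print(summarise_continuous(ages)["mean_sd"])  # 54.8 (12.4)
print(n_pct(152, 300))                        # 152 (50.7)
```

`statistics.stdev` is the sample standard deviation (n − 1 denominator), which is the usual choice for summary tables.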

1.5.2 Listings

Raw data printed in a structured format; used for data verification. Example: Listing of all serious adverse events, one row per event.

1.5.3 Figures

Plots. Examples:

Mean HbA1c over time by treatment arm (line plot)
Forest plot of subgroup analyses
Kaplan-Meier survival curves

1.6 How SDTM, ADaM and TLFs Connect

Let us trace our primary endpoint, change from baseline in HbA1c at Week 52, through the entire pipeline:

EDC (raw)
  └── Subject 001 had HbA1c = 8.2% on 2024-01-15
  └── Subject 001 had HbA1c = 6.8% on 2025-01-15

        │
        ▼

SDTM: LB domain
  └── USUBJID=GLPX-001-001-0001, LBTESTCD=HBA1C, LBORRES=8.2, LBDTC=2024-01-15
  └── USUBJID=GLPX-001-001-0001, LBTESTCD=HBA1C, LBORRES=6.8, LBDTC=2025-01-15

        │
        ▼

ADaM: ADLB
  └── USUBJID=GLPX-001-001-0001, PARAMCD=HBA1C, AVISIT=Baseline, AVAL=8.2, BASE=8.2, CHG=0
  └── USUBJID=GLPX-001-001-0001, PARAMCD=HBA1C, AVISIT=Week 52,  AVAL=6.8, BASE=8.2, CHG=-1.4, ANL01FL=Y

        │
        ▼

TLF: Table 14.2.1
  └── Drug 1mg: Mean change from baseline in HbA1c at Week 52 = -1.4% (SD 0.6)
  └── Placebo:  Mean change from baseline in HbA1c at Week 52 = -0.3% (SD 0.5)
  └── Treatment difference: -1.1% (95% CI: -1.3, -0.9), p < 0.001

This chain, from a single lab value entered by a nurse at a hospital to a p-value in a regulatory submission, is what you are learning to build.

1.7 Quick Reference: SDTM vs ADaM

Feature	SDTM	ADaM
Purpose	Tabulate collected data	Support analysis
Source	Raw EDC data	SDTM datasets
Structure	One domain per data type	One dataset per analysis need
Derived variables	Minimal	Many (BASE, CHG, flags)
Required by FDA	Yes	Yes
Key datasets	DM, AE, LB, VS, EX	ADSL, ADAE, ADLB
Row structure	One row per observation	One row per subject per parameter per visit (ADLB); one row per subject (ADSL)

1.8 Module 1 Summary

Key takeaways

CDISC has been the mandatory standard for FDA drug submissions since 2016, and the EMA follows similar guidance
SDTM organises collected data into standardised domains (DM, AE, LB…)
ADaM creates analysis-ready datasets from SDTM (ADSL, ADLB, ADAE…)
TLFs are the outputs generated from ADaM for the clinical study report
Every value in a TLF must be traceable back through ADaM to SDTM to raw data
USUBJID is the key that links every domain and every dataset together

1.9 Your Tasks Before Module 2

✅ Checklist

Look at the SDTM domain list at wiki.cdisc.org; just browse, do not study
Can you answer: what is the difference between LBORRES and AVAL?
Can you answer: why does ADLB have a BASE column but SDTM LB does not?
Think about our GLPX-001 trial: which SDTM domains would you need?

Answers to the checklist questions (reveal after you have thought about them):

LBORRES is the original result exactly as reported (a character string like “8.2”). AVAL is the numeric analysis value after cleaning, standardising units, and applying any imputation rules.

BASE does not exist in SDTM because baseline is an analytical concept: it depends on how you define it in the SAP. Different analyses can have different baselines.

For GLPX-001 you would need at minimum: DM, AE, LB, VS, EX, DS, CM, MH.

1.10 What’s Next

In Module 2 we get hands-on with SDTM. We will:

Look at the simulated raw GLPX-001 data
Map it to SDTM domains following the rules
Write the actual SAS and R code to produce a DM domain and an LB domain
Understand what a Study Data Reviewer’s Guide (SDRG) is

This course is open source and free forever. Found an error or want to contribute? Open an issue or pull request on GitHub.