1  Module 1: CDISC Standards Overview

Clinical Data Science for Pharma: CDISC From Scratch

Author

Sufyan Suleman

Published

May 17, 2026

1.1 What is CDISC and Why Does It Exist?

Imagine 50 hospitals across 10 countries collecting data for the same clinical trial. Each hospital uses a different database, different variable names, different date formats, different units. At the end of the trial, you have 50 different data structures that all mean the same thing.

The FDA receives hundreds of drug submissions every year. Without a common format, reviewing the data would require learning a new data structure for every single submission. That is not scalable.

CDISC (Clinical Data Interchange Standards Consortium) solves this. It is a non-profit organisation that defines exactly how clinical trial data must be structured before submission to regulators.

For studies that started after 17 December 2016, the FDA has required CDISC-compliant data (SDTM + ADaM) in new drug and biologics submissions. The EMA (Europe) follows similar guidance.

This means: if you want to submit a drug in the US or Europe, you must speak CDISC. And if you work at a pharma company, you speak CDISC every day.


1.2 The CDISC Family of Standards

CDISC is not one standard; it is a family. Here are the ones you will encounter most:

Standard    Full Name                                           What it covers
CDASH       Clinical Data Acquisition Standards Harmonization   How to design CRFs (data collection forms)
SDTM        Study Data Tabulation Model                         How to structure collected trial data
ADaM        Analysis Data Model                                 How to structure analysis-ready data
Define-XML                                                      Machine-readable data dictionary for submissions
SEND        Standard for Exchange of Nonclinical Data           Pre-clinical (animal) studies

In this course we focus on SDTM and ADaM. These are the two standards that statistical programmers and biostatisticians work with daily.


1.3 SDTM: Study Data Tabulation Model

1.3.1 What SDTM is

SDTM defines how collected clinical trial data is organised for submission. Think of it as a standardised filing system.

All data is organised into domains; each domain is a dataset covering one type of data. Every domain has a 2-letter code.

1.3.2 The most important SDTM domains

Domain                   Code  What it contains
Demographics             DM    Age, sex, race, country, treatment arm
Adverse Events           AE    Side effects reported during the trial
Laboratory Tests         LB    Blood tests, urine tests, lab results
Vital Signs              VS    Blood pressure, heart rate, weight, height
Concomitant Medications  CM    Other drugs taken alongside trial drug
Medical History          MH    Pre-existing conditions
Exposure                 EX    What drug was given, when, at what dose
Disposition              DS    Who completed or discontinued and why
Findings About           FA    Qualitative results linked to other domains

For our GLPX-001 trial, the most important domains will be DM, AE, LB, VS, EX and DS.

1.3.3 SDTM structure rules

Every SDTM domain must follow strict rules. The key ones:

1. Every dataset has required variables

For example, almost every subject-level SDTM domain has these identifier variables:

Variable  Meaning
STUDYID   Study identifier (e.g., “GLPX-001”)
DOMAIN    2-letter domain code (e.g., “DM”, “AE”)
USUBJID   Unique subject identifier: the key that links all domains
--SEQ     Sequence number (e.g., AESEQ, LBSEQ); DM has no --SEQ because it holds one row per subject

2. Variable names follow naming conventions

SDTM variable names are constructed from a prefix (the domain code) and a suffix that describes the content. For example in the AE domain:

Variable  Meaning
AEDECOD   Adverse event decoded term (standardised MedDRA term)
AESTDTC   Adverse event start date (ISO 8601 format)
AEENDTC   Adverse event end date
AESEV     Severity (MILD, MODERATE, SEVERE)
AESER     Serious adverse event flag (Y/N)
AEREL     Relationship to study drug
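
The prefix-plus-suffix convention can be sketched in a few lines of Python. The function name expand_variable and the use of "--" as a placeholder for the domain code are illustrative choices for this sketch, not part of any CDISC tooling.

```python
def expand_variable(template: str, domain: str) -> str:
    """Replace a leading "--" placeholder with a 2-letter domain code."""
    if len(domain) != 2 or not domain.isalpha():
        raise ValueError(f"expected a 2-letter domain code, got {domain!r}")
    if not template.startswith("--"):
        # Identifiers such as STUDYID and USUBJID carry no domain prefix.
        return template
    return domain.upper() + template[2:]


ae_variables = [expand_variable(t, "AE") for t in ("--SEQ", "--DECOD", "--STDTC", "--SEV")]
print(ae_variables)  # ['AESEQ', 'AEDECOD', 'AESTDTC', 'AESEV']
```

The same templates give LBSEQ or LBSTDTC when called with "LB", which is why the suffixes are worth memorising once.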

3. Dates are always ISO 8601

All dates in SDTM use ISO 8601: YYYY-MM-DD for a complete date, optionally followed by a time (YYYY-MM-DDThh:mm:ss). Partial dates are allowed: 2024-03 means March 2024 when the exact day is unknown.
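
As a minimal sketch of how a program might recognise complete and partial dates, the Python below classifies a date string by its precision. It handles date-only values (no time part), and the function name date_precision is hypothetical.

```python
import re
from datetime import date

# Date-only ISO 8601 forms: YYYY, YYYY-MM or YYYY-MM-DD (times are out of scope here).
_DATE_PATTERN = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def date_precision(dtc: str) -> str:
    """Return 'year', 'month' or 'day' for a complete or partial ISO 8601 date."""
    match = _DATE_PATTERN.fullmatch(dtc)
    if match is None:
        raise ValueError(f"not an ISO 8601 date: {dtc!r}")
    year, month, day = match.groups()
    # Build a real date (missing parts default to 1) so that e.g. month 13 is rejected.
    date(int(year), int(month or 1), int(day or 1))
    if day is not None:
        return "day"
    if month is not None:
        return "month"
    return "year"


for value in ("2024-03-15", "2024-03", "2024"):
    print(value, "->", date_precision(value))
```

A value such as 15/03/2024 raises ValueError, which is the kind of problem you want to catch before the data reaches a submission package.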

4. Controlled terminology

Many variables have a fixed list of allowed values. You cannot invent your own. For example:

  • SEX: M, F, U, UNDIFFERENTIATED
  • RACE: WHITE, BLACK OR AFRICAN AMERICAN, ASIAN, etc.
  • AESEV: MILD, MODERATE, SEVERE

These come from the CDISC Controlled Terminology list, published and updated regularly at evs.nci.nih.gov/ftp1/CDISC/
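
A controlled terminology check can be sketched as a simple set lookup. The CODELISTS dictionary below is a hand-typed subset taken from the examples above, not the published codelists, and find_ct_violations is a hypothetical helper name.

```python
# Illustrative subset only; the published codelists are longer and change over time.
CODELISTS = {
    "SEX": {"M", "F", "U", "UNDIFFERENTIATED"},
    "AESEV": {"MILD", "MODERATE", "SEVERE"},
}


def find_ct_violations(records, variable):
    """Return (row index, value) pairs whose value is not in the codelist."""
    allowed = CODELISTS[variable]
    return [
        (index, record[variable])
        for index, record in enumerate(records)
        if record[variable] not in allowed
    ]


ae = [{"AESEV": "MILD"}, {"AESEV": "Moderate"}, {"AESEV": "SEVERE"}]
violations = find_ct_violations(ae, "AESEV")
print(violations)  # [(1, 'Moderate')]
```

Note that the comparison is case-sensitive: "Moderate" is reported because only the upper-case term is in the list.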

1.3.4 What SDTM looks like: DM domain example

Here is what a Demographics (DM) domain looks like for 5 subjects from our GLPX-001 trial:

STUDYID   DOMAIN  USUBJID            AGE  SEX  RACE                       ARMCD    ARM
GLPX-001  DM      GLPX-001-001-0001  54   M    WHITE                      DRUG1    Drug 1mg
GLPX-001  DM      GLPX-001-001-0002  62   F    WHITE                      DRUG2    Drug 2mg
GLPX-001  DM      GLPX-001-001-0003  48   F    ASIAN                      PLACEBO  Placebo
GLPX-001  DM      GLPX-001-001-0004  71   M    BLACK OR AFRICAN AMERICAN  DRUG1    Drug 1mg
GLPX-001  DM      GLPX-001-001-0005  39   F    WHITE                      DRUG2    Drug 2mg

Notice:

  • USUBJID is globally unique; it encodes study + site + subject
  • ARMCD is the short code, ARM is the full label
  • Every row is one subject (DM is a one-row-per-subject domain)
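
The two structural points above, a composite USUBJID and one row per subject, can be checked with a short Python sketch. The study-site-subject layout is inferred from the example values, and both function names are hypothetical.

```python
from collections import Counter


def make_usubjid(studyid: str, siteid: str, subjid: str) -> str:
    """Join study, site and subject identifiers (assumed layout, matching the example)."""
    return f"{studyid}-{siteid}-{subjid}"


def duplicated_subjects(dm_rows):
    """Return USUBJIDs that appear more than once; DM should yield an empty list."""
    counts = Counter(row["USUBJID"] for row in dm_rows)
    return sorted(usubjid for usubjid, n in counts.items() if n > 1)


dm = [
    {"USUBJID": make_usubjid("GLPX-001", "001", "0001"), "ARMCD": "DRUG1"},
    {"USUBJID": make_usubjid("GLPX-001", "001", "0002"), "ARMCD": "DRUG2"},
    {"USUBJID": make_usubjid("GLPX-001", "001", "0003"), "ARMCD": "PLACEBO"},
]
print(duplicated_subjects(dm))  # []
```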

1.3.5 What SDTM looks like: LB domain example

Laboratory data is different; one subject has many lab results over time. So LB has one row per test per timepoint per subject:

STUDYID   DOMAIN  USUBJID            LBSEQ  LBTESTCD  LBTEST          LBORRES  LBORRESU  LBDTC
GLPX-001  LB      GLPX-001-001-0001  1      HBA1C     Hemoglobin A1C  8.2      %         2024-01-15
GLPX-001  LB      GLPX-001-001-0001  2      GLUC      Glucose         9.1      mmol/L    2024-01-15
GLPX-001  LB      GLPX-001-001-0001  3      HBA1C     Hemoglobin A1C  7.4      %         2024-04-15
GLPX-001  LB      GLPX-001-001-0002  1      HBA1C     Hemoglobin A1C  9.0      %         2024-01-15

Notice:

  • LBSEQ is the sequence number: unique within each subject's LB records, so numbering restarts at 1 for the next subject
  • LBTESTCD is the short code, LBTEST is the full name
  • LBORRES / LBORRESU = original result and unit as reported
  • LBDTC = date of the lab test in ISO 8601 format
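
One way to derive a per-subject sequence number is sketched below in Python. The sort order (subject, then test date) is an assumption made for this example; a real study defines its own sort keys in the mapping specification.

```python
from itertools import groupby


def assign_lbseq(rows):
    """Number each subject's records 1, 2, 3... in date order."""
    # sorted() is stable, so records sharing a subject and date keep their input order.
    ordered = sorted(rows, key=lambda r: (r["USUBJID"], r["LBDTC"]))
    numbered = []
    for _, subject_rows in groupby(ordered, key=lambda r: r["USUBJID"]):
        for seq, row in enumerate(subject_rows, start=1):
            numbered.append({**row, "LBSEQ": seq})
    return numbered


lb = [
    {"USUBJID": "GLPX-001-001-0001", "LBTESTCD": "HBA1C", "LBDTC": "2024-01-15"},
    {"USUBJID": "GLPX-001-001-0001", "LBTESTCD": "GLUC", "LBDTC": "2024-01-15"},
    {"USUBJID": "GLPX-001-001-0001", "LBTESTCD": "HBA1C", "LBDTC": "2024-04-15"},
    {"USUBJID": "GLPX-001-001-0002", "LBTESTCD": "HBA1C", "LBDTC": "2024-01-15"},
]
print([row["LBSEQ"] for row in assign_lbseq(lb)])  # [1, 2, 3, 1]
```

Because ISO 8601 dates sort correctly as plain strings, no date parsing is needed for the ordering step.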

1.4 ADaM: Analysis Data Model

1.4.1 What ADaM is

SDTM is for tabulating what was collected. ADaM is for analysing it.

ADaM datasets are derived from SDTM and are designed specifically to support statistical analyses and the production of TLFs (tables, listings and figures). They contain derived variables that do not exist in SDTM, such as:

  • Baseline values
  • Change from baseline
  • Analysis flags (which visits/records to include in analysis)
  • Imputed values

1.4.2 The traceability principle

The most important concept in ADaM is traceability: every derived value in ADaM must be traceable back to SDTM. You must always be able to answer: where did this number come from?

This is why regulators trust the analysis: they can follow the chain from raw data → SDTM → ADaM → TLF.

1.4.3 The most important ADaM datasets

Dataset  What it contains
ADSL     Subject-Level Analysis Dataset: one row per subject, all key baseline and treatment variables
ADAE     Adverse Events: analysis-ready AE data
ADLB     Laboratory Analysis Dataset: with baseline, change from baseline
ADEFF    Efficacy dataset (trial-specific)
ADTTE    Time-to-Event dataset (survival analyses)

ADSL is the most important. Almost every other ADaM dataset merges with ADSL to pick up subject-level variables. You always build ADSL first.
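
The merge with ADSL can be sketched as a lookup keyed on USUBJID. This is a simplified illustration in plain Python: merge_adsl is a hypothetical helper, and the choice of which variables to carry over is an assumption.

```python
def merge_adsl(rows, adsl_rows, keep=("TRT01P", "SAFFL", "ITTFL")):
    """Attach selected ADSL variables to each row, matching on USUBJID."""
    lookup = {row["USUBJID"]: row for row in adsl_rows}
    merged = []
    for row in rows:
        subject = lookup[row["USUBJID"]]  # KeyError flags a subject missing from ADSL
        merged.append({**row, **{var: subject[var] for var in keep}})
    return merged


adsl = [
    {"USUBJID": "GLPX-001-001-0001", "TRT01P": "Drug 1mg", "SAFFL": "Y", "ITTFL": "Y"},
    {"USUBJID": "GLPX-001-001-0003", "TRT01P": "Placebo", "SAFFL": "Y", "ITTFL": "Y"},
]
lab_rows = [
    {"USUBJID": "GLPX-001-001-0001", "PARAMCD": "HBA1C", "AVAL": 8.2},
    {"USUBJID": "GLPX-001-001-0001", "PARAMCD": "HBA1C", "AVAL": 7.1},
]
merged_rows = merge_adsl(lab_rows, adsl)
print(merged_rows[0]["TRT01P"])  # Drug 1mg
```

Failing loudly on a subject that is absent from ADSL is a deliberate choice in this sketch: every analysis record should belong to a subject that ADSL knows about.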

1.4.4 What ADaM looks like: ADSL example

Here is a partial ADSL for 3 subjects:

USUBJID            AGE  SEX  RACE   BMI   TRT01P    RANDDT      HBA1CBL  SAFFL  ITTFL
GLPX-001-001-0001  54   M    WHITE  31.2  Drug 1mg  2024-01-15  8.2      Y      Y
GLPX-001-001-0002  62   F    WHITE  28.9  Drug 2mg  2024-01-15  9.0      Y      Y
GLPX-001-001-0003  48   F    ASIAN  33.5  Placebo   2024-01-16  7.8      Y      Y

New variables that did not exist in SDTM:

Variable  Meaning
TRT01P    Planned treatment for Period 1
RANDDT    Randomisation date
HBA1CBL   HbA1c at baseline (derived from LB)
SAFFL     Safety population flag (Y/N)
ITTFL     Intent-to-treat population flag (Y/N)

1.4.5 What ADaM looks like: ADLB example

USUBJID            PARAMCD  PARAM      AVISIT    ADT         AVAL  BASE  CHG   ANL01FL
GLPX-001-001-0001  HBA1C    HbA1c (%)  Baseline  2024-01-15  8.2   8.2   0     Y
GLPX-001-001-0001  HBA1C    HbA1c (%)  Week 26   2024-07-15  7.1   8.2   -1.1  Y
GLPX-001-001-0001  HBA1C    HbA1c (%)  Week 52   2025-01-15  6.8   8.2   -1.4  Y

Key ADaM variables:

Variable  Meaning
PARAMCD   Parameter code (what is being measured)
PARAM     Parameter label
AVISIT    Analysis visit label
ADT       Analysis date
AVAL      Analysis value (the actual number used in analysis)
BASE      Baseline value
CHG       Change from baseline (AVAL − BASE)
ANL01FL   Analysis flag: Y means include this record in the primary analysis
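
The BASE and CHG derivation can be sketched as follows. This Python example assumes that baseline is simply the record whose AVISIT equals "Baseline"; in a real study the baseline rule comes from the SAP, and add_base_and_chg is a hypothetical name.

```python
def add_base_and_chg(rows, baseline_visit="Baseline"):
    """Add BASE and CHG (AVAL - BASE) per subject and parameter."""
    base_by_key = {
        (r["USUBJID"], r["PARAMCD"]): r["AVAL"]
        for r in rows
        if r["AVISIT"] == baseline_visit
    }
    derived = []
    for r in rows:
        base = base_by_key.get((r["USUBJID"], r["PARAMCD"]))
        # Round to suppress floating-point noise in the subtraction.
        chg = None if base is None else round(r["AVAL"] - base, 6)
        derived.append({**r, "BASE": base, "CHG": chg})
    return derived


adlb = add_base_and_chg([
    {"USUBJID": "GLPX-001-001-0001", "PARAMCD": "HBA1C", "AVISIT": "Baseline", "AVAL": 8.2},
    {"USUBJID": "GLPX-001-001-0001", "PARAMCD": "HBA1C", "AVISIT": "Week 26", "AVAL": 7.1},
    {"USUBJID": "GLPX-001-001-0001", "PARAMCD": "HBA1C", "AVISIT": "Week 52", "AVAL": 6.8},
])
print([r["CHG"] for r in adlb])  # [0.0, -1.1, -1.4]
```

The printed values match the CHG column in the ADLB example: 7.1 − 8.2 = −1.1 and 6.8 − 8.2 = −1.4.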

1.5 TLFs: Tables, Listings, and Figures

TLFs are the outputs: what the statistician and medical writer use to write the Clinical Study Report. They are produced from ADaM datasets.

1.5.1 Tables

Structured summaries. Examples:

  • Table 14.1.1: Summary of Demographics and Baseline Characteristics
  • Table 14.2.1: Primary Efficacy: Change from Baseline in HbA1c at Week 52
  • Table 14.3.1: Overview of Adverse Events

A demographics table looks like this:

Table 14.1.1 Summary of Demographics (Safety Population)

                          Drug 1mg     Drug 2mg     Placebo      Total
                          (N=300)      (N=300)      (N=300)      (N=900)
─────────────────────────────────────────────────────────────────────────
Age (years)
  Mean (SD)               54.2 (9.1)   53.8 (8.7)   54.5 (9.3)   54.2 (9.0)
  Median                  54.0         53.0         55.0         54.0
  Min, Max                28, 75       31, 74       29, 75       28, 75

Sex, n (%)
  Male                    152 (50.7)   148 (49.3)   155 (51.7)   455 (50.6)
  Female                  148 (49.3)   152 (50.7)   145 (48.3)   445 (49.4)

HbA1c at Baseline (%)
  Mean (SD)               8.4 (0.7)    8.3 (0.8)    8.4 (0.7)    8.4 (0.7)
─────────────────────────────────────────────────────────────────────────
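
The cells in a table like this come from a handful of summary statistics. The Python sketch below shows one way to compute them, using the ages of the five subjects from the DM example rather than the 900 subjects summarised above; the helper names and the one-decimal rounding are assumptions.

```python
from statistics import mean, median, stdev


def summarise_continuous(values):
    """Summary statistics for a continuous variable such as age."""
    return {
        "n": len(values),
        "mean": round(mean(values), 1),
        "sd": round(stdev(values), 1),  # sample standard deviation
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


def n_percent(count, total):
    """Format a categorical cell as 'n (%)' with one decimal place."""
    return f"{count} ({100 * count / total:.1f})"


ages = [54, 62, 48, 71, 39]  # the five subjects in the DM example
age_summary = summarise_continuous(ages)
print(age_summary["mean"], age_summary["sd"])  # 54.8 12.4
print(n_percent(152, 300))  # 152 (50.7)
```

The last line reproduces the Male cell of the Drug 1mg column: 152 of 300 subjects is 50.7%.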

1.5.2 Listings

Raw data printed in a structured format; used for data verification. Example: Listing of all serious adverse events, one row per event.

1.5.3 Figures

Plots. Examples:

  • Mean HbA1c over time by treatment arm (line plot)
  • Forest plot of subgroup analyses
  • Kaplan-Meier survival curves

1.6 How SDTM, ADaM and TLFs Connect

Let us trace our primary endpoint, change from baseline in HbA1c at Week 52, through the entire pipeline:

EDC (raw)
  └── Subject 001 had HbA1c = 8.2% on 2024-01-15
  └── Subject 001 had HbA1c = 6.8% on 2025-01-15

        │
        ▼

SDTM: LB domain
  └── USUBJID=GLPX-001-001-0001, LBTESTCD=HBA1C, LBORRES=8.2, LBDTC=2024-01-15
  └── USUBJID=GLPX-001-001-0001, LBTESTCD=HBA1C, LBORRES=6.8, LBDTC=2025-01-15

        │
        ▼

ADaM: ADLB
  └── USUBJID=GLPX-001-001-0001, PARAMCD=HBA1C, AVISIT=Baseline, AVAL=8.2, BASE=8.2, CHG=0
  └── USUBJID=GLPX-001-001-0001, PARAMCD=HBA1C, AVISIT=Week 52,  AVAL=6.8, BASE=8.2, CHG=-1.4, ANL01FL=Y

        │
        ▼

TLF: Table 14.2.1
  └── Drug 1mg: Mean change from baseline in HbA1c at Week 52 = -1.4% (SD 0.6)
  └── Placebo:  Mean change from baseline in HbA1c at Week 52 = -0.3% (SD 0.5)
  └── Treatment difference: -1.1% (95% CI: -1.3, -0.9), p < 0.001

This chain, from a single lab value entered by a nurse at a hospital to a p-value in a regulatory submission, is what you are learning to build.


1.7 Quick Reference: SDTM vs ADaM

Feature            SDTM                      ADaM
Purpose            Tabulate collected data   Support analysis
Source             Raw EDC data              SDTM datasets
Structure          One domain per data type  One dataset per analysis need
Derived variables  Minimal                   Many (BASE, CHG, flags)
Required by FDA    Yes                       Yes
Key datasets       DM, AE, LB, VS, EX        ADSL, ADAE, ADLB
Row structure      One row per observation   One row per subject (ADSL); one row per subject per parameter per visit (ADLB)

1.8 Module 1 Summary

Key takeaways
  • CDISC standards have been required for FDA drug submissions since 2016, and the EMA follows similar guidance
  • SDTM organises collected data into standardised domains (DM, AE, LB…)
  • ADaM creates analysis-ready datasets from SDTM (ADSL, ADLB, ADAE…)
  • TLFs are the outputs generated from ADaM for the clinical study report
  • Every value in a TLF must be traceable back through ADaM to SDTM to raw data
  • USUBJID is the key that links every domain and every dataset together

1.9 Your Tasks Before Module 2

✅ Checklist

Answers to the checklist questions (reveal after you have thought about it):

LBORRES is the original result exactly as reported (a character string like “8.2”). AVAL is the numeric analysis value after cleaning, standardising units, and applying any imputation rules.

BASE does not exist in SDTM because baseline is an analytical concept: it depends on how you define it in the SAP. Different analyses can have different baselines.

For GLPX-001 you would need at minimum: DM, AE, LB, VS, EX, DS, CM, MH.


1.10 What’s Next

In Module 2 we get hands-on with SDTM. We will:

  • Look at the simulated raw GLPX-001 data
  • Map it to SDTM domains following the rules
  • Write the actual SAS and R code to produce a DM domain and an LB domain
  • Understand what a Study Data Reviewer’s Guide (SDRG) is

This course is open source and free forever. Found an error or want to contribute? Open an issue or pull request on GitHub.