Introduction to clinical trial data analysis

Foundations of CDISC data standards

Author

Sufyan Suleman

Course Overview

Welcome

This free course is built for biologists, computational biologists, and data analysts who want to transition into clinical data science in the pharmaceutical industry. It was created from the perspective of someone making that transition firsthand, translating CDISC standards from abstract documentation into practical, working understanding. Every concept is explained through the questions that naturally arise when moving from discovery research to regulated clinical trial data, with a focus on clarity, intuition, and real‑world relevance rather than textbook formalism.

By the end of this course you will be able to:

  • Understand how clinical trial data flows from collection to regulatory submission
  • Work with CDISC standards (SDTM and ADaM) in both SAS and R
  • Produce industry-standard Tables, Listings, and Figures (TLFs)
  • Read and interpret a Statistical Analysis Plan (SAP)
  • Understand the roles and workflow inside a pharma company

What this course assumes: Some statistical background (you know what a mean, a p-value, and a regression are). No SAS experience required. No prior clinical trial experience required.


The Big Picture: What Happens to Clinical Trial Data?

Before touching any code or standard, we need to understand why all of this exists. Let us walk through a simplified story.

A drug enters a clinical trial

A pharmaceutical company is testing a new GLP-1 drug for type 2 diabetes. They recruit 2,000 patients across 50 hospitals in 10 countries. Over 2 years, they collect:

  • Demographics (age, sex, weight, ethnicity)

  • Lab results (HbA1c, fasting glucose, lipids) every 3 months

The course's example study (GLPX-001) is a Phase III randomised, double-blind, placebo-controlled trial of a fictional GLP-1 receptor agonist in adults with type 2 diabetes.

  • Population: Adults 18–75 years, HbA1c 7.5–10%, BMI 27–40 kg/m²

  • Arms: Drug 1 mg (n=300), Drug 2 mg (n=300), Placebo (n=300)

  • Duration: 52 weeks

  • Primary endpoint: Change from baseline in HbA1c at week 52

  • Key secondary endpoints: Change in body weight, fasting glucose, and blood pressure
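To make the population criteria concrete, here is a minimal sketch in base R that flags which screened subjects fall inside the limits listed above. The data frame, its column names (SUBJID, AGE, HBA1C, BMI, ELIGIBLE), and all values are invented for illustration, and the bounds are assumed to be inclusive; a real protocol spells out its inclusion and exclusion criteria in far more detail.

```r
# Hypothetical screening records (made-up values, not trial data)
screening <- data.frame(
  SUBJID = c("0001", "0002", "0003", "0004"),
  AGE    = c(54, 79, 43, 61),          # years
  HBA1C  = c(8.2, 8.8, 7.1, 9.6),      # percent
  BMI    = c(31.5, 29.0, 33.2, 41.3),  # kg/m^2
  stringsAsFactors = FALSE
)

# Limits copied from the population bullet; inclusive bounds are an assumption
screening$ELIGIBLE <- with(
  screening,
  AGE >= 18 & AGE <= 75 &
    HBA1C >= 7.5 & HBA1C <= 10 &
    BMI >= 27 & BMI <= 40
)

print(screening)
```

Only the first subject passes all three checks: the second is too old, the third has an HbA1c below 7.5%, and the fourth has a BMI above 40 kg/m².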

We chose a diabetes/obesity trial deliberately; it mirrors the kind of work done at several pharma companies on GLP-1 programs.
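The primary endpoint is simple arithmetic: the week 52 value minus the baseline value. The sketch below shows this for three invented subjects. The column names BASE, AVAL, and CHG follow the naming used in ADaM analysis datasets, but the subject identifiers and HbA1c values are made up, and a real derivation would also have to handle visit windows and missing data.

```r
# Hypothetical toy data: three made-up subjects, one per arm (not trial data)
hba1c <- data.frame(
  USUBJID = c("GLPX-001-0001", "GLPX-001-0002", "GLPX-001-0003"),
  ARM     = c("Drug 1 mg", "Drug 2 mg", "Placebo"),
  BASE    = c(8.4, 9.1, 8.0),  # HbA1c (%) at baseline
  AVAL    = c(7.2, 7.5, 7.9),  # HbA1c (%) at week 52
  stringsAsFactors = FALSE
)

# Change from baseline = week 52 value minus baseline value
hba1c$CHG <- round(hba1c$AVAL - hba1c$BASE, 1)

print(hba1c)
```

A negative CHG means HbA1c fell during the study, which is the desired direction for a diabetes treatment.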


Setting Up Your Environment

You need three things: SAS (through the free SAS OnDemand for Academics), R with RStudio, and Quarto.

SAS OnDemand for Academics (free)

SAS is the dominant tool for SDTM/ADaM programming at most pharma companies. SAS OnDemand for Academics is a free cloud-based version.

  1. Go to https://welcome.oda.sas.com
  2. Click Register; use your university email if you have one
  3. Once logged in, launch SAS Studio (the browser-based IDE)
  4. You do not need to install anything locally

Tip

SAS Studio runs in the browser. You write code on the left, and results appear on the right. It looks different from RStudio, but the logic is similar.

R and RStudio

You likely already have this. If not:

  1. Install R from https://cran.r-project.org
  2. Install RStudio from https://posit.co/downloads

Key R packages we will use throughout the course:

install.packages(c(
    "tidyverse",        # data manipulation and plotting
    "haven",            # read SAS datasets (.sas7bdat, .xpt)
    "admiral",          # ADaM dataset building (pharmaverse)
    "pharmaversesdtm",  # example SDTM data
    "gt",               # publication-quality tables
    "rtables",          # clinical-style tables
    "ggplot2",          # figures (also installed with tidyverse)
    "lubridate"         # date handling (also installed with tidyverse)
))
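If you are unsure which of these packages you already have, a quick check like the one below can save a reinstall. It is a sketch in base R: the variable names `required` and `missing_pkgs` are illustrative, and it only reports what is missing rather than installing anything.

```r
# Package list copied from the install.packages() call above
required <- c(
  "tidyverse", "haven", "admiral", "pharmaversesdtm",
  "gt", "rtables", "ggplot2", "lubridate"
)

# Compare against the packages installed in the current library paths
missing_pkgs <- required[!required %in% rownames(installed.packages())]

if (length(missing_pkgs) == 0) {
  message("All course packages are installed.")
} else {
  message("Still to install: ", paste(missing_pkgs, collapse = ", "))
}
```

Anything listed as missing can be passed straight to `install.packages(missing_pkgs)`.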

Quarto

This course is written in Quarto. To render these files locally:

  1. Install Quarto from https://quarto.org/docs/get-started
  2. In RStudio: open any .qmd file and click Render
  3. Or from the terminal: quarto render module-00-orientation.qmd
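Before rendering, it can help to confirm that R can find the Quarto command-line tool. The sketch below uses only base R and assumes the executable is named `quarto` and is on your PATH; if it is installed somewhere else, the check will report it as missing even though RStudio may still find it.

```r
# Look up the quarto executable on the PATH ("" means not found)
quarto_path <- Sys.which("quarto")

if (nzchar(quarto_path)) {
  message("Quarto found at: ", quarto_path)
  system2("quarto", "--version")  # prints the installed Quarto version
} else {
  message("Quarto not found on PATH; install it from https://quarto.org/docs/get-started")
}
```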

Verify your R setup

Run this to confirm everything is working:

library(tidyverse)
library(haven)

# If both packages load without errors, you are ready
cat("R version:", paste(R.version$major, R.version$minor, sep = "."), "\n")
cat("tidyverse loaded successfully\n")
cat("haven loaded successfully\n")

Expected output (your version number may differ):

R version: 4.4.0 
tidyverse loaded successfully
haven loaded successfully

Loading tidyverse also prints a list of attached packages and a conflicts notice (dplyr::filter() masks stats::filter(), dplyr::lag() masks stats::lag()); both are informational. Warnings of the form "package 'tidyverse' was built under R version 4.4.3" mean the installed package was built with a newer R release than the one you are running. They are usually harmless, and updating R removes them.
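A slightly stronger check is to confirm that haven can write and read a SAS transport (.xpt) file, since that is the format we will meet repeatedly. The sketch below writes a two-row data frame to a temporary file and reads it back. The dataset contents and the file name `dm.xpt` are invented for illustration; `write_xpt()` and `read_xpt()` are the haven functions for this format.

```r
library(haven)

# Hypothetical two-subject demographics table (made-up values)
dm_demo <- data.frame(
  USUBJID = c("GLPX-001-0001", "GLPX-001-0002"),
  AGE     = c(54, 61),
  stringsAsFactors = FALSE
)

# Write to a temporary .xpt file, then read it back
xpt_path <- file.path(tempdir(), "dm.xpt")
write_xpt(dm_demo, xpt_path)
dm_back <- read_xpt(xpt_path)

print(dm_back)
```

If the printed table shows the same two subjects you wrote, haven is working end to end.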

Verify your SAS setup

Paste this into SAS Studio and click Run:

/* Module 0: SAS verification */
%put SAS Version: &sysvlong;
%put Hello from SAS, setup is working;

data verify;
    name = "GLPX-001";
    status = "Ready";
    put name= status=;
run;

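The %put and put statements write to the log, so check the Log tab: you should see the version line, the hello message, and name=GLPX-001 status=Ready. If the log shows no errors, SAS is ready.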


GitHub Repo Structure

The course lives at: https://github.com/sufyansuleman/cdisc-from-scratch

Here is how the repository is organised:

cdisc-from-scratch/
│
├── README.md                    ← Course overview and how to use it
│
├── modules/
│   ├── module-00-orientation.qmd       ← This file
│   ├── module-01-cdisc-overview.qmd
│   ├── module-02-sdtm.qmd
│   ├── module-03-adam.qmd
│   ├── module-04-tlfs.qmd
│   ├── module-05-full-pipeline.qmd
│   └── module-06-industry-context.qmd
│
├── data/
│   ├── raw/                     ← Simulated raw EDC data (GLPX-001)
│   ├── sdtm/                    ← SDTM datasets we build
│   └── adam/                    ← ADaM datasets we build
│
├── sas/
│   └── macros/                  ← Reusable SAS macros
│
├── r/
│   └── functions/               ← Reusable R functions
│
├── outputs/
│   └── tlfs/                    ← Generated tables and figures
│
└── _quarto.yml                  ← Quarto project config (renders as a website)
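If you want to mirror this layout for your own practice project rather than clone the repository, the folders can be created from R. This is a sketch only: it builds the skeleton under a temporary directory (the `root` path is an assumption, so change it to wherever you keep your work) and creates empty folders, not the course files.

```r
# Assumed location for the practice copy; replace with your own path
root <- file.path(tempdir(), "cdisc-from-scratch")

# Folder names taken from the repository tree above
dirs <- c(
  "modules",
  "data/raw", "data/sdtm", "data/adam",
  "sas/macros",
  "r/functions",
  "outputs/tlfs"
)

# Create each folder, including any missing parent folders
for (d in dirs) {
  dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
}

print(list.dirs(root, full.names = FALSE, recursive = TRUE))
```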

Your First Task

Before Module 1, do these three things:

✅ Checklist

What’s Next

In Module 1 we go deep on CDISC standards: what SDTM and ADaM actually look like, what the rules are, and why they are designed the way they are. We will look at real SDTM domain examples and start to understand the logic before writing any code.


This course is open source and free forever. Found an error or want to contribute? Open an issue or pull request on GitHub.