2 Module 2: SDTM Deep Dive

Clinical Data Science for Pharma: CDISC From Scratch

Author

cdisc-from-scratch contributors

Published

May 17, 2026

2.1 What We Do in This Module

In Module 1 we learned what SDTM is. In this module we actually build it.

By the end of this module you will:

Understand the raw data you start with (simulated EDC export)
Know the rules for mapping raw data to SDTM
Build a DM domain (Demographics) in both SAS and R
Build an LB domain (Laboratory) in both SAS and R
Understand what a Study Data Reviewer’s Guide (SDRG) is

We will use our running trial: GLPX-001.

2.2 Step 1: The Raw Data

Before SDTM exists, you have raw data exported from the EDC system. This is messy, inconsistent, and not submission-ready.

Here is what a raw demographics export might look like from GLPX-001:

pt_id	site	rand_date	age	gender	ethnic	weight_kg	height_cm	trt_grp
0001	001	15Jan2024	54	Male	Caucasian	92.3	172	A
0002	001	15Jan2024	62	Female	Caucasian	78.1	165	B
0003	001	16Jan2024	48	Female	Asian	88.5	162	P
0004	002	17Jan2024	71	Male	African American	95.0	178	A
0005	002	17Jan2024	39	Female	Caucasian	70.2	160	B

Problems with this raw data:

pt_id is not guaranteed to be globally unique: sites often number subjects independently, so in a larger trial site 001 and site 002 could both have a 0001
gender uses “Male/Female” — SDTM requires M/F
ethnic uses free text — SDTM requires controlled terminology
rand_date is in SAS date format (15Jan2024) — SDTM requires ISO 8601
trt_grp uses A/B/P codes — SDTM needs the full arm description
No STUDYID, no DOMAIN column

Mapping is the process of converting this raw data to SDTM. This is the core job of an SDTM programmer.
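To make the subject-ID problem concrete, here is a minimal base R sketch. The `toy` data frame is invented for illustration and is not part of GLPX-001: it assumes two sites that both issue subject number 0001. The raw ID collides, while a key built from study, site, and subject does not.

```r
# Hypothetical raw export: two sites that both number subjects from 0001
toy <- data.frame(
  pt_id = c("0001", "0002", "0001"),
  site  = c("001",  "001",  "002"),
  stringsAsFactors = FALSE
)

# pt_id alone is duplicated across sites
dup_pt_id <- anyDuplicated(toy$pt_id) > 0

# Combining study, site, and subject gives a unique key
toy$usubjid <- paste("GLPX-001", toy$site, toy$pt_id, sep = "-")
dup_usubjid <- anyDuplicated(toy$usubjid) > 0

print(toy)
cat("pt_id duplicated:  ", dup_pt_id, "\n")
cat("usubjid duplicated:", dup_usubjid, "\n")
```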

2.3 Step 2: Create the Simulated Raw Data

Before mapping, we need our raw data to exist. Let us create it in both SAS and R so we have something to work with throughout the module.

2.3.1 Create raw data in SAS

/*=============================================================
  GLPX-001 | Module 2 | Create simulated raw demographics data
=============================================================*/

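/* Assumes the RAW libref has already been assigned */
data raw.demographics;
  infile datalines delimiter='|' missover;
  /* Set lengths explicitly: list input defaults to $8, which would
     truncate "15Jan2024" and "African American" */
  length pt_id $4 site $3 rand_date $9 gender $6 ethnic $20 trt_grp $1;
  input pt_id $ site $ rand_date $ age gender $ ethnic $
        weight_kg height_cm trt_grp $;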
  datalines;
0001|001|15Jan2024|54|Male|Caucasian|92.3|172|A
0002|001|15Jan2024|62|Female|Caucasian|78.1|165|B
0003|001|16Jan2024|48|Female|Asian|88.5|162|P
0004|002|17Jan2024|71|Male|African American|95.0|178|A
0005|002|17Jan2024|39|Female|Caucasian|70.2|160|B
0006|002|18Jan2024|58|Male|Hispanic|88.0|175|P
0007|003|18Jan2024|45|Female|Caucasian|82.3|168|A
0008|003|19Jan2024|67|Male|Asian|97.1|170|B
0009|003|19Jan2024|52|Female|African American|76.4|163|P
0010|003|20Jan2024|44|Male|Caucasian|91.5|180|A
;
run;

/* Raw lab data */
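data raw.labs;
  infile datalines delimiter='|' missover;
  /* Set lengths explicitly: the $8 default would truncate
     "15Jan2024" and "Screening" */
  length pt_id $4 site $3 lab_date $9 test_name $20 unit $10 visit_name $20;
  input pt_id $ site $ lab_date $ test_name $ result unit $
        visit_num visit_name $;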
  datalines;
0001|001|15Jan2024|HbA1c|8.2|%|1|Screening
0001|001|15Jan2024|Glucose|9.1|mmol/L|1|Screening
0001|001|15Apr2024|HbA1c|7.6|%|2|Week 13
0001|001|15Jul2024|HbA1c|7.1|%|3|Week 26
0001|001|15Jan2025|HbA1c|6.8|%|4|Week 52
0002|001|15Jan2024|HbA1c|9.0|%|1|Screening
0002|001|15Jan2024|Glucose|10.2|mmol/L|1|Screening
0002|001|15Apr2024|HbA1c|8.3|%|2|Week 13
0002|001|15Jul2024|HbA1c|7.8|%|3|Week 26
0002|001|15Jan2025|HbA1c|7.2|%|4|Week 52
0003|001|16Jan2024|HbA1c|7.8|%|1|Screening
0003|001|16Jan2024|Glucose|8.9|mmol/L|1|Screening
0003|001|16Apr2024|HbA1c|7.9|%|2|Week 13
0003|001|16Jul2024|HbA1c|7.7|%|3|Week 26
0003|001|16Jan2025|HbA1c|7.5|%|4|Week 52
;
run;

2.3.2 Create raw data in R

library(tidyverse)
library(lubridate)

# Raw demographics
raw_demographics <- tribble(
  ~pt_id, ~site, ~rand_date,    ~age, ~gender,            ~ethnic,              ~weight_kg, ~height_cm, ~trt_grp,
  "0001", "001", "15Jan2024",    54,  "Male",              "Caucasian",           92.3,       172,        "A",
  "0002", "001", "15Jan2024",    62,  "Female",            "Caucasian",           78.1,       165,        "B",
  "0003", "001", "16Jan2024",    48,  "Female",            "Asian",               88.5,       162,        "P",
  "0004", "002", "17Jan2024",    71,  "Male",              "African American",    95.0,       178,        "A",
  "0005", "002", "17Jan2024",    39,  "Female",            "Caucasian",           70.2,       160,        "B",
  "0006", "002", "18Jan2024",    58,  "Male",              "Hispanic",            88.0,       175,        "P",
  "0007", "003", "18Jan2024",    45,  "Female",            "Caucasian",           82.3,       168,        "A",
  "0008", "003", "19Jan2024",    67,  "Male",              "Asian",               97.1,       170,        "B",
  "0009", "003", "19Jan2024",    52,  "Female",            "African American",    76.4,       163,        "P",
  "0010", "003", "20Jan2024",    44,  "Male",              "Caucasian",           91.5,       180,        "A"
)

# Raw lab data
raw_labs <- tribble(
  ~pt_id, ~site, ~lab_date,    ~test_name, ~result, ~unit,    ~visit_num, ~visit_name,
  "0001", "001", "15Jan2024",  "HbA1c",    8.2,     "%",      1,          "Screening",
  "0001", "001", "15Jan2024",  "Glucose",  9.1,     "mmol/L", 1,          "Screening",
  "0001", "001", "15Apr2024",  "HbA1c",    7.6,     "%",      2,          "Week 13",
  "0001", "001", "15Jul2024",  "HbA1c",    7.1,     "%",      3,          "Week 26",
  "0001", "001", "15Jan2025",  "HbA1c",    6.8,     "%",      4,          "Week 52",
  "0002", "001", "15Jan2024",  "HbA1c",    9.0,     "%",      1,          "Screening",
  "0002", "001", "15Jan2024",  "Glucose",  10.2,    "mmol/L", 1,          "Screening",
  "0002", "001", "15Apr2024",  "HbA1c",    8.3,     "%",      2,          "Week 13",
  "0002", "001", "15Jul2024",  "HbA1c",    7.8,     "%",      3,          "Week 26",
  "0002", "001", "15Jan2025",  "HbA1c",    7.2,     "%",      4,          "Week 52",
  "0003", "001", "16Jan2024",  "HbA1c",    7.8,     "%",      1,          "Screening",
  "0003", "001", "16Jan2024",  "Glucose",  8.9,     "mmol/L", 1,          "Screening",
  "0003", "001", "16Apr2024",  "HbA1c",    7.9,     "%",      2,          "Week 13",
  "0003", "001", "16Jul2024",  "HbA1c",    7.7,     "%",      3,          "Week 26",
  "0003", "001", "16Jan2025",  "HbA1c",    7.5,     "%",      4,          "Week 52"
)

glimpse(raw_demographics)
glimpse(raw_labs)

2.4 Step 3: Build the DM Domain

The Demographics (DM) domain is always built first because every other domain needs USUBJID — which is constructed here.

2.4.1 The mapping rules for DM

SDTM Variable	Source	Rule
`STUDYID`	Hardcoded	“GLPX-001”
`DOMAIN`	Hardcoded	“DM”
`USUBJID`	Constructed	STUDYID + “-” + site + “-” + pt_id
`SUBJID`	pt_id	As-is
`SITEID`	site	As-is
`AGE`	age	As-is (numeric)
`AGEU`	Hardcoded	“YEARS”
`SEX`	gender	Male → M, Female → F
`RACE`	ethnic	Map to CDISC controlled terminology
`RFSTDTC`	rand_date	Convert to ISO 8601 (YYYY-MM-DD)
`ARMCD`	trt_grp	A → DRUG1, B → DRUG2, P → PLACEBO
`ARM`	trt_grp	A → “Drug 1mg”, B → “Drug 2mg”, P → “Placebo”
`ACTARMCD`	trt_grp	Same as ARMCD (actual = planned here)
`ACTARM`	trt_grp	Same as ARM
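Rules like the ones above can also be kept as a lookup table rather than as if/else chains, which some teams find easier to review against the mapping specification. A minimal base R sketch of that idea follows; `arm_lookup` and `map_arm()` are hypothetical names introduced here, not part of the course code.

```r
# Hypothetical lookup table holding the ARMCD/ARM rules from the table above
arm_lookup <- data.frame(
  trt_grp = c("A", "B", "P"),
  ARMCD   = c("DRUG1", "DRUG2", "PLACEBO"),
  ARM     = c("Drug 1mg", "Drug 2mg", "Placebo"),
  stringsAsFactors = FALSE
)

# Look up each raw code; unmatched codes come back as NA so they are easy to spot
map_arm <- function(trt_grp) {
  idx <- match(trt_grp, arm_lookup$trt_grp)
  data.frame(
    ARMCD = arm_lookup$ARMCD[idx],
    ARM   = arm_lookup$ARM[idx],
    stringsAsFactors = FALSE
  )
}

# "X" is deliberately not in the lookup table
mapped <- map_arm(c("A", "P", "X"))
print(mapped)
```

Returning NA for an unknown code, instead of a default value, keeps unexpected raw data visible in later checks.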

2.4.2 Build DM in SAS

/*=============================================================
  GLPX-001 | Module 2 | Build SDTM DM domain
=============================================================*/

/* Step 1: Define study-level constants */
%let studyid = GLPX-001;

/* Step 2: Map raw demographics to SDTM DM */
data sdtm.dm;
  /* SAS names are case-insensitive, so raw ETHNIC would collide with
     the SDTM ETHNIC variable; rename it on the way in */
  set raw.demographics (rename=(ethnic=_ethnic));

  /* Declare lengths up front. Otherwise the first assignment sets the
     length and later values are truncated (e.g. "PLACEBO" to "PLACE") */
  length STUDYID $8 DOMAIN $2 USUBJID $20 SUBJID $4 SITEID $3
         AGEU $5 SEX $1 RACE $25 ETHNIC $22 RFSTDTC $10
         ARMCD $7 ARM $8 ACTARMCD $7 ACTARM $8;

  /* Study and domain */
  STUDYID = "&studyid";
  DOMAIN  = "DM";

  /* Unique subject identifier */
  USUBJID = cats(STUDYID, "-", site, "-", pt_id);
  SUBJID  = pt_id;
  SITEID  = site;

  /* Age */
  AGE  = age;
  AGEU = "YEARS";

  /* Sex: map to CDISC controlled terminology */
  if upcase(gender) = "MALE"   then SEX = "M";
  else if upcase(gender) = "FEMALE" then SEX = "F";
  else SEX = "U";

  /* Race: map to CDISC controlled terminology */
  select (upcase(_ethnic));
    when ("CAUCASIAN")          RACE = "WHITE";
    when ("ASIAN")              RACE = "ASIAN";
    when ("AFRICAN AMERICAN")   RACE = "BLACK OR AFRICAN AMERICAN";
    when ("HISPANIC")           RACE = "WHITE";  /* Hispanic is ETHNIC not RACE */
    otherwise                   RACE = "UNKNOWN";
  end;

  /* Ethnicity (separate from race in CDISC) */
  if upcase(_ethnic) = "HISPANIC" then ETHNIC = "HISPANIC OR LATINO";
  else ETHNIC = "NOT HISPANIC OR LATINO";

  /* Randomisation date: convert DATE9-style text to ISO 8601 */
  _rand_date = input(rand_date, date9.);
  RFSTDTC = put(_rand_date, yymmdd10.);  /* Produces YYYY-MM-DD */

  /* Treatment arm */
  select (trt_grp);
    when ("A") do; ARMCD = "DRUG1";   ARM = "Drug 1mg";  end;
    when ("B") do; ARMCD = "DRUG2";   ARM = "Drug 2mg";  end;
    when ("P") do; ARMCD = "PLACEBO"; ARM = "Placebo";   end;
    otherwise  do; ARMCD = "";        ARM = "";           end;
  end;

  /* Actual arm = planned arm (no switches in this trial) */
  ACTARMCD = ARMCD;
  ACTARM   = ARM;

  /* Keep only SDTM variables — drop raw source variables */
  keep STUDYID DOMAIN USUBJID SUBJID SITEID
       AGE AGEU SEX RACE ETHNIC RFSTDTC
       ARMCD ARM ACTARMCD ACTARM;

run;

/* Step 3: Sort by USUBJID (required for SDTM submission) */
proc sort data=sdtm.dm;
  by USUBJID;
run;

/* Step 4: Quick check */
proc print data=sdtm.dm (obs=5);
  title "SDTM DM Domain — First 5 Records";
run;

2.4.3 Build DM in R

library(tidyverse)
library(lubridate)

# Helper: parse SAS-style dates like "15Jan2024" to Date
parse_sas_date <- function(x) {
  as.Date(x, format = "%d%b%Y")
}

# Helper: format Date to ISO 8601 character
to_iso8601 <- function(x) {
  format(x, "%Y-%m-%d")
}

# Map race to CDISC controlled terminology
map_race <- function(ethnic) {
  case_when(
    str_to_upper(ethnic) == "CAUCASIAN"         ~ "WHITE",
    str_to_upper(ethnic) == "ASIAN"             ~ "ASIAN",
    str_to_upper(ethnic) == "AFRICAN AMERICAN"  ~ "BLACK OR AFRICAN AMERICAN",
    str_to_upper(ethnic) == "HISPANIC"          ~ "WHITE",
    TRUE                                         ~ "UNKNOWN"
  )
}

# Map ethnicity to CDISC controlled terminology
map_ethnic <- function(ethnic) {
  case_when(
    str_to_upper(ethnic) == "HISPANIC" ~ "HISPANIC OR LATINO",
    TRUE                                ~ "NOT HISPANIC OR LATINO"
  )
}

# Build DM domain
sdtm_dm <- raw_demographics |>
  mutate(
    STUDYID  = "GLPX-001",
    DOMAIN   = "DM",
    USUBJID  = paste(STUDYID, site, pt_id, sep = "-"),
    SUBJID   = pt_id,
    SITEID   = site,
    AGE      = age,
    AGEU     = "YEARS",
    SEX      = case_when(
                 str_to_upper(gender) == "MALE"   ~ "M",
                 str_to_upper(gender) == "FEMALE" ~ "F",
                 TRUE                              ~ "U"
               ),
    RACE     = map_race(ethnic),
    ETHNIC   = map_ethnic(ethnic),
    RFSTDTC  = to_iso8601(parse_sas_date(rand_date)),
    ARMCD    = case_when(
                 trt_grp == "A" ~ "DRUG1",
                 trt_grp == "B" ~ "DRUG2",
                 trt_grp == "P" ~ "PLACEBO"
               ),
    ARM      = case_when(
                 trt_grp == "A" ~ "Drug 1mg",
                 trt_grp == "B" ~ "Drug 2mg",
                 trt_grp == "P" ~ "Placebo"
               ),
    ACTARMCD = ARMCD,
    ACTARM   = ARM
  ) |>
  # Keep only SDTM variables
  select(STUDYID, DOMAIN, USUBJID, SUBJID, SITEID,
         AGE, AGEU, SEX, RACE, ETHNIC, RFSTDTC,
         ARMCD, ARM, ACTARMCD, ACTARM) |>
  # Sort by USUBJID
  arrange(USUBJID)

# Preview
print(sdtm_dm)

2.4.4 What the DM domain looks like

After mapping, the first 3 rows of sdtm.dm should look like this (selected columns only):

STUDYID	DOMAIN	USUBJID	AGE	SEX	RACE	ETHNIC	RFSTDTC	ARMCD	ARM
GLPX-001	DM	GLPX-001-001-0001	54	M	WHITE	NOT HISPANIC OR LATINO	2024-01-15	DRUG1	Drug 1mg
GLPX-001	DM	GLPX-001-001-0002	62	F	WHITE	NOT HISPANIC OR LATINO	2024-01-15	DRUG2	Drug 2mg
GLPX-001	DM	GLPX-001-001-0003	48	F	ASIAN	NOT HISPANIC OR LATINO	2024-01-16	PLACEBO	Placebo

Common mistake

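HISPANIC is an ethnicity in CDISC, not a race. The RACE variable uses categories like WHITE, ASIAN, BLACK OR AFRICAN AMERICAN. ETHNIC is a separate variable. Because our raw export collected both in a single field, coding Hispanic subjects as RACE = WHITE and everyone else as NOT HISPANIC OR LATINO is a simplification for this simulated trial; in a real study, race and ethnicity should be collected separately, and values that were not collected should not be assumed. Always check the CDISC Controlled Terminology list when you are unsure.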
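One way to catch terminology problems early is to list every raw value that fell through to the default branch of a mapping. The following is a minimal base R sketch under stated assumptions: `raw_ethnic` and `race_lookup` are hypothetical, simplified stand-ins for the course objects, and the raw values are made up.

```r
# Hypothetical raw values, including one the mapping does not know about
raw_ethnic <- c("Caucasian", "Asian", "Pacific Islander", "caucasian")

# Simplified race lookup keyed on the upper-cased raw value
race_lookup <- c(
  "CAUCASIAN"        = "WHITE",
  "ASIAN"            = "ASIAN",
  "AFRICAN AMERICAN" = "BLACK OR AFRICAN AMERICAN"
)

# Indexing by a name that is not in the lookup returns NA
race <- unname(race_lookup[toupper(raw_ethnic)])

# Collect the raw values that had no rule before defaulting them
unmapped <- unique(raw_ethnic[is.na(race)])
race[is.na(race)] <- "UNKNOWN"

print(race)
cat("Unmapped raw values:", unmapped, "\n")
```

Reviewing the unmapped list with data management is usually safer than silently accepting the default.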

2.5 Step 4: Build the LB Domain

Laboratory (LB) is one of the most complex SDTM domains because subjects have many lab results at many timepoints.

2.5.1 The mapping rules for LB

SDTM Variable	Source	Rule
`STUDYID`	Hardcoded	“GLPX-001”
`DOMAIN`	Hardcoded	“LB”
`USUBJID`	Constructed	Same rule as DM
`LBSEQ`	Derived	Sequence number within subject
`LBTESTCD`	test_name	HbA1c → HBA1C, Glucose → GLUC
`LBTEST`	test_name	Full label
`LBORRES`	result	As character string (original result)
`LBORRESU`	unit	Original unit
`LBSTRESC`	result	Standardised result (character)
`LBSTRESN`	result	Standardised result (numeric)
`LBSTRESU`	unit	Standardised unit
`LBDTC`	lab_date	ISO 8601
`VISITNUM`	visit_num	As-is
`VISIT`	visit_name	As-is

LBORRES vs LBSTRESC vs LBSTRESN

LBORRES — the result exactly as reported (character). Could be “8.2” or “>10” or “POSITIVE”
LBSTRESC — the result after standardisation (character). Units converted to a standard
LBSTRESN — the numeric version of LBSTRESC (only populated for numeric results)

In our simple trial all units are already standard so they will be the same. In real trials they often differ (e.g., glucose reported in mg/dL at some sites, mmol/L at others — standardise to one unit in LBSTRESC/LBSTRESN).
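To illustrate the mixed-unit case, here is a minimal base R sketch. The `labs` data frame is hypothetical, the conversion factor is approximate (roughly 18 mg/dL per mmol/L for glucose), and rounding to one decimal place is an assumption; a real study would take both from its mapping specification.

```r
# Hypothetical glucose results reported in mixed units
labs <- data.frame(
  LBORRES  = c("9.1", "164", "10.2"),
  LBORRESU = c("mmol/L", "mg/dL", "mmol/L"),
  stringsAsFactors = FALSE
)

# Approximate factor (assumption): about 18 mg/dL per 1 mmol/L of glucose
MGDL_PER_MMOLL <- 18.016

# Convert only the mg/dL rows; leave mmol/L rows as reported
orig_num <- as.numeric(labs$LBORRES)
labs$LBSTRESN <- ifelse(labs$LBORRESU == "mg/dL",
                        round(orig_num / MGDL_PER_MMOLL, 1),
                        orig_num)
labs$LBSTRESC <- as.character(labs$LBSTRESN)
labs$LBSTRESU <- "mmol/L"

# LBORRES and LBORRESU are untouched, so the original report is preserved
print(labs)
```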

2.5.2 Build LB in SAS

/*=============================================================
  GLPX-001 | Module 2 | Build SDTM LB domain
=============================================================*/

data sdtm.lb_pre;
  set raw.labs;

  STUDYID = "&studyid";
  DOMAIN  = "LB";
  USUBJID = cats(STUDYID, "-", site, "-", pt_id);

  /* Test codes and labels */
  select (upcase(test_name));
    when ("HBA1C")   do;
      LBTESTCD = "HBA1C";
      LBTEST   = "Hemoglobin A1C";
    end;
    when ("GLUCOSE")  do;
      LBTESTCD = "GLUC";
      LBTEST   = "Glucose";
    end;
    otherwise do;
      LBTESTCD = upcase(test_name);
      LBTEST   = test_name;
    end;
  end;

  /* Original result and unit */
  LBORRES  = strip(put(result, best12.));  /* Numeric to character, without leading blanks */
  LBORRESU = unit;

  /* Standardised result (same as original here — units already standard) */
  LBSTRESC = LBORRES;
  LBSTRESN = result;
  LBSTRESU = unit;

  /* Date */
  _lab_date = input(lab_date, date9.);
  LBDTC = put(_lab_date, yymmdd10.);

  /* Visit */
  VISITNUM = visit_num;
  VISIT    = visit_name;

  keep STUDYID DOMAIN USUBJID LBTESTCD LBTEST
       LBORRES LBORRESU LBSTRESC LBSTRESN LBSTRESU
       LBDTC VISITNUM VISIT;
run;

/* Add sequence number within subject */
proc sort data=sdtm.lb_pre;
  by USUBJID LBDTC LBTESTCD;
run;

data sdtm.lb;
  set sdtm.lb_pre;
  by USUBJID;
  if first.USUBJID then LBSEQ = 0;
  LBSEQ + 1;
run;

proc print data=sdtm.lb (obs=8);
  title "SDTM LB Domain — First 8 Records";
run;

2.5.3 Build LB in R

# Test code mapping
map_testcd <- function(test_name) {
  case_when(
    str_to_upper(test_name) == "HBA1C"   ~ "HBA1C",
    str_to_upper(test_name) == "GLUCOSE" ~ "GLUC",
    TRUE                                  ~ str_to_upper(test_name)
  )
}

map_test <- function(test_name) {
  case_when(
    str_to_upper(test_name) == "HBA1C"   ~ "Hemoglobin A1C",
    str_to_upper(test_name) == "GLUCOSE" ~ "Glucose",
    TRUE                                  ~ test_name
  )
}

# Build LB domain
sdtm_lb <- raw_labs |>
  mutate(
    STUDYID  = "GLPX-001",
    DOMAIN   = "LB",
    USUBJID  = paste(STUDYID, site, pt_id, sep = "-"),
    LBTESTCD = map_testcd(test_name),
    LBTEST   = map_test(test_name),
    LBORRES  = as.character(result),   # Original result as character
    LBORRESU = unit,
    LBSTRESC = as.character(result),   # Standardised (same here)
    LBSTRESN = result,                 # Numeric standardised
    LBSTRESU = unit,
    LBDTC    = to_iso8601(parse_sas_date(lab_date)),
    VISITNUM = visit_num,
    VISIT    = visit_name
  ) |>
  # Sort and add sequence number within subject
  arrange(USUBJID, LBDTC, LBTESTCD) |>
  group_by(USUBJID) |>
  mutate(LBSEQ = row_number()) |>
  ungroup() |>
  select(STUDYID, DOMAIN, USUBJID, LBSEQ,
         LBTESTCD, LBTEST,
         LBORRES, LBORRESU, LBSTRESC, LBSTRESN, LBSTRESU,
         LBDTC, VISITNUM, VISIT)

print(sdtm_lb, n = 8)

2.5.4 What the LB domain looks like

USUBJID	LBSEQ	LBTESTCD	LBORRES	LBSTRESU	LBDTC	VISIT
GLPX-001-001-0001	1	GLUC	9.1	mmol/L	2024-01-15	Screening
GLPX-001-001-0001	2	HBA1C	8.2	%	2024-01-15	Screening
GLPX-001-001-0001	3	HBA1C	7.6	%	2024-04-15	Week 13
GLPX-001-001-0001	4	HBA1C	7.1	%	2024-07-15	Week 26
GLPX-001-001-0001	5	HBA1C	6.8	%	2025-01-15	Week 52

2.6 Step 5: Validate Your SDTM

After building SDTM datasets you must check them. Regulators run their own checks — so you run yours first. Key things to verify:

2.6.1 Checks to always run

1. Is USUBJID consistent across domains?

Every subject in LB must also be in DM. No orphan records.

/* SAS: Check all LB subjects exist in DM */
proc sql;
  select distinct lb.USUBJID
  from sdtm.lb as lb
  left join sdtm.dm as dm on lb.USUBJID = dm.USUBJID
  where dm.USUBJID is null;
quit;
/* Should return 0 rows */

# R: Check all LB subjects exist in DM
orphan_subjects <- sdtm_lb |>
  anti_join(sdtm_dm, by = "USUBJID") |>
  distinct(USUBJID)

if (nrow(orphan_subjects) == 0) {
  cat("✅ All LB subjects found in DM\n")
} else {
  cat("❌ Orphan subjects in LB:\n")
  print(orphan_subjects)
}

2. Are dates in ISO 8601 format?

/* SAS: Check date format */
data _null_;
  set sdtm.lb;
  if not prxmatch('/^\d{4}-\d{2}-\d{2}$/', LBDTC) then
    put "BAD DATE: " USUBJID= LBDTC=;
run;

# R: Check date format
bad_dates <- sdtm_lb |>
  filter(!str_detect(LBDTC, "^\\d{4}-\\d{2}-\\d{2}$"))

if (nrow(bad_dates) == 0) {
  cat("✅ All dates in ISO 8601 format\n")
} else {
  cat("❌ Bad dates found:\n")
  print(bad_dates |> select(USUBJID, LBDTC))
}
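A pattern check confirms the shape of the string but not that the date exists: 2024-02-30 would pass. Below is a minimal base R sketch of a stricter check. `is_valid_iso_date()` is a hypothetical helper, and it deliberately treats partial dates and datetimes (which ISO 8601 also allows) as failures, so it only suits variables expected to hold complete dates.

```r
# Hypothetical helper: TRUE only for complete, real calendar dates in YYYY-MM-DD
is_valid_iso_date <- function(x) {
  shape_ok <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)
  # as.Date() returns NA when the string does not parse as a real date
  parsed <- as.Date(x, format = "%Y-%m-%d")
  shape_ok & !is.na(parsed)
}

dates  <- c("2024-01-15", "2024-02-30", "15Jan2024", "2024-02-29")
result <- is_valid_iso_date(dates)
print(data.frame(dates, result))
```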

3. Are controlled terminology values valid?

# R: Check SEX values
valid_sex <- c("M", "F", "U", "UNDIFFERENTIATED")
bad_sex <- sdtm_dm |>
  filter(!SEX %in% valid_sex)

if (nrow(bad_sex) == 0) {
  cat("✅ All SEX values valid\n")
} else {
  cat("❌ Invalid SEX values:\n")
  print(bad_sex |> select(USUBJID, SEX))
}

# Check RACE values
valid_race <- c("WHITE", "BLACK OR AFRICAN AMERICAN", "ASIAN",
                "AMERICAN INDIAN OR ALASKA NATIVE",
                "NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER",
                "MULTIPLE", "UNKNOWN", "NOT REPORTED")
bad_race <- sdtm_dm |>
  filter(!RACE %in% valid_race)

if (nrow(bad_race) == 0) {
  cat("✅ All RACE values valid\n")
} else {
  cat("❌ Invalid RACE values:\n")
  print(bad_race |> select(USUBJID, RACE))
}

4. Is LBSEQ unique within subject?

duplicates <- sdtm_lb |>
  group_by(USUBJID, LBSEQ) |>
  filter(n() > 1)

if (nrow(duplicates) == 0) {
  cat("✅ LBSEQ is unique within subject\n")
} else {
  cat("❌ Duplicate LBSEQ values found\n")
  print(duplicates)
}
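A related uniqueness check worth considering: DM is expected to hold exactly one record per subject, so a duplicated USUBJID there usually signals a problem in the raw data or the mapping. A minimal base R sketch, using a hypothetical `dm` extract with a deliberate duplicate:

```r
# Hypothetical DM extract with an accidental duplicate subject
dm <- data.frame(
  USUBJID = c("GLPX-001-001-0001", "GLPX-001-001-0002", "GLPX-001-001-0001"),
  SEX     = c("M", "F", "M"),
  stringsAsFactors = FALSE
)

# Count records per subject and keep the subjects that appear more than once
counts <- table(dm$USUBJID)
dup_subjects <- names(counts[counts > 1])

if (length(dup_subjects) == 0) {
  cat("DM has one record per subject\n")
} else {
  cat("Subjects with more than one DM record:", dup_subjects, "\n")
}
```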

3 Exercise 2.1 — Build the DM domain

Guided Exercise: Fill in the blanks

In R — fill in the blanks:

sdtm_dm <- raw_demographics |>
  mutate(
    STUDYID = "GLPX-001",
    DOMAIN  = "DM",
    USUBJID = paste(STUDYID, site, pt_id, sep = "-"),
    SEX     = case_when(
      str_to_upper(gender) == "MALE"   ~ "M",
      str_to_upper(gender) == "FEMALE" ~ "F",
      TRUE                             ~ "U"
    ),
    RFSTDTC = to_iso8601(parse_sas_date(______)),
    ARMCD   = case_when(
      trt_grp == "A" ~ "______",
      trt_grp == "B" ~ "DRUG2",
      trt_grp == "P" ~ "PLACEBO"
    )
  )

In SAS — fill in the blanks:

data sdtm.dm;
  set raw.demographics;
  STUDYID = "GLPX-001";
  DOMAIN  = "DM";
  USUBJID = cats(STUDYID, "-", site, "-", pt_id);
  if upcase(gender) = "MALE"   then SEX = "M";
  else if upcase(gender) = "FEMALE" then SEX = "F";
  _rand_date = input(______, date9.);
  RFSTDTC   = put(_rand_date, yymmdd10.);
  if upcase(trt_grp) = "A" then ARMCD = "______";
run;

Solution

R:

RFSTDTC = to_iso8601(parse_sas_date(rand_date)),
ARMCD   = case_when(
  trt_grp == "A" ~ "DRUG1",
  trt_grp == "B" ~ "DRUG2",
  trt_grp == "P" ~ "PLACEBO"
)

SAS:

_rand_date = input(rand_date, date9.);
if upcase(trt_grp) = "A" then ARMCD = "DRUG1";

4 Exercise 2.2 — Build the LB domain

Guided Exercise: Fill in the blanks

In R — fill in the blanks:

sdtm_lb <- raw_labs |>
  mutate(
    STUDYID  = "GLPX-001",
    DOMAIN   = "LB",
    USUBJID  = paste(STUDYID, site, pt_id, sep = "-"),
    LBTESTCD = case_when(
      str_to_upper(test_name) == "HBA1C"   ~ "______",
      str_to_upper(test_name) == "GLUCOSE" ~ "GLUC",
      TRUE                                  ~ str_to_upper(test_name)
    ),
    LBORRES  = as.character(result),
    LBSTRESN = result,
    LBDTC    = to_iso8601(parse_sas_date(lab_date))
  ) |>
  arrange(USUBJID, LBDTC, LBTESTCD) |>
  group_by(USUBJID) |>
  mutate(LBSEQ = row_number()) |>
  ungroup()

In SAS — fill in the blanks:

data sdtm.lb_pre;
  set raw.labs;
  STUDYID = "&studyid";
  DOMAIN  = "LB";
  USUBJID = cats(STUDYID, "-", site, "-", pt_id);
  select (upcase(test_name));
    when ("HBA1C") do;
      LBTESTCD = "______";
      LBTEST   = "Hemoglobin A1C";
    end;
    when ("GLUCOSE") do;
      LBTESTCD = "GLUC";
      LBTEST   = "Glucose";
    end;
    otherwise;  /* Without this, an unexpected test name stops the step with an error */
  end;
run;

Solution

R:

LBTESTCD = case_when(
  str_to_upper(test_name) == "HBA1C"   ~ "HBA1C",
  str_to_upper(test_name) == "GLUCOSE" ~ "GLUC",
  TRUE                                  ~ str_to_upper(test_name)
)

SAS:

when ("HBA1C") do;
  LBTESTCD = "HBA1C";
  LBTEST   = "Hemoglobin A1C";
end;

4.1 Step 6: Save Your SDTM Datasets

4.1.1 Save in SAS (as .xpt transport files for submission)

/* Save as SAS transport format (.xpt) — required for FDA submission */
libname xptout xport "/path/to/sdtm/dm.xpt";
proc copy in=sdtm out=xptout;
  select dm;
run;

libname xptout xport "/path/to/sdtm/lb.xpt";
proc copy in=sdtm out=xptout;
  select lb;
run;

4.1.2 Save in R (as .xpt using haven)

library(haven)

# Save as SAS transport format (v5 xpt — FDA requirement)
write_xpt(sdtm_dm, "data/sdtm/dm.xpt", version = 5, name = "DM")
write_xpt(sdtm_lb, "data/sdtm/lb.xpt", version = 5, name = "LB")

cat("SDTM datasets saved to data/sdtm/\n")

Why .xpt format?

The FDA requires submission datasets in SAS Version 5 transport format (.xpt). This is an old but stable format that most statistical software can read. The haven package in R writes this format with write_xpt().
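Version 5 transport files carry some old limits, notably variable names of at most 8 characters and character values of at most 200 bytes. The following is a minimal base R sketch of a pre-export check. `check_xpt_v5()` is a hypothetical helper, not part of haven, and it counts characters rather than bytes, which is only equivalent for plain ASCII data.

```r
# Hypothetical helper: report columns that would break common v5 transport limits
check_xpt_v5 <- function(df) {
  nm <- names(df)
  # Names must be at most 8 characters: letters, digits, underscore, no leading digit
  bad_name <- nm[nchar(nm) > 8 | !grepl("^[A-Za-z_][A-Za-z0-9_]*$", nm)]
  # Longest value in each character column
  is_chr  <- vapply(df, is.character, logical(1))
  max_len <- vapply(df[is_chr],
                    function(col) max(nchar(col), 0L, na.rm = TRUE),
                    integer(1))
  too_long <- names(max_len)[max_len > 200]
  list(bad_name = bad_name, too_long = too_long)
}

# Made-up data with one over-long name and one over-long value
toy <- data.frame(
  USUBJID         = "GLPX-001-001-0001",
  TREATMENT_GROUP = "A",                 # 15-character name
  COMMENT         = strrep("x", 250),    # 250-character value
  stringsAsFactors = FALSE
)

issues <- check_xpt_v5(toy)
print(issues)
```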

4.2 The Study Data Reviewer’s Guide (SDRG)

When you submit SDTM to the FDA, you also submit a Study Data Reviewer’s Guide (SDRG, often called the cSDRG for clinical data). It is usually drafted in Word and submitted as a PDF, and it explains:

What datasets are included and why
Any deviations from the SDTM Implementation Guide
Custom domains (if any)
Key decisions made during mapping
Where to find important variables

The SDRG is often the first document an FDA reviewer reads before looking at the data. A clear SDRG makes review faster and reduces questions from regulators.

Example SDRG entry for our DM domain:

DM domain: The ARMCD variable was derived from the raw EDC field trt_grp using the mapping A=DRUG1, B=DRUG2, P=PLACEBO. Subjects with trt_grp = “P” received matching placebo. RFSTDTC was derived from the randomisation date. Hispanic subjects were coded as RACE=WHITE and ETHNIC=HISPANIC OR LATINO per CDISC controlled terminology.

4.3 Module 2 Summary

Key takeaways

Raw EDC data is messy — variable names, formats, and values must all be mapped
USUBJID is constructed from STUDYID + site + subject ID — always globally unique
Controlled terminology means you must use the exact CDISC-approved values
Dates are always ISO 8601 in SDTM (YYYY-MM-DD for a complete date; datetimes and partial dates are also valid ISO 8601)
LBORRES = original result as character; LBSTRESN = standardised numeric
Always validate your SDTM — check USUBJID consistency, date formats, CT values
Submit datasets as .xpt (SAS Version 5 transport format)
The SDRG explains your mapping decisions to FDA reviewers

4.4 Your Tasks Before Module 3

✅ Checklist

Run the raw data creation code (SAS or R) — confirm it works
Run the DM mapping code — inspect the output
Run the LB mapping code — inspect the output
Run all 4 validation checks — confirm they all pass
Can you answer: why is LBORRES a character variable?
Can you answer: what would you do if glucose was reported in mg/dL at some sites and mmol/L at others?

Answers:

LBORRES is character because original results can be non-numeric: “POSITIVE”, “>10.0”, “< 0.5”. Making it character preserves exactly what was reported.

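For the unit conversion: you would standardise to one unit (e.g., mmol/L) in LBSTRESC, LBSTRESN, and LBSTRESU, while keeping the original values in LBORRES and LBORRESU. The conversion for glucose is approximately mg/dL ÷ 18.016 = mmol/L, which follows from the molar mass of glucose (about 180.16 g/mol); many references round the factor to 18.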
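The first answer can be sketched in base R as well. This example assumes a hypothetical vector of original results and leaves LBSTRESN missing whenever the result is not a plain number; how values such as ">10.0" are represented in LBSTRESC is a study-level decision that the sketch does not try to settle.

```r
# Hypothetical original results as they might arrive from a lab
LBORRES <- c("8.2", ">10.0", "POSITIVE", "< 0.5", "7")

# TRUE only for plain non-negative numbers such as "8.2" or "7"
is_numeric_result <- grepl("^[0-9]+(\\.[0-9]+)?$", trimws(LBORRES))

# Numeric results are converted; everything else stays missing
LBSTRESN <- ifelse(is_numeric_result,
                   suppressWarnings(as.numeric(LBORRES)),
                   NA_real_)

print(data.frame(LBORRES, LBSTRESN))
```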

4.5 What’s Next

In Module 3 we build ADaM datasets from the SDTM we just created. We will build ADSL (one row per subject, all baseline variables) and ADLB (with baseline, change from baseline, and analysis flags). This is where the real analytical work begins.

This course is open source and free forever. Found an error or want to contribute? Open an issue or pull request on GitHub.