Calculate SDS

Scope

Standardize selected variables as SDS (z-scores) using supplied reference means and SDs. Supports row-retention policies, extreme SDS handling, and an optional diagnostic list output.

Derivation and intended use

calc_sds() applies the standard z-score equation directly: SDS = (x - mean) / sd.
The mean and sd values are taken from your supplied ref table; they are not estimated internally by the function.
This makes calc_sds() a general SDS engine suitable for many domains, provided your reference table is appropriate.

When to use

You need z-scores for one or more variables using a chosen reference.
You want explicit control over missing-row handling and extreme SDS treatment.
You may need a diagnostic list summarizing missing and extreme counts.

Requirements checklist

Packages: HealthMarkers, dplyr (for display).
Data columns: variables listed in vars must exist and be numeric/coercible; id_col optional.
Reference table ref: columns variable, mean, sd; one finite mean and positive sd per var.
Row policy: na_strategy = keep (default), omit, or error.
Extreme policy: check_extreme + extreme_strategy (cap/warn/error/NA) with sds_cap threshold.

Choosing reference values (important)

Use reference means/SDs from your own target cohort when available (best practice).
If using external references, choose the closest match by age, sex, ethnicity, assay/platform, and clinical context.
Keep units identical between data and ref; SDS is unit-sensitive if means/SDs are from different scales.
Report the source and version of your reference table in analysis outputs for reproducibility.

Load packages and example data

Build a small example reference for BMI and systolic BP; replace with your variables and reference stats.

library(HealthMarkers)
library(dplyr)

sim_path <- system.file("extdata", "simulated_hm_data.rds", package = "HealthMarkers")
sim <- readRDS(sim_path)
sim_small <- dplyr::slice_head(sim, n = 30) %>%
  dplyr::select(id, BMI, sbp)

vars <- c("BMI", "sbp")
ref <- tibble::tibble(
  variable = vars,
  mean = c(mean(sim_small$BMI, na.rm = TRUE), mean(sim_small$sbp, na.rm = TRUE)),
  sd   = c(sd(sim_small$BMI, na.rm = TRUE),   sd(sim_small$sbp, na.rm = TRUE))
)

Quick start: compute SDS

Defaults keep rows with missing inputs and return NA SDS for those cells.

sds_tbl <- calc_sds(
  data = sim_small,
  vars = vars,
  ref = ref,
  id_col = "id",
  na_strategy = "keep",
  check_extreme = TRUE,
  extreme_strategy = "cap",
  sds_cap = 6,
  verbose = FALSE
)

new_cols <- setdiff(names(sds_tbl), names(sim_small))
head(select(sds_tbl, id, all_of(new_cols)))
#> # A tibble: 6 × 3
#>   id    BMI_sds sbp_sds
#>   <chr>   <dbl>   <dbl>
#> 1 P001    0.696  -1.66 
#> 2 P002    0.279   0.482
#> 3 P003   -0.209  -1.06 
#> 4 P004   -0.354  -0.765
#> 5 P005   -0.734   0.732
#> 6 P006    1.53   -0.915

Arguments that matter

vars: character vector of columns to standardize; must be present and unique.
ref: table with variable/mean/sd; one row per var; sd must be positive.
na_strategy: keep (retain rows, NA SDS), omit (drop rows with any missing vars), error (abort if missing vars).
check_extreme + extreme_strategy: cap/warn/error/NA for |SDS| > sds_cap; cap is applied symmetrically.
return: data (default tibble) or list (adds summary and warning strings).
id_col: optional ID for diagnostics only.

Handling missing inputs

Non-numeric vars are coerced to numeric with warnings; non-finite become NA.
Bypass vs drop vs fail is controlled by na_strategy.

Compare row policies

demo <- sim_small[1:8, ]
demo$BMI[3] <- NA

a_keep <- calc_sds(demo, vars, ref, na_strategy = "keep")
a_omit <- calc_sds(demo, vars, ref, na_strategy = "omit")

list(
  keep_rows = nrow(a_keep),
  omit_rows = nrow(a_omit),
  preview = head(select(a_keep, BMI_sds, sbp_sds))
)
#> $keep_rows
#> [1] 8
#> 
#> $omit_rows
#> [1] 7
#> 
#> $preview
#> # A tibble: 6 × 2
#>   BMI_sds sbp_sds
#>     <dbl>   <dbl>
#> 1   0.696  -1.66 
#> 2   0.279   0.482
#> 3  NA      -1.06 
#> 4  -0.354  -0.765
#> 5  -0.734   0.732
#> 6   1.53   -0.915

Extreme-value handling

Cap, warn, error, or set NA for SDS beyond the chosen limit.

demo2 <- demo
demo2$sbp[5] <- 500  # extreme SBP

a_cap <- calc_sds(
  data = demo2,
  vars = vars,
  ref = ref,
  na_strategy = "keep",
  check_extreme = TRUE,
  extreme_strategy = "cap",
  sds_cap = 4
)

head(select(a_cap, BMI_sds, sbp_sds))
#> # A tibble: 6 × 2
#>   BMI_sds sbp_sds
#>     <dbl>   <dbl>
#> 1   0.696  -1.66 
#> 2   0.279   0.482
#> 3  NA      -1.06 
#> 4  -0.354  -0.765
#> 5  -0.734   4    
#> 6   1.53   -0.915

List output for diagnostics

Request per-variable missing/extreme counts and any warnings.

sds_list <- calc_sds(
  data = sim_small,
  vars = vars,
  ref = ref,
  na_strategy = "keep",
  check_extreme = TRUE,
  extreme_strategy = "warn",
  sds_cap = 5,
  return = "list",
  verbose = FALSE
)

str(sds_list$summary)
#> List of 5
#>  $ rows_in      : int 30
#>  $ rows_out     : int 30
#>  $ omitted_rows : int 0
#>  $ total_extreme: int 0
#>  $ per_var      :'data.frame':   2 obs. of  3 variables:
#>   ..$ variable : chr [1:2] "BMI" "sbp"
#>   ..$ n_missing: int [1:2] 0 0
#>   ..$ n_extreme: int [1:2] 0 0
unique(sds_list$warnings)
#> character(0)

Outputs

Added _sds columns (tibble by default)
Optional list: data, summary (row counts, omitted rows, extremes, per-variable missing/extreme), warnings Rows drop only when na_strategy = “omit” or when na_strategy = “error” aborts.

Pitfalls and tips

ref must align with vars and use the same units; sd must be positive and finite.
Lower sds_cap to limit leverage of outliers in plots; use extreme_strategy = “error” for strict QC.
Provide id_col to make troubleshooting easier when reviewing diagnostics.

Validation ideas

Manual check: if mean BMI is 25 and sd is 4, BMI 29 should yield (29-25)/4 = 1.0.
Confirm that rows with missing BMI are retained vs dropped according to na_strategy.
Verify that capping clamps |SDS| at sds_cap symmetrically.

Verbose diagnostics

Set verbose = TRUE to emit three structured messages per call:

Preparing inputs — start-of-function signal.
Column map — lists each variable passed via vars. Example: calc_sds(): column map: BMI -> 'BMI', sbp -> 'sbp'
Results summary — shows how many rows computed successfully (non-NA) per _sds output column. Example: calc_sds(): results: BMI_sds 28/30, sbp_sds 30/30, ...

verbose = TRUE emits at the "inform" level; you also need options(healthmarkers.verbose = "inform") active:

old_opt <- options(healthmarkers.verbose = "inform")

ref_v <- data.frame(variable = c("BMI", "sbp"), mean = c(25, 120), sd = c(4, 15))
df_v  <- data.frame(id = 1:3, BMI = c(24, 29, 30), sbp = c(118, 121, 135))
calc_sds(
  data = df_v,
  vars = c("BMI", "sbp"),
  ref  = ref_v,
  id_col = "id",
  na_strategy = "keep",
  extreme_strategy = "cap",
  sds_cap = 6,
  verbose = TRUE
)
#> calc_sds: starting on 3 row(s), 2 variable(s) (id: id)
#> calc_sds(): col_map: BMI -> 'BMI', sbp -> 'sbp'
#> Computing SDS for 2 variable(s)...
#> calc_sds: completed
#> calc_sds(): results: BMI_sds 3/3, sbp_sds 3/3
#> # A tibble: 3 × 5
#>      id   BMI   sbp BMI_sds sbp_sds
#>   <int> <dbl> <dbl>   <dbl>   <dbl>
#> 1     1    24   118   -0.25 -0.133 
#> 2     2    29   121    1     0.0667
#> 3     3    30   135    1.25  1

options(old_opt)

Reset with options(healthmarkers.verbose = NULL) or "none".