Skip to contents

Overview

HealthMarkers is designed to work with data from a wide range of international cohort studies and biobanks without any manual column renaming. The internal synonym dictionary covers naming conventions from 15+ major studies. This article explains how the matching works, which biobanks are supported, and how to diagnose and fix cases where a column is not recognised.


How column inference works

Every HealthMarkers function looks up your column names through a five-layer matching pipeline before any computation:

Layer Method Example
1 Exact match in synonym dictionary LBXGLUfasting_glucose
2 Case-insensitive exact match lbxglufasting_glucose
3 Data column name contains a synonym (≥ 4 chars) plasma_glukoosifasting_glucose
4 Synonym contains the data column name (≥ 4 chars) glukosfasting_glucose
5 Fuzzy (Jaro–Winkler) match, only when fuzzy = TRUE glucosfasting_glucose

The first four layers are active by default. Layer 5 is opt-in and is most useful for catching typos in column names.


Supported biobanks and naming systems

UK Biobank (UKB)

UK Biobank fields follow the pattern analyte_0_0 (assessment visit 0, instance 0) and analyte_1_0 (reassessment). HealthMarkers recognises these for all major analytes:

Internal key UKB field UKB field code
fasting_glucose glucose_0_0 30740
total_cholesterol cholesterol_0_0 30690
HDL_c hdl_cholesterol_0_0 30760
LDL_c ldl_direct_0_0 30780
TG triglycerides_0_0 30870
ALT alanine_aminotransferase_0_0 30620
AST aspartate_aminotransferase_0_0 30650
creatinine creatinine_0_0 30700
albumin albumin_0_0 30600
HbA1c glycated_haemoglobin_hba1c_0_0 30750
vitaminD vitamin_d_0_0 30890
Hgb haemoglobin_concentration_0_0 30020
WBC white_blood_cell_leucocyte_count_0_0 30000
platelets platelet_count_0_0 30080
SBP systolic_blood_pressure_0_0 4080
DBP diastolic_blood_pressure_0_0 4079
height standing_height_0_0 50
weight body_weight_0_0 21002
BMI body_mass_index_bmi_0_0 21001
waist waist_circumference_0_0 48
testosterone testosterone_0_0 30850
TSH thyroid_stimulating_hormone_tsh_0_0 30830
urea_serum urea_0_0 30670
uric_acid urate_0_0 30880
sodium sodium_0_0 30530
potassium potassium_0_0 30520
calcium calcium_0_0 30680
phosphate phosphate_0_0 30810

NHANES (National Health and Nutrition Examination Survey, USA)

NHANES uses uppercase prefix codes. HealthMarkers recognises both the LBX (examination) and LBD (derived) prefixes:

Internal key NHANES variable Questionnaire
fasting_glucose LBXGLU Biochemistry Profile
total_cholesterol LBXSCH Biochemistry Profile
HDL_c LBDHDD HDL-Cholesterol
LDL_c LBDLDL Cholesterol-LDL
TG LBXSTR Biochemistry Profile
ALT LBXSATSI Biochemistry Profile
AST LBXSASSI Biochemistry Profile
creatinine LBXSCR Biochemistry Profile
albumin LBXSAL Biochemistry Profile
HbA1c LBXGH Glycohemoglobin
vitaminD LBXVD2 Vitamin D
Hgb LBXHGB CBC
WBC LBXWBCSI CBC
platelets LBXPLTSI CBC
ferritin LBXFER Ferritin
CRP LBXHSCRP hs-CRP
SBP BPXSY1 Blood Pressure
DBP BPXDI1 Blood Pressure
u_albumin URXUMA Albumin-Creatinine Urine
u_creatinine URXUCR Albumin-Creatinine Urine
Homocysteine LBXHCY Homocysteine

Diagnosing column matching with hm_col_report()

Before running any computation on a new dataset, call hm_col_report() to see exactly which columns are matched and how:

library(HealthMarkers)

# Load your biobank extract (example: HUNT data with Norwegian column names)
hunt_data <- read.csv("hunt_extract.csv")

# Run the column report
hm_col_report(hunt_data)

The output looks like:

── HealthMarkers column report ────────────────────────────────────────────
 Data: 56301 rows × 187 columns   |   Keys in dictionary: 258

 key                  data_column              how matched
 -------------------- ------------------------ ------------------
 age                  alder                    col contains synonym ✔
 sex                  kjonn                    exact ✔
 height               hoyde                    exact ✔
 fasting_glucose      fastende_blodsukker      col contains synonym ✔
 TG                   triglyserider            col contains synonym ✔
 creatinine           kreatinin                exact ✔
 SBP                  systolisk_blodtrykk      col contains synonym ✔
 eGFR                 ─                        NOT FOUND ✘

 ✔ 142 keys matched   ✘ 116 keys not found

── col_map template for missing keys ──────────────────────────────────────
 col_map <- list(
   eGFR = "from_your_data",  # fill in your column name
 )

For the NOT FOUND keys, either: 1. Add the column to your data (or compute it from raw inputs), 2. Provide an explicit col_map entry, or 3. Ignore it — functions will produce NA for any marker that requires that key.

# Capture the auto-detected map and add manual overrides
cm <- hm_col_report(hunt_data, verbose = FALSE)
cm$eGFR <- "ckd_epi_gfr_ml_min"   # fill in the actual column name

results <- all_health_markers(hunt_data, col_map = cm,
                              which = c("glycemic", "lipid", "renal",
                                        "inflammatory", "liver"))

Auto-derived columns

Even when raw computed inputs are not in your data, HealthMarkers can derive them automatically from more basic measurements before marker computation begins. The following secondary variables are auto-derived:

Derived key Derived from Formula/method
eGFR creatinine, age, sex CKD-EPI 2021
UACR u_albumin, u_creatinine ratio (mg/g)
LDL_c TC, HDL_c, TG Friedewald equation
WHR waist, hip ratio
BMI height, weight kg/m²
MAP SBP, DBP (SBP + 2×DBP) / 3
PP SBP, DBP SBP − DBP
non_HDL_c TC, HDL_c TC − HDL_c
TC_HDL_ratio TC, HDL_c ratio
TG_HDL_ratio TG, HDL_c log₁₀(TG/HDL_c)
creatinine_ratio creatinine, u_creatinine ratio

This means, for example, that kidney_failure_risk() will work even if eGFR is absent from your data, as long as creatinine, age, and sex are present.


Example: running on UK Biobank data

library(HealthMarkers)

# Load UKB extract (columns follow _0_0 naming)
ukb <- readRDS("ukb_pheno.rds")

# Check column matching — most UKB columns are auto-detected
hm_col_report(ukb)

# Run a broad panel — no col_map needed for UKB standard names
results <- all_health_markers(
  data    = ukb,
  which   = c("glycemic", "lipid", "liver", "renal",
              "inflammatory", "obesity_metrics", "vitamin"),
  verbose = TRUE
)

Example: running on NHANES data

# NHANES lab data (LBXGLU, LBXSCH, LBXWBCSI, etc.)
nhanes_lab <- read.csv("nhanes_lab.csv")
nhanes_bp  <- read.csv("nhanes_bp.csv")
nhanes_dem <- read.csv("nhanes_demo.csv")

nhanes <- nhanes_lab |>
  dplyr::left_join(nhanes_bp,  by = "SEQN") |>
  dplyr::left_join(nhanes_dem, by = "SEQN")

results <- all_health_markers(nhanes, which = c("glycemic", "lipid",
                                                 "renal", "inflammatory"),
                               verbose = FALSE)

Example: running on OMOP / All of Us data (LOINC codes)

# Columns named LOINC_2345_7, LOINC_2160_0, LOINC_718_7, etc.
omop_labs <- dplyr::collect(tbl(con, "measurement_wide"))

# All LOINC_XXXX_X columns are recognised automatically
hm_col_report(omop_labs)

results <- all_health_markers(omop_labs,
                              which = c("glycemic", "lipid", "renal",
                                        "inflammatory", "vitamin"))

Adding a custom biobank

If your biobank uses names not already in the dictionary, you have two options:

Option A — explicit col_map (immediate, no code changes):

my_col_map <- list(
  fasting_glucose  = "p_glu_0",
  total_cholesterol = "tot_kol",
  creatinine       = "s_krea",
  age              = "alder_bij_opname",
  sex              = "geslacht_f"
)

results <- all_health_markers(data, col_map = my_col_map,
                              which = c("glycemic", "lipid", "renal"))

Option B — open a GitHub issue / pull request to add permanent recognition of the naming system to the dictionary:

https://github.com/sufyansuleman/HealthMarkers/issues

Please include: biobank name, the analyte/variable, and the column name(s) used.


Summary

Biobank / system Language Key feature
UK Biobank English _0_0 field suffix notation
NHANES English LBX/LBD/BPX/URX prefixes
Danish registers Danish + NPU codes NPU01994 etc.; kreatinin, kolesterol
HUNT / Tromsø Norwegian triglyserider, karbamid, double-k blodtrykk
FinnGen / THL Finnish -iini endings; glukoosi, kreatiniini
Estonian Biobank Estonian kolesterool (double-o), naatrium (≠ natrium)
LifeLines / Rotterdam Dutch ureum (= urea), urinezuur (uric acid)
Generation Scotland English SBP_mean, genetic_sex
All of Us / OMOP LOINC LOINC_XXXX_X format
NAKO / KORA German Cholesterin, Triglyzeride, Harnsäure