Multi-Biobank Compatibility: Automatic Column Recognition
Source: vignettes/articles/multi_biobank.Rmd
Overview
HealthMarkers is designed to work with data from a wide range of international cohort studies and biobanks without any manual column renaming. The internal synonym dictionary covers naming conventions from 15+ major studies. This article explains how the matching works, which biobanks are supported, and how to diagnose and fix cases where a column is not recognised.
How column inference works
Every HealthMarkers function looks up your column names through a five-layer matching pipeline before any computation:
| Layer | Method | Example |
|---|---|---|
| 1 | Exact match in synonym dictionary | `LBXGLU` → `fasting_glucose` |
| 2 | Case-insensitive exact match | `lbxglu` → `fasting_glucose` |
| 3 | Data column name contains a synonym (≥ 4 chars) | `plasma_glukoosi` → `fasting_glucose` |
| 4 | Synonym contains the data column name (≥ 4 chars) | `glukos` → `fasting_glucose` |
| 5 | Fuzzy (Jaro–Winkler) match, only when `fuzzy = TRUE` | `glucos` → `fasting_glucose` |
The first four layers are active by default. Layer 5 is opt-in and is most useful for catching typos in column names.
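The first four layers can be sketched in a few lines of base R. This is an illustrative approximation, not the package's implementation: the `synonyms` list and the `match_column()` helper are hypothetical, and the real dictionary is far larger. Layer 5 is left out because it needs a string-distance library.

```r
# Toy synonym dictionary (hypothetical; the real dictionary is much larger)
synonyms <- list(
  fasting_glucose = c("LBXGLU", "glucose_0_0", "glukoosi", "glukose"),
  creatinine      = c("LBXSCR", "creatinine_0_0", "kreatinin")
)

# Try layers 1-4 in order; return the matched key and layer, or NULL
match_column <- function(col, dict = synonyms, min_chars = 4) {
  col_lc <- tolower(col)
  layers <- list(
    # 1: exact match
    function(syn) col %in% syn,
    # 2: case-insensitive exact match
    function(syn) col_lc %in% tolower(syn),
    # 3: data column name contains a synonym of at least min_chars characters
    function(syn) {
      long <- tolower(syn[nchar(syn) >= min_chars])
      any(vapply(long, grepl, logical(1), x = col_lc, fixed = TRUE))
    },
    # 4: a synonym contains the data column name (at least min_chars characters)
    function(syn) {
      nchar(col) >= min_chars && any(grepl(col_lc, tolower(syn), fixed = TRUE))
    }
  )
  # An earlier layer always wins over a later one, whatever the key
  for (i in seq_along(layers)) {
    for (key in names(dict)) {
      if (layers[[i]](dict[[key]])) return(list(key = key, layer = i))
    }
  }
  NULL
}

match_column("plasma_glukoosi")  # matches fasting_glucose at layer 3
```

Looping over layers on the outside and keys on the inside means a weak substring hit on one key can never shadow an exact hit on another.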
Supported biobanks and naming systems
UK Biobank (UKB)
UK Biobank fields follow the pattern `analyte_0_0` (instance 0, the initial assessment visit; array index 0) and `analyte_1_0` (instance 1, the first repeat assessment). HealthMarkers recognises these for all major analytes:
| Internal key | UKB field | UKB field code |
|---|---|---|
| `fasting_glucose` | `glucose_0_0` | 30740 |
| `total_cholesterol` | `cholesterol_0_0` | 30690 |
| `HDL_c` | `hdl_cholesterol_0_0` | 30760 |
| `LDL_c` | `ldl_direct_0_0` | 30780 |
| `TG` | `triglycerides_0_0` | 30870 |
| `ALT` | `alanine_aminotransferase_0_0` | 30620 |
| `AST` | `aspartate_aminotransferase_0_0` | 30650 |
| `creatinine` | `creatinine_0_0` | 30700 |
| `albumin` | `albumin_0_0` | 30600 |
| `HbA1c` | `glycated_haemoglobin_hba1c_0_0` | 30750 |
| `vitaminD` | `vitamin_d_0_0` | 30890 |
| `Hgb` | `haemoglobin_concentration_0_0` | 30020 |
| `WBC` | `white_blood_cell_leucocyte_count_0_0` | 30000 |
| `platelets` | `platelet_count_0_0` | 30080 |
| `SBP` | `systolic_blood_pressure_0_0` | 4080 |
| `DBP` | `diastolic_blood_pressure_0_0` | 4079 |
| `height` | `standing_height_0_0` | 50 |
| `weight` | `body_weight_0_0` | 21002 |
| `BMI` | `body_mass_index_bmi_0_0` | 21001 |
| `waist` | `waist_circumference_0_0` | 48 |
| `testosterone` | `testosterone_0_0` | 30850 |
| `TSH` | `thyroid_stimulating_hormone_tsh_0_0` | 30830 |
| `urea_serum` | `urea_0_0` | 30670 |
| `uric_acid` | `urate_0_0` | 30880 |
| `sodium` | `sodium_0_0` | 30530 |
| `potassium` | `potassium_0_0` | 30520 |
| `calcium` | `calcium_0_0` | 30680 |
| `phosphate` | `phosphate_0_0` | 30810 |
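When an extract mixes visits, it can help to see which suffixes are present before running anything. The sketch below splits UKB-style names with a regular expression. `parse_ukb_name()` is a hypothetical helper, not part of HealthMarkers, and it assumes the first number is the assessment instance and the second the array index, which is how UK Biobank exports are usually labelled.

```r
# Hypothetical helper: split UKB-style names into analyte, instance and array index
parse_ukb_name <- function(x) {
  # Greedy first group keeps underscores inside the analyte name
  parts <- regmatches(x, regexec("^(.*)_([0-9]+)_([0-9]+)$", x))
  # Names without the two-number suffix yield NA in every parsed field
  pick <- function(i) {
    vapply(parts, function(p) if (length(p) == 4) p[i] else NA_character_,
           character(1))
  }
  data.frame(
    column   = x,
    analyte  = pick(2),
    instance = as.integer(pick(3)),
    array    = as.integer(pick(4)),
    stringsAsFactors = FALSE
  )
}

parse_ukb_name(c("glucose_0_0", "hdl_cholesterol_1_0", "age"))
```

Columns that come back with `NA` in `instance` do not follow the suffix convention and are the ones most likely to need a manual `col_map` entry.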
NHANES (National Health and Nutrition Examination Survey, USA)
NHANES uses uppercase prefix codes. HealthMarkers recognises both the `LBX` (examination) and `LBD` (derived) prefixes:
| Internal key | NHANES variable | Data file |
|---|---|---|
| `fasting_glucose` | `LBXGLU` | Biochemistry Profile |
| `total_cholesterol` | `LBXSCH` | Biochemistry Profile |
| `HDL_c` | `LBDHDD` | HDL-Cholesterol |
| `LDL_c` | `LBDLDL` | Cholesterol-LDL |
| `TG` | `LBXSTR` | Biochemistry Profile |
| `ALT` | `LBXSATSI` | Biochemistry Profile |
| `AST` | `LBXSASSI` | Biochemistry Profile |
| `creatinine` | `LBXSCR` | Biochemistry Profile |
| `albumin` | `LBXSAL` | Biochemistry Profile |
| `HbA1c` | `LBXGH` | Glycohemoglobin |
| `vitaminD` | `LBXVD2` | Vitamin D |
| `Hgb` | `LBXHGB` | CBC |
| `WBC` | `LBXWBCSI` | CBC |
| `platelets` | `LBXPLTSI` | CBC |
| `ferritin` | `LBXFER` | Ferritin |
| `CRP` | `LBXHSCRP` | hs-CRP |
| `SBP` | `BPXSY1` | Blood Pressure |
| `DBP` | `BPXDI1` | Blood Pressure |
| `u_albumin` | `URXUMA` | Albumin-Creatinine Urine |
| `u_creatinine` | `URXUCR` | Albumin-Creatinine Urine |
| `Homocysteine` | `LBXHCY` | Homocysteine |
Diagnosing column matching with hm_col_report()
Before running any computation on a new dataset, call `hm_col_report()` to see exactly which columns are matched and how:
library(HealthMarkers)
# Load your biobank extract (example: HUNT data with Norwegian column names)
hunt_data <- read.csv("hunt_extract.csv")
# Run the column report
hm_col_report(hunt_data)
The output looks like:
── HealthMarkers column report ────────────────────────────────────────────
Data: 56301 rows × 187 columns | Keys in dictionary: 258
key data_column how matched
-------------------- ------------------------ ------------------
age alder col contains synonym ✔
sex kjonn exact ✔
height hoyde exact ✔
fasting_glucose fastende_blodsukker col contains synonym ✔
TG triglyserider col contains synonym ✔
creatinine kreatinin exact ✔
SBP systolisk_blodtrykk col contains synonym ✔
eGFR ─ NOT FOUND ✘
✔ 142 keys matched ✘ 116 keys not found
── col_map template for missing keys ──────────────────────────────────────
col_map <- list(
eGFR = "from_your_data", # fill in your column name
)
For the NOT FOUND keys, either:
1. Add the column to your data (or compute it from raw inputs),
2. Provide an explicit `col_map` entry, or
3. Ignore it: functions will produce `NA` for any marker that requires that key.
# Capture the auto-detected map and add manual overrides
cm <- hm_col_report(hunt_data, verbose = FALSE)
cm$eGFR <- "ckd_epi_gfr_ml_min" # fill in the actual column name
results <- all_health_markers(hunt_data, col_map = cm,
                              which = c("glycemic", "lipid", "renal",
                                        "inflammatory", "liver"))
Auto-derived columns
Even when a computed input such as eGFR is missing from your data, HealthMarkers can derive it automatically from more basic measurements before marker computation begins. The following secondary variables are auto-derived:
| Derived key | Derived from | Formula/method |
|---|---|---|
| `eGFR` | `creatinine`, `age`, `sex` | CKD-EPI 2021 |
| `UACR` | `u_albumin`, `u_creatinine` | ratio (mg/g) |
| `LDL_c` | `TC`, `HDL_c`, `TG` | Friedewald equation |
| `WHR` | `waist`, `hip` | ratio |
| `BMI` | `height`, `weight` | kg/m² |
| `MAP` | `SBP`, `DBP` | (SBP + 2×DBP) / 3 |
| `PP` | `SBP`, `DBP` | SBP − DBP |
| `non_HDL_c` | `TC`, `HDL_c` | TC − HDL_c |
| `TC_HDL_ratio` | `TC`, `HDL_c` | ratio |
| `TG_HDL_ratio` | `TG`, `HDL_c` | log₁₀(TG/HDL_c) |
| `creatinine_ratio` | `creatinine`, `u_creatinine` | ratio |
This means, for example, that `kidney_failure_risk()` will work even if `eGFR` is absent from your data, as long as `creatinine`, `age`, and `sex` are present.
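To make the derivations concrete, here is a minimal base R sketch of several formulas from the table. The function names and the units (kg, cm, mmHg, mg/dL) are assumptions made for illustration; the package's own helpers, unit handling, and edge-case rules may differ.

```r
# Illustrative formulas only; function names and units are assumptions
bmi        <- function(weight_kg, height_cm) weight_kg / (height_cm / 100)^2
map_bp     <- function(sbp, dbp) (sbp + 2 * dbp) / 3
pulse_pres <- function(sbp, dbp) sbp - dbp
non_hdl    <- function(tc, hdl) tc - hdl

# Friedewald LDL-C in mg/dL; returns NA when TG >= 400 mg/dL,
# where the equation is not considered valid
ldl_friedewald <- function(tc, hdl, tg) {
  ifelse(tg < 400, tc - hdl - tg / 5, NA_real_)
}

# CKD-EPI 2021 creatinine equation
# scr: serum creatinine in mg/dL; age in years; female: logical
egfr_ckdepi_2021 <- function(scr, age, female) {
  kappa <- ifelse(female, 0.7, 0.9)
  alpha <- ifelse(female, -0.241, -0.302)
  142 * pmin(scr / kappa, 1)^alpha * pmax(scr / kappa, 1)^(-1.200) *
    0.9938^age * ifelse(female, 1.012, 1)
}

map_bp(120, 80)               # (120 + 2 * 80) / 3
ldl_friedewald(200, 50, 150)  # 200 - 50 - 150 / 5
```

All functions are vectorised, so they can be applied directly to data frame columns. Creatinine reported in µmol/L would need converting to mg/dL before the eGFR call.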
Example: running on UK Biobank data
library(HealthMarkers)
# Load UKB extract (columns follow _0_0 naming)
ukb <- readRDS("ukb_pheno.rds")
# Check column matching — most UKB columns are auto-detected
hm_col_report(ukb)
# Run a broad panel — no col_map needed for UKB standard names
results <- all_health_markers(
  data = ukb,
  which = c("glycemic", "lipid", "liver", "renal",
            "inflammatory", "obesity_metrics", "vitamin"),
  verbose = TRUE
)
Example: running on NHANES data
# NHANES lab data (LBXGLU, LBXSCH, LBXWBCSI, etc.)
nhanes_lab <- read.csv("nhanes_lab.csv")
nhanes_bp <- read.csv("nhanes_bp.csv")
nhanes_dem <- read.csv("nhanes_demo.csv")
# Join the component files on the respondent sequence number
nhanes <- nhanes_lab |>
  dplyr::left_join(nhanes_bp, by = "SEQN") |>
  dplyr::left_join(nhanes_dem, by = "SEQN")
results <- all_health_markers(nhanes,
                              which = c("glycemic", "lipid",
                                        "renal", "inflammatory"),
                              verbose = FALSE)
Example: running on OMOP / All of Us data (LOINC codes)
# Columns named LOINC_2345_7, LOINC_2160_0, LOINC_718_7, etc.
# `con` is assumed to be an open DBI connection to the OMOP database
omop_labs <- dplyr::collect(dplyr::tbl(con, "measurement_wide"))
# All LOINC_XXXX_X columns are recognised automatically
hm_col_report(omop_labs)
results <- all_health_markers(omop_labs,
                              which = c("glycemic", "lipid", "renal",
                                        "inflammatory", "vitamin"))
Adding a custom biobank
If your biobank uses names not already in the dictionary, you have two options:
Option A: supply an explicit `col_map` (immediate, no code changes).
my_col_map <- list(
  fasting_glucose   = "p_glu_0",
  total_cholesterol = "tot_kol",
  creatinine        = "s_krea",
  age               = "alder_bij_opname",
  sex               = "geslacht_f"
)
results <- all_health_markers(data, col_map = my_col_map,
                              which = c("glycemic", "lipid", "renal"))
Option B: open a GitHub issue or pull request to add permanent recognition of the naming system to the dictionary.
Please include: biobank name, the analyte/variable, and the column name(s) used.
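A hand-written `col_map` is easy to get wrong by a single character, so it can be worth confirming that every entry points at a real column before running a panel. The check below is a hedged base R sketch. `check_col_map()` is a hypothetical helper, not a HealthMarkers function, and the example data frame is made up.

```r
# Hypothetical helper: list col_map keys whose target column is absent from the data
check_col_map <- function(data, col_map) {
  cols   <- unlist(col_map)               # named character vector: key -> column
  absent <- cols[!cols %in% names(data)]
  if (length(absent) > 0) {
    warning("col_map points at columns not in data: ",
            paste(sprintf("%s -> %s", names(absent), absent), collapse = ", "))
  }
  invisible(names(absent))
}

# Made-up example: the age column is misspelled in the map
toy_data <- data.frame(p_glu_0 = 5.4, s_krea = 78, alder_bij_opname = 61)
toy_map  <- list(fasting_glucose = "p_glu_0",
                 creatinine      = "s_krea",
                 age             = "alder_bij_opnam")
bad_keys <- suppressWarnings(check_col_map(toy_data, toy_map))
```

Returning the offending keys invisibly lets the same helper work both interactively, where the warning is enough, and in scripts that want to stop when `length(bad_keys) > 0`.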
Summary
| Biobank / system | Language | Key feature |
|---|---|---|
| UK Biobank | English | `_0_0` field suffix notation |
| NHANES | English | `LBX`/`LBD`/`BPX`/`URX` prefixes |
| Danish registers | Danish + NPU codes | `NPU01994` etc.; `kreatinin`, `kolesterol` |
| HUNT / Tromsø | Norwegian | `triglyserider`, `karbamid`, double-k `blodtrykk` |
| FinnGen / THL | Finnish | -iini endings; `glukoosi`, `kreatiniini` |
| Estonian Biobank | Estonian | `kolesterool` (double-o), `naatrium` (≠ natrium) |
| LifeLines / Rotterdam | Dutch | `ureum` (= urea), `urinezuur` (uric acid) |
| Generation Scotland | English | `SBP_mean`, `genetic_sex` |
| All of Us / OMOP | LOINC | `LOINC_XXXX_X` format |
| NAKO / KORA | German | `Cholesterin`, `Triglyzeride`, `Harnsäure` |