Impute missing values in a data.frame or tibble (simple, column-wise)
Source: R/impute_missing.R
Performs deterministic, per-column imputation for numeric variables:
- "mean": replace NAs with the column mean
- "median": replace NAs with the column median
- "zero": replace NAs with 0
- "constant": replace NAs with the single value given in constant
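The four methods can be illustrated on a single numeric vector. The following is a minimal base-R sketch, not the package's implementation; `fill_value()` is a hypothetical helper introduced here only for illustration.

```r
# Hypothetical helper (not part of the package): returns the replacement
# value for one numeric vector under each documented method.
fill_value <- function(x, method, constant = NULL) {
  switch(
    method,
    mean     = mean(x, na.rm = TRUE),
    median   = stats::median(x, na.rm = TRUE),
    zero     = 0,
    constant = constant,
    stop("unknown method: ", method)  # default branch for anything else
  )
}

x <- c(1, NA, 3)
# Only the NA positions are overwritten; observed values are kept.
x[is.na(x)] <- fill_value(x, "mean")
x
#> [1] 1 2 3
```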
Usage
impute_missing(
  data,
  method = c("mean", "median", "zero", "constant"),
  cols = NULL,
  constant = NULL,
  na_warn_prop = 0.2,
  verbose = FALSE
)
Arguments
- data
A data.frame or tibble containing missing values.
- method
Character; one of c("mean","median","zero","constant").
- cols
Optional character vector of column names to impute. Defaults to all numeric columns in
data that contain at least one NA.
- constant
Numeric; single value to use when
method = "constant".
- na_warn_prop
Numeric in [0, 1]; threshold for high-missingness warnings per column. Default 0.2.
- verbose
Logical; if TRUE, prints progress and a completion summary. Default FALSE.
Value
A data.frame/tibble of the same dimensions as data, with the specified
columns' missing values imputed.
Details
Non-numeric columns are left untouched. If cols = NULL, all numeric columns
that have at least one NA are selected automatically. NA positions are the only
values modified; non-NA entries are preserved as-is.
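The default column selection described above (numeric columns with at least one NA) could be reproduced in base R roughly as follows. This is a sketch under that reading of the text, not the function's actual source; `df` and `default_cols` are illustrative names.

```r
# Illustrative data: 'c' is non-numeric, 'd' is numeric but complete.
df <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, 2),
  c = letters[1:3],
  d = c(4, 5, 6)
)

is_num <- vapply(df, is.numeric, logical(1))  # numeric columns only
has_na <- vapply(df, anyNA, logical(1))       # columns with at least one NA

default_cols <- names(df)[is_num & has_na]
default_cols
#> [1] "a" "b"
```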
Quality checks:
- Warns for high-missingness columns (proportion of NAs >= na_warn_prop).
- Warns and skips imputation when a column has no non-NA values (mean/median undefined).
- Imputes only numeric columns; non-numerics in cols are skipped with a warning.
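As a rough illustration of the checks listed above, the sketch below applies them to one column at a time. `check_column()` is a hypothetical helper; the order of the checks and the wording of the two "skipping" warnings are assumptions, and only the high-missingness message is modelled on the output shown in the examples.

```r
# Hypothetical helper (not part of the package). Returns TRUE when the
# column should be imputed and FALSE when it should be skipped.
check_column <- function(x, name, na_warn_prop = 0.2) {
  if (!is.numeric(x)) {
    warning("column '", name, "' is not numeric; skipping.")
    return(FALSE)
  }
  if (all(is.na(x))) {
    # No observed values, so a mean or median cannot be computed.
    warning("column '", name, "' has no non-NA values; skipping.")
    return(FALSE)
  }
  prop_na <- mean(is.na(x))  # share of missing entries in this column
  if (prop_na >= na_warn_prop) {
    warning(sprintf("column '%s' has high missingness (%.1f%%).",
                    name, 100 * prop_na))
  }
  TRUE
}

check_column(c(1, NA, 3), "a")
```

With the default threshold of 0.2, the last call warns about column 'a' (one of three values missing) and still returns TRUE, since high missingness triggers a warning but does not prevent imputation.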
Examples
df <- tibble::tibble(a = c(1, NA, 3), b = c(NA, NA, 2), c = letters[1:3])
impute_missing(df, method = "mean")
#> Warning: impute_missing(): column 'a' has high missingness (33.3%).
#> Warning: impute_missing(): column 'b' has high missingness (66.7%).
#> # A tibble: 3 × 3
#>       a     b c
#>   <dbl> <dbl> <chr>
#> 1     1     2 a
#> 2     2     2 b
#> 3     3     2 c
impute_missing(df, method = "median", verbose = TRUE)
#> impute_missing(): preparing inputs (3 rows, 2 column(s), method='median')
#> Warning: impute_missing(): column 'a' has high missingness (33.3%).
#> Warning: impute_missing(): column 'b' has high missingness (66.7%).
#> impute_missing(): results: imputed 3 values across 2 columns [a=1, b=2]
#> # A tibble: 3 × 3
#>       a     b c
#>   <dbl> <dbl> <chr>
#> 1     1     2 a
#> 2     2     2 b
#> 3     3     2 c
impute_missing(df, method = "constant", constant = -1, cols = "a")
#> Warning: impute_missing(): column 'a' has high missingness (33.3%).
#> # A tibble: 3 × 3
#>       a     b c
#>   <dbl> <dbl> <chr>
#> 1     1    NA a
#> 2    -1    NA b
#> 3     3     2 c