A power estimate is only as trustworthy as the test it is built on. If a design/analysis combination does not control its Type I error rate, a high “power” number is meaningless — the test is just rejecting too often, under the null as well as the alternative. mixpower makes this check a first-class step.
mp_calibrate() runs the scenario under the null (the
focal effect set to zero) and estimates the empirical Type I error rate,
with an exact interval and a verdict. For a correctly specified model it
should sit near alpha.
d <- mp_design(clusters = list(subject = 24), trials_per_cell = 6)
a <- mp_assumptions(
fixed_effects = list(`(Intercept)` = 0, condition = 0.4),
random_effects = list(subject = list(intercept_sd = 0.5)),
residual_sd = 1
)
scn <- mp_scenario_lme4(y ~ condition + (1 | subject), design = d, assumptions = a)
mp_calibrate(scn, nsim = 60, seed = 11)
#> <mp_calibration>
#> term: condition (true effect set to 0)
#> nominal alpha: 0.05
#> empirical Type I: 0.0500 (95% CI 0.0104, 0.1392)
#> verdict: well-calibratedThe classic trap: the data have a by-subject random slope, but the
analysis model omits it. This inflates Type I error (Barr et al., 2013).
mp_calibrate() flags it.
a_slope <- mp_assumptions(
fixed_effects = list(`(Intercept)` = 0, condition = 0.4),
random_effects = list(subject = list(
intercept_sd = 0.5, slopes = list(condition = 0.8)
)),
residual_sd = 1
)
# Data have the slope; the fitted model (1 | subject) ignores it.
scn_mis <- mp_scenario_lme4(y ~ condition + (1 | subject), design = d, assumptions = a_slope)
mp_calibrate(scn_mis, nsim = 60, seed = 7)
#> <mp_calibration>
#> term: condition (true effect set to 0)
#> nominal alpha: 0.05
#> empirical Type I: 0.1667 (95% CI 0.0829, 0.2852)
#> verdict: anti-conservative
#> -> This test rejects too often under the null; its power is not
#> trustworthy. Try a df-corrected or bootstrap test, or a model
#> that matches the data-generating random-effects structure.The verdict turns to anti-conservative: any power
computed from this model would be inflated. The fix is to fit the
maximal model (1 + condition | subject), which mixpower
simulates and tests directly.
Wald (z/t) tests are anti-conservative when the number of clusters is
small, because their degrees of freedom are overstated (Luke, 2017).
mp_recommend_method() gives fast, design-based
guidance.
scn_few <- mp_scenario_lme4(
y ~ condition + (1 | subject),
design = mp_design(list(subject = 12), trials_per_cell = 8),
assumptions = a
)
mp_recommend_method(scn_few)
#> <mp_recommendation>
#> current method: wald
#> smallest grouping factor: 12 levels
#> caution: yes
#> recommended: kenward-roger, satterthwaite, pb
#> Smallest grouping factor has 12 levels (< 30). 'wald' tests can be anti-conservative with few clusters; prefer 'kenward-roger' or 'satterthwaite' or 'pb'. Verify with mp_calibrate().With few clusters it steers you toward a degrees-of-freedom-corrected
test (Satterthwaite or Kenward-Roger for LMMs) or a parametric
bootstrap. You can then measure the improvement with
mp_calibrate():
Calibration results drop into the same reporting layer as power results:
mp_report_table(mp_calibrate(scn, nsim = 40, seed = 1))
#> term alpha type1 ci_low ci_high verdict nsim
#> 1 condition 0.05 0.075 0.01574218 0.2038647 well-calibrated 40Run mp_calibrate() whenever you change the design size
or the analysis model. A trustworthy power analysis reports both the
power and the calibration that backs it.