Sample size/power simulation using multi-core computing in R

Mohamad Hasan
R, Statistics & Data Science
7 min read · Nov 29, 2021


Rationale

The goal of this simulation is to calculate power/sample size for a given hazard ratio (HR) in survival analysis by taking advantage of modern multicore computers. In such a simulation we generally generate datasets for different combinations of sample size and HR and then compute the proportion of rejections of the null hypothesis, where the hypotheses are H₀: HR = 1 vs. H₁: HR < 1 for treatment vs. control, assuming the treatment performs better. Suppose we are interested in calculating power when the true HR = {.65, .90, 1.0, 1.1, 1.5} for various combinations of the treatment sample size n1 = {100, 200, 300} and the control sample size n0 = {250, 350} over 5,000 iterations. In other words, we need to fit 5×3×2×5,000 = 150,000 survival models to obtain the power estimates. Here, we show how to use multicore computers effectively to run such simulations faster.
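As a quick check of the size of the simulation, here is a minimal sketch; the values come from the text above, and only the object names are ours:

hr_vals <- c(.65, .90, 1.0, 1.1, 1.5)   # true hazard ratios
n1_vals <- c(100, 200, 300)             # treatment sample sizes
n0_vals <- c(250, 350)                  # control sample sizes
iters   <- 5000                         # iterations per combination
# total number of survival models to fit
length(hr_vals) * length(n1_vals) * length(n0_vals) * iters  # 5 * 3 * 2 * 5000 = 150,000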

Background

Suppose that in a prospective clinical trial the annual dropout rate is 5% (that is, 5% of patients drop out within 12 months), and we want to conduct the final analysis at 42 months, which includes 24 months of enrollment and 18 months of follow-up. If the survival time follows an exponential distribution with a median survival of 36 months for the control arm, then the hazard rate for control is λ₀ = log(2)/36 and for treatment is λ₁ = log(2)/(36/HR). Similarly, the dropout time follows an exponential distribution with rate λd = −log(1−.05)/12, which corresponds to a 5% dropout probability within 12 months. Furthermore, suppose the patient entry time is uniform and the same for both treatment and control.
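Translating these quantities into R, a minimal sketch of the implied rates (HR = .80 is used purely as an illustrative value, and the object names are ours):

hr       <- .80                       # example hazard ratio, treatment vs. control
medsurv0 <- 36                        # median survival (months) for control
lambda0  <- log(2) / medsurv0         # control hazard rate
lambda1  <- log(2) / (medsurv0 / hr)  # treatment hazard rate
lambdad  <- -log(1 - .05) / 12        # dropout rate implied by 5% dropout over 12 months
c(lambda0 = lambda0, lambda1 = lambda1, lambdad = lambdad)
## roughly 0.0193, 0.0154, and 0.0043 per month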

The R-code below shows how to perform parallel computing. We accomplish this in three steps:

  1. Define a function that simulates a single dataset, performs survival analysis, and returns the corresponding statistics.
  2. Define another function that performs step (1) for different combinations of the hazard ratio (HR) and sample sizes (n0 and n1).
  3. Apply the second function across multiple cores to perform parallel computing.
# rm(list=ls())
# load necessary libraries for this simulation
library(survival) # for survival analysis
library(tidyverse) # for data manipulation and analysis
library(multidplyr) # for parallel computing
library(parallel) # for finding the number of cores

The function data_and_analysis simulates a single dataset for the given parameters, fits a Cox proportional hazards regression, and computes the corresponding statistics. The inputs are the population parameters, and the output is a list of two tibbles: 1) the simulated data and 2) the computed statistics. We take advantage of the tidyverse nested data structure to keep both the data and the statistics in case we want to verify the outputs.

data_and_analysis <- function(
  hr = .80,       # default hazard ratio, treatment vs. control
  n0 = 100,       # default sample size for control
  n1 = 50,        # default sample size for treatment
  dropr = .05,    # default dropout rate (proportion) over the dropout window
  meddrop = 12,   # default dropout window in months
  etime = 24,     # default enrollment time in months
  atime = 42,     # default final analysis time in months
  medsurv0 = 36)  # default median survival time in months for control
{
  n = n0 + n1

  # calculate hazard rates for control and treatment from the median survival time
  lambda0 = log(2)/medsurv0
  medsurv1 = medsurv0/hr
  lambda1 = log(2)/medsurv1

  t0 = runif(n)*etime      # simulate entry time
  td = if(dropr == 0){     # simulate dropout time
    rep(Inf, n)
  } else {
    rexp(n, -log(1-dropr)/meddrop)
  }
  t1 = c(rexp(n1, lambda1), rexp(n0, lambda0))   # simulate event time
  arm = c(rep('trt', n1), rep('ctr', n0))        # create arm covariate

  # assign patient status: the event is observed only if it occurs
  # before dropout and before the analysis time
  status = ifelse(t1 + t0 >= atime, 0, ifelse(t1 <= td, 1, 0))

  # observed time: minimum of event time, dropout time, and time remaining until analysis
  time = as.numeric(pmin(atime - t0, pmin(t1, td)))

  dat <- tibble(t0 = t0, td = td, t1 = t1, arm = arm, status = status, time = time)

  # fit survival model and obtain summary
  fit = survival::coxph(Surv(time, status) ~ arm, data = dat) %>% summary()

  # estimated HR and the upper limits of the 80% and 90% confidence intervals
  estHR = exp(fit$coefficients[1])
  uCi80 = exp(fit$coefficients[1] + qnorm(.90)*fit$coefficients[3])
  uCi90 = exp(fit$coefficients[1] + qnorm(.95)*fit$coefficients[3])

  # logical output (1, 0) if the estimated HR > .80
  pHR = as.numeric(estHR > .80)

  # logical output (1, 0) if the upper confidence limit is below 1.0
  puCi80 = as.numeric(uCi80 < 1.0)
  puCi90 = as.numeric(uCi90 < 1.0)

  # logical output (1, 0) if the null hypothesis (H0: HR = 1) is rejected
  pPval80 = as.numeric(fit$coefficients[5] < .20)
  pPval90 = as.numeric(fit$coefficients[5] < .10)

  # keep the simulated data as well as the statistics in a nested structure
  return(list(data = tibble(arm = arm, status = status, time = time),
              stat = tibble(trueHR = hr, n0 = n0, n1 = n1, n = n,
                            estHR = estHR, pHR = pHR,
                            puCi80 = puCi80, puCi90 = puCi90,
                            pPval80 = pPval80, pPval90 = pPval90)))
}
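Before running the full simulation, it can be helpful to sanity-check a single call. A minimal usage sketch follows; the call below and the object name one_run are ours, not part of the original pipeline:

set.seed(1)
one_run <- data_and_analysis(hr = .80, n0 = 100, n1 = 50)
one_run$stat        # one-row tibble of the computed statistics
head(one_run$data)  # first rows of the simulated arm/status/time data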

The function power_by_HR_N simulates datasets for the combinations of hr, n0, and n1 for a single iteration and returns the same output as data_and_analysis. It would be possible to loop over the iterations inside this function; however, our goal is to split the iterations across different cores to reduce the computational and data-transfer time.

power_by_HR_N <- function(iter,            # iteration index (not used in the body)
                          hr = c(.65, .70, .80),
                          n0 = 350,
                          n1 = c(30, 60, 90),
                          dropr = .05,
                          meddrop = 12,
                          etime = 24,
                          atime = 42,
                          medsurv0 = 36)
{
  # find all combinations of hr, n0, and n1, then apply data_and_analysis to each
  # combination; the remaining design parameters are passed through explicitly
  # so they are not silently ignored
  comboFit <- expand_grid(hr, n0, n1) %>%
    mutate(results = pmap(., data_and_analysis,
                          dropr = dropr, meddrop = meddrop, etime = etime,
                          atime = atime, medsurv0 = medsurv0))
  return(comboFit)
}
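Again, a quick single-iteration check may be useful before scaling up. With the defaults (3 hazard ratios × 1 control size × 3 treatment sizes), the call below should return a 9-row tibble; the call itself is only an illustrative sketch:

set.seed(1)
one_iter <- power_by_HR_N(iter = 1)
one_iter                       # 9 rows: one per hr/n0/n1 combination
one_iter$results[[1]]$stat     # statistics for the first combination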

First, we apply the function using multiple cores, and then we repeat the analysis on a single core to compare the timing and results.

# set up multiple cores for parallel computation
cl <- detectCores() # automatically detect the number of cores
cl
## [1] 16
cluster <- new_cluster(cl)
# make the required R libraries and functions available to the cluster
cluster_library(cluster, c("tidyverse", "survival", "dplyr"))
cluster_copy(cluster, c('power_by_HR_N', 'data_and_analysis'))
# apply the functions using multiple cores
set.seed(1212021)
# Start the clock!
ptm <- proc.time()
iter = 5000
d <- bind_cols(iterVal = 1:iter)
iter_result <- d %>%
  group_by(iterVal) %>%
  partition(cluster) %>%
  mutate(output = map(iterVal,
                      ~power_by_HR_N(iter = .,
                                     hr = c(.65, .90, 1.0, 1.1, 1.5),
                                     n0 = c(250, 350),
                                     n1 = c(100, 200, 300)))) %>%
  collect()
proc.time() - ptm
##    user  system elapsed
##  32.359   1.178 175.843
# End the clock!

This shows the amount of time needed if we use multiple cores.

iter_result
## # A tibble: 5,000 x 2
## # Groups: iterVal [5,000]
## iterVal output
## <int> <list>
## 1 1 <tibble [30 × 4]>
## 2 17 <tibble [30 × 4]>
## 3 33 <tibble [30 × 4]>
## 4 49 <tibble [30 × 4]>
## 5 65 <tibble [30 × 4]>
## 6 81 <tibble [30 × 4]>
## 7 97 <tibble [30 × 4]>
## 8 113 <tibble [30 × 4]>
## 9 129 <tibble [30 × 4]>
## 10 145 <tibble [30 × 4]>
## # … with 4,990 more rows

As you can see, the output has a nested structure. The first column, iterVal, is the iteration index, and the second column, output, is a list column in which each row contains a tibble of 30 rows and 4 columns. The 30 rows correspond to the hr × n0 × n1 = 5 × 2 × 3 = 30 combinations, and the 4 columns are hr, n0, n1, and results. Each element of results is in turn a list of two tibbles: 1) the simulated data (data) and 2) the statistics (stat). To see the details, run the final_result code below step by step and inspect the intermediate objects. Here we are interested in the final statistics, which are obtained by running all steps of final_result.
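For example, one way to peek at a single element of the nested structure (an illustrative inspection, not part of the original pipeline):

iter_result$output[[1]]                    # 30 x 4 tibble for the first collected iteration
iter_result$output[[1]]$results[[1]]$stat  # statistics for the first hr/n0/n1 combination
iter_result$output[[1]]$results[[1]]$data  # the corresponding simulated dataset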

# summarize the output
final_result <- iter_result %>%
  unnest(cols = output) %>%
  .$results %>%
  map_dfr('stat') %>%
  group_by(trueHR, n0, n1, n) %>%
  summarise_all(mean, na.rm = TRUE)
final_result

## # A tibble: 30 x 10
## # Groups: trueHR, n0, n1 [30]
## trueHR n0 n1 n estHR pHR puCi80 puCi90 pPval80 pPval90
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.65 250 100 350 0.655 0.145 0.809 0.684 0.809 0.684
## 2 0.65 250 200 450 0.658 0.104 0.915 0.841 0.915 0.841
## 3 0.65 250 300 550 0.656 0.078 0.949 0.905 0.949 0.905
## 4 0.65 350 100 450 0.661 0.156 0.808 0.687 0.808 0.687
## 5 0.65 350 200 550 0.657 0.0888 0.936 0.884 0.936 0.884
## 6 0.65 350 300 650 0.655 0.0628 0.974 0.941 0.974 0.941
## 7 0.9 250 100 350 0.914 0.729 0.233 0.132 0.270 0.146
## 8 0.9 250 200 450 0.908 0.784 0.279 0.164 0.303 0.172
## 9 0.9 250 300 550 0.910 0.813 0.297 0.181 0.317 0.189
## 10 0.9 350 100 450 0.907 0.725 0.243 0.139 0.271 0.151
## # … with 20 more rows

Finally, we obtain the power estimates by averaging over the iterations.

To compare the computational time, we also conduct the same analysis using a single core. Based on the elapsed times below (about 176 vs. 1,134 seconds), the multi-core run is roughly 6.5 times faster.

# apply functions using single core
# single core----------
set.seed(1212021)
# Start the clock!
ptm <- proc.time()
iter=5000
d <- bind_cols(iterVal = 1:iter)
iter_resultx <- d %>%
  group_by(iterVal) %>%
  mutate(output = map(iterVal,
                      ~power_by_HR_N(iter = .,
                                     hr = c(.65, .90, 1.0, 1.1, 1.5),
                                     n0 = c(250, 350),
                                     n1 = c(100, 200, 300))))
proc.time() - ptm
## user system elapsed
## 1131.947 1.183 1134.448
# End the clock!

This shows the amount of time needed if we use a single core.

# summarize the output
final_resultx <- iter_resultx %>%
  unnest(cols = output) %>%
  .$results %>%
  map_dfr('stat') %>%
  group_by(trueHR, n0, n1, n) %>%
  summarise_all(mean, na.rm = TRUE)
final_resultx

## # A tibble: 30 x 10
## # Groups: trueHR, n0, n1 [30]
## trueHR n0 n1 n estHR pHR puCi80 puCi90 pPval80 pPval90
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.65 250 100 350 0.655 0.145 0.809 0.684 0.809 0.684
## 2 0.65 250 200 450 0.658 0.104 0.915 0.841 0.915 0.841
## 3 0.65 250 300 550 0.656 0.078 0.949 0.905 0.949 0.905
## 4 0.65 350 100 450 0.661 0.156 0.808 0.687 0.808 0.687
## 5 0.65 350 200 550 0.657 0.0888 0.936 0.884 0.936 0.884
## 6 0.65 350 300 650 0.655 0.0628 0.974 0.941 0.974 0.941
## 7 0.9 250 100 350 0.914 0.729 0.233 0.132 0.270 0.146
## 8 0.9 250 200 450 0.908 0.784 0.279 0.164 0.303 0.172
## 9 0.9 250 300 550 0.910 0.813 0.297 0.181 0.317 0.189
## 10 0.9 350 100 450 0.907 0.725 0.243 0.139 0.271 0.151
## # … with 20 more rows

Conclusion

The speedup from parallel computing will not scale exactly with the number of cores (multicore time × number of cores ≠ single-core time) because of the time spent distributing data to, and collecting results from, the cluster; nevertheless, parallel processing is substantially faster than single-core processing. Since most modern computers have multiple cores, we can easily take advantage of them without extra resources.

The outputs from both processes align perfectly. The columns trueHR, n0, n1, and n are the true HR and the sample sizes of the control arm, the treatment arm, and the two arms combined, respectively. The column estHR shows the mean estimated HR for the corresponding true HR, pHR is the proportion of estimated HRs greater than .80, puCi80 and puCi90 are the proportions of 80% and 90% upper confidence limits that fall below 1.0, respectively, and pPval80 and pPval90 show the proportions of rejected null hypotheses at α = .20 and α = .10. The last four columns represent the power for the respective combinations of sample size and HR, and pPval80 and pPval90 give the type-I error when the true HR = 1.
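For instance, to read off the empirical type-I error, one could filter the summary for true HR = 1 (an illustrative one-liner, not from the original post):

final_result %>%
  filter(trueHR == 1) %>%
  select(n0, n1, n, pPval80, pPval90)   # type-I error at alpha = .20 and .10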
