Title: | Large-Scale Hypothesis Testing by Variance Mixing |
---|---|
Description: | Implements large-scale hypothesis testing by variance mixing. It takes two statistics per testing unit -- an estimated effect and its associated squared standard error -- and fits a nonparametric, shape-constrained mixture separately on two latent parameters. It reports local false discovery rates (lfdr) and local false sign rates (lfsr). Manuscript describing algorithm of MixTwice: Zheng et al(2021) <doi: 10.1093/bioinformatics/btab162>. |
Authors: | Zihao Zheng and Michael A.Newton |
Maintainer: | Zihao Zheng <[email protected]> |
License: | GPL-2 |
Version: | 2.0 |
Built: | 2024-11-18 06:24:59 UTC |
Source: | https://github.com/cran/MixTwice |
Implements large-scale hypothesis testing by variance mixing. It takes two statistics per testing unit – an estimated effect and its associated squared standard error – and fits a nonparametric, shape-constrained mixture separately on two latent parameters. It reports local false discovery rates (lfdr) and local false sign rates (lfsr). Manuscript describing algorithm of MixTwice: Zheng et al(2021) <doi: 10.1093/bioinformatics/btab162>.
The DESCRIPTION file:
Package: | MixTwice |
Type: | Package |
Title: | Large-Scale Hypothesis Testing by Variance Mixing |
Version: | 2.0 |
Imports: | alabama, ashr, fdrtool, Iso, stats |
Date: | 2022-02-28 |
Author: | Zihao Zheng and Michael A.Newton |
Maintainer: | Zihao Zheng <[email protected]> |
Description: | Implements large-scale hypothesis testing by variance mixing. It takes two statistics per testing unit -- an estimated effect and its associated squared standard error -- and fits a nonparametric, shape-constrained mixture separately on two latent parameters. It reports local false discovery rates (lfdr) and local false sign rates (lfsr). Manuscript describing algorithm of MixTwice: Zheng et al(2021) <doi: 10.1093/bioinformatics/btab162>. |
License: | GPL-2 |
NeedsCompilation: | no |
Packaged: | 2022-03-02 16:05:34 UTC; appli |
Depends: | R (>= 3.5.0) |
Date/Publication: | 2022-03-02 16:50:02 UTC |
Repository: | https://zzheng68.r-universe.dev |
RemoteUrl: | https://github.com/cran/MixTwice |
RemoteRef: | HEAD |
RemoteSha: | f31a2993c756c1f83e32b0c303033bccdb37b880 |
Index of help topics:
MixTwice-package Large-Scale Hypothesis Testing by Variance Mixing mixtwice Large-scale hypothesis testing by variance mixing peptide_data Peptide array data example
Zihao Zheng and Michael A.Newton
Maintainer: Zihao Zheng <[email protected]>
Zheng et al. MixTwice: Large scale hypothesis testing for peptide arrays by variance mixing. Bioinformatics, 2021.
Zheng et al. Disordered Antigens and Epitope Overlap Between Anti Citrullinated Protein Antibodies and Rheumatoid Factor in Rheumatoid Arthritis. Arthritis & Rheumatology 72.2 (2020): 262-272.
data(peptide_data) ## For more detail and example, use ?peptide_data and ?mixtwice
data(peptide_data) ## For more detail and example, use ?peptide_data and ?mixtwice
MixTwice deploys large-scale hypothesis testing in the case when testing units provide estimated effects and estimated standard errors. It produces empirical Bayesian local false discovery and sign rates for tests of zero effect.
mixtwice(thetaHat, s2, Btheta = 15, Bsigma2 = 10, df, method = c("EM-pava", "AugLag"), maxit = 100, prop = 1)
mixtwice(thetaHat, s2, Btheta = 15, Bsigma2 = 10, df, method = c("EM-pava", "AugLag"), maxit = 100, prop = 1)
thetaHat |
Estimated effect sizes (vector over testing units)) |
s2 |
Estimated squared standard errors of thetaHat (vector over testing units) |
Btheta |
Grid size parameter for effect distribution |
Bsigma2 |
Grid size parameter for variance distribution |
df |
Degrees of freedom in chisquare associated with estimated standard error |
method |
Method used for solving the non-parametric MLE optimization. |
maxit |
Number of iterations in EM if |
prop |
Proportion of units randomly selected to fit the distribution, with default, |
mixtwice
takes estimated effects and standard errors over a set of testing units. To compute local error-rate statistics, it finds nonparametric MLEs of the underlying distributions. It is similar to "ashr", except that mixtwice allows both a mixing distribution of underlying effects theta as well as a mixing distribution over underlying variance parameters. Furthermore, it treats the effect mixing distribution nonparametrically, but enforces the shape constraint that this distribution is unimodal with mode at theta=0. (We do not assume symmetry). The distribution of variance parameters is also treated nonparametrically, but with no shape constraints. The observations are assumed to be emitted from a normal distribution (on estimated effects) and an independent Chi-square distribution (on estimated squared standard errors).
grid.theta |
Support of the estimated mixing distribution on effects |
grid.sigma2 |
Support of the estimated mixing distribution of variances |
mix.theta |
Estimated distribution of effect size, on grid.theta |
mix.sigma2 |
Estimated distribution of variance, on grid.sigma2 |
lfdr |
Local false discovery rate for each testing unit |
lfsr |
Local false sign rate for each testing unit |
See Zheng et al. 2021 for further detail
Zihao Zheng, Michael A.Newton
Zheng et al. MixTwice: Large scale hypothesis testing for peptide arrays by variance mixing. Bioinformatics, 2021.
See Also as ?peptide_data
set.seed(1) l = 1000 ## number of testing units neach = 20 ## number of subjects in each group pi0 = 0.8 ## null proportion signal1 = rep(0, l) signal2 = signal1 signal2[1:round((1-pi0)*l)] = rnorm(round((1-pi0)*l), mean = 0, sd = 3) ## I will generate the sigma^2 parameter sigma2 = rep(1,l) sigma = sqrt(sigma2) ## Then I can generate data data1 <- data2 <- matrix(NA, nrow = l, ncol = neach) for (i in 1:l) { data1[i,] = rnorm(neach, mean = rep(signal1[i], each = neach), sd = sigma[i]) data2[i,] = rnorm(neach, mean = rep(signal2[i], each = neach), sd = sigma[i]) } thetaHat = rowMeans(data2) - rowMeans(data1) sd1 = apply(data1, 1, sd) sd2 = apply(data2, 1, sd) s2 = sd1^2/neach + sd2^2/neach fit.EM = mixtwice(thetaHat, s2, Btheta = 15, Bsigma2 = 10, df = 2*neach - 2, method = "EM-pava", maxit = 100, prop = 1) ## you can try to visualize the result plot(fit.EM$grid.theta, cumsum(fit.EM$mix.theta), type = "s", xlab = "grid.theta", ylab = "ecdf of theta", lwd = 2) lines(ecdf(signal2 - signal1),cex = 0.1, lwd = 0.5, lty = 2, col = "red") legend("topleft", legend = c("fit.mix", "true.mix"), col = c("black", "red"), lwd = 1, pch = 19) ## calculate false discovery rate and true positive under 0.05 oo = order(fit.EM$lfdr) # number of discovery x1 = sum(cumsum(fit.EM$lfdr[oo])/c(1:l) <= 0.05) # number of true discovery x2 = sum(cumsum(fit.EM$lfdr[oo])/c(1:l) <= 0.05 & (signal2 != 0)[oo]) # number of real positive x3 = sum(signal2 != 0) # number of false discovery x4 = sum(cumsum(fit.EM$lfdr[oo])/c(1:l) <= 0.05 & (signal2 == 0)[oo]) x4/x1 ## false discovery rate FDR x2/x3 ## true positive rate ## null proportion estimation max(fit.EM$mix.theta) ## you can also try using another method (Augmented Lagrangian), the result would be similar # fit.AugLag = mixtwice2(thetaHat, s2, Btheta = 15, Bsigma2 = 10, df = 2*neach - 2, # method = "AugLag", prop = 1)
set.seed(1) l = 1000 ## number of testing units neach = 20 ## number of subjects in each group pi0 = 0.8 ## null proportion signal1 = rep(0, l) signal2 = signal1 signal2[1:round((1-pi0)*l)] = rnorm(round((1-pi0)*l), mean = 0, sd = 3) ## I will generate the sigma^2 parameter sigma2 = rep(1,l) sigma = sqrt(sigma2) ## Then I can generate data data1 <- data2 <- matrix(NA, nrow = l, ncol = neach) for (i in 1:l) { data1[i,] = rnorm(neach, mean = rep(signal1[i], each = neach), sd = sigma[i]) data2[i,] = rnorm(neach, mean = rep(signal2[i], each = neach), sd = sigma[i]) } thetaHat = rowMeans(data2) - rowMeans(data1) sd1 = apply(data1, 1, sd) sd2 = apply(data2, 1, sd) s2 = sd1^2/neach + sd2^2/neach fit.EM = mixtwice(thetaHat, s2, Btheta = 15, Bsigma2 = 10, df = 2*neach - 2, method = "EM-pava", maxit = 100, prop = 1) ## you can try to visualize the result plot(fit.EM$grid.theta, cumsum(fit.EM$mix.theta), type = "s", xlab = "grid.theta", ylab = "ecdf of theta", lwd = 2) lines(ecdf(signal2 - signal1),cex = 0.1, lwd = 0.5, lty = 2, col = "red") legend("topleft", legend = c("fit.mix", "true.mix"), col = c("black", "red"), lwd = 1, pch = 19) ## calculate false discovery rate and true positive under 0.05 oo = order(fit.EM$lfdr) # number of discovery x1 = sum(cumsum(fit.EM$lfdr[oo])/c(1:l) <= 0.05) # number of true discovery x2 = sum(cumsum(fit.EM$lfdr[oo])/c(1:l) <= 0.05 & (signal2 != 0)[oo]) # number of real positive x3 = sum(signal2 != 0) # number of false discovery x4 = sum(cumsum(fit.EM$lfdr[oo])/c(1:l) <= 0.05 & (signal2 == 0)[oo]) x4/x1 ## false discovery rate FDR x2/x3 ## true positive rate ## null proportion estimation max(fit.EM$mix.theta) ## you can also try using another method (Augmented Lagrangian), the result would be similar # fit.AugLag = mixtwice2(thetaHat, s2, Btheta = 15, Bsigma2 = 10, df = 2*neach - 2, # method = "AugLag", prop = 1)
A high-density peptide microarray example to identify peptides for which antibody binding levels differ between control subjects and rheumatoid arthritis (RA) patients expressing a specific disease marker combination (i.e., CCP+RF+ RA).
data("peptide_data")
data("peptide_data")
A data frame with 152603 observations on the following 16 variables.
The first 8 columns are RA patients and the remaining columns are from control subjects.
Each row of the data (rownames(peptide_data)
) is a probed length-12 peptide and each column of the data (colnames(peptide_data)
) is a subject with distinct pseudo sample ID. The binding value is doubly-log transformed using natural base to stabilize variance.
Zheng, Zihao, et al. Disordered Antigens and Epitope Overlap Between Anti Citrullinated Protein Antibodies and Rheumatoid Factor in Rheumatoid Arthritis. Arthritis & Rheumatology 72.2 (2020): 262-272.
Zheng et al. MixTwice: Large scale hypothesis testing for peptide arrays by variance mixing. Bioinformatics, 2021.
#### load the RA data data(peptide_data) #### visualize the data ## each row is a peptide with unique peptide sequence ## each column is a subject with information on group and pseudo ID colnames(peptide_data) ## z-score for peptide get_zscore = function(x){ n = length(x) t = t.test(x[1:(n/2)], x[(n/2 + 1):n], var.equal = TRUE)$statistic return(qnorm(pt(t, df = n-2))) } z = apply(peptide_data, 1, get_zscore) ## visualize the density of z-score hist(z, probability = TRUE, 100, ylim = c(0,0.4), col = "blue") lines(density(rnorm(10^5)), lwd =2)
#### load the RA data data(peptide_data) #### visualize the data ## each row is a peptide with unique peptide sequence ## each column is a subject with information on group and pseudo ID colnames(peptide_data) ## z-score for peptide get_zscore = function(x){ n = length(x) t = t.test(x[1:(n/2)], x[(n/2 + 1):n], var.equal = TRUE)$statistic return(qnorm(pt(t, df = n-2))) } z = apply(peptide_data, 1, get_zscore) ## visualize the density of z-score hist(z, probability = TRUE, 100, ylim = c(0,0.4), col = "blue") lines(density(rnorm(10^5)), lwd =2)