Zhang, J. (2019). Forward Beamforming and Inferring Functional Connectivity with MEG Data. TAB.
The conventional beamformers that reconstruct the cerebral origin of brain activity measured outside the head via electro- and magnetoencephalography (EEG/MEG) suffer from depth bias and smearing of nearby sources.
Here, to meet these methodological challenges, we propose a depth-invariant forward beamformer for magnetoencephalography (MEG) data. Building on this proposal, we further develop a two-step approach for inferring functional connectivity in the brain.
The proposed methodology has several attractive features: it is invariant with respect to source depth in the brain, it nulls the smearing of nearby sources, and it allows for time-varying source orientations. We illustrate the new approach with MEG data derived from a face-perception experiment, revealing patterns of functional connectivity for face perception. We identify a set of brain regions whose responses and connectivity vary significantly when stimuli alternate between faces and scrambled faces.
By simulation studies, we show that the proposed forward beamformer can outperform the forward methods based on conventional beamformers in terms of localization bias.
Ding, H., Lu, Z., Zhang, J. and Zhang, R. (2018). Semi-functional partial linear quantile regression. Statistics & Probability Letters [Online] 142:92-101. Available at: https://doi.org/10.1016/j.spl.2018.07.007.
The semi-functional partial linear model is a flexible model in which a scalar response is related to both a functional covariate and scalar covariates. We propose a quantile estimation of this model as an alternative to the least-squares approach. We also extend the proposed method to a kNN quantile method. Under some regularity conditions, we establish the asymptotic normality of the quantile estimators of the regression coefficients, and we derive the rates of convergence of the nonparametric function estimator. Finite-sample performance of our estimation is compared with the least-squares approach via a Monte Carlo simulation study. The simulation results indicate that our method is much more robust than the least-squares method. A real data example on spectrometric data is used to illustrate that our model and approach are promising.
Ding, H., Zhang, R. and Zhang, J. (2018). Quantile Estimation for a Hybrid Model of Functional and Varying Coefficient Regressions. Journal of Statistical Planning and Inference [Online] 196:1-18. Available at: https://doi.org/10.1016/j.jspi.2017.10.005.
We consider a hybrid of functional and varying-coefficient regression models for the analysis of mixed functional data. We propose a quantile estimation of this hybrid model as an alternative to the least-squares approach. Under regularity conditions, we establish the asymptotic normality of the proposed estimator. We show that the estimated slope function can attain the minimax convergence rate, as in functional linear regression. A Monte Carlo simulation study and a real data application suggest that the proposed estimation is promising.
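The quantile estimation in the two papers above rests on the check (pinball) loss rho_tau(u) = u * (tau - 1{u < 0}). As a toy illustration (a minimal sketch, not the papers' estimators), the empirical tau-quantile of a sample minimises the average pinball loss:

```python
def pinball(u, tau):
    """Check loss rho_tau(u) = u * (tau - I(u < 0))."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def empirical_quantile(xs, tau):
    """Return the sample point minimising the total pinball loss."""
    return min(xs, key=lambda c: sum(pinball(x - c, tau) for x in xs))

data = [3.0, 1.0, 4.0, 1.5, 5.0, 9.0, 2.5]
print(empirical_quantile(data, 0.5))  # -> 3.0, the sample median
```

At tau = 0.5 the total loss is half the sum of absolute deviations, so the minimiser is the median; other values of tau weight over- and under-shooting asymmetrically.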
Ali, F. and Zhang, J. (2017). Mixture Model-Based Association Analysis with Case-Control Data in Genome Wide Association Studies. Statistical Applications in Genetics and Molecular Biology [Online] 16. Available at: http://dx.doi.org/10.1515/sagmb-2016-0022.
Multilocus haplotype analysis of candidate variants with genome-wide association studies (GWAS) data may provide evidence of association with disease, even when the individual loci themselves do not. Unfortunately, when a large number of candidate variants are investigated, identifying risk haplotypes can be very difficult. To meet the challenge, a number of approaches have been put forward in recent years. However, most of them are not directly linked to the disease penetrances of haplotypes and thus may not be efficient. To fill this gap, we propose a mixture model-based approach for detecting risk haplotypes. Under the mixture model, haplotypes are clustered directly according to their estimated disease penetrances. A theoretical justification of the above model is provided. Furthermore, we introduce a hypothesis test for the haplotype inheritance patterns which underpin this model. The performance of the proposed approach is evaluated by simulations and real data analysis. The simulation results show that the proposed approach outperforms an existing multiple-testing method in terms of average specificity and sensitivity. We apply the proposed approach to two datasets on coronary artery disease and hypertension in the Wellcome Trust Case Control Consortium, identifying many more disease-associated haplotype blocks than does the existing method.
Zhang, J. (2016). Screening and Clustering of Sparse Regressions with Finite Non-Gaussian Mixtures. Biometrics [Online] 73:540-550. Available at: http://dx.doi.org/10.1111/biom.12585.
This article proposes a method to address the problem that can arise when covariates in a regression setting are not Gaussian, which may give rise to approximately mixture-distributed errors, or when a true mixture of regressions produced the data. The method begins with non-Gaussian mixture-based marginal variable screening, followed by fitting a full but relatively smaller mixture regression model to the selected data with the help of a new penalization scheme. Under certain regularity conditions, the new screening procedure is shown to possess a sure screening property even when the population is heterogeneous. We further prove that there exists an elbow point in the associated scree plot which yields a consistent estimator of the set of active covariates in the model. By simulations, we demonstrate that the new procedure can substantially improve the performance of existing procedures in the context of variable screening and data clustering. By applying the proposed procedure to motif data analysis in molecular biology, we demonstrate that the new method holds promise in practice.
Zhang, J. and Su, L. (2016). Temporal Autocorrelation-Based Beamforming with MEG Neuroimaging Data. Journal of the American Statistical Association [Online] 110:1375-1388. Available at: http://dx.doi.org/10.1080/01621459.2015.1054488.
Characterizing the brain source activity using Magnetoencephalography (MEG) requires solving an ill-posed inverse problem.
Most source reconstruction procedures are performed in terms of power comparison. However, in the presence of voxel-specific noises, direct power analysis can be misleading due to power distortion, as suggested by our multiple-trial MEG study on a face-perception experiment. To tackle the issue, we propose a temporal autocorrelation-based method for the above analysis. The new method improves the face-perception analysis and identifies several differences between neuronal responses to face and scrambled-face stimuli. Through simulated and real data analyses, we demonstrate that, compared to the existing methods, the new proposal can be more robust to voxel-specific noises without compromising its accuracy in source localization. We further establish the consistency of the estimator of the proposed index when the number of sensors and the number of time instants are sufficiently large. In particular, we show that the proposed procedure can focus on true sources better than its predecessors in terms of the peak segregation coefficient.
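The contrast a temporal autocorrelation-based index exploits can be sketched in a few lines: a smooth oscillatory source retains high short-lag autocorrelation, while white sensor noise does not. This toy example uses the plain sample autocorrelation, not the paper's actual index:

```python
import math
import random

def autocorr(xs, lag):
    """Plain lag-k sample autocorrelation of a time series."""
    n = len(xs)
    m = sum(xs) / n
    denom = sum((x - m) ** 2 for x in xs)
    return sum((xs[t] - m) * (xs[t + lag] - m) for t in range(n - lag)) / denom

random.seed(0)
signal = [math.sin(0.2 * t) for t in range(200)]       # smooth "source"
noise = [random.gauss(0.0, 1.0) for _ in range(200)]   # white sensor noise
print(autocorr(signal, 1) > 0.9, abs(autocorr(noise, 1)) < 0.35)
```

Power-based indices cannot separate these two cases when their variances match; the autocorrelation contrast survives.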
Ali, F. and Zhang, J. (2015). Screening tests for Disease Risk Haplotype Segments in Genome by Use of Permutation. Journal of Systems Science and Mathematical Sciences [Online] 35:1402-1417. Available at: http://en.cnki.com.cn/Journal_en/A-A003-STYS-2015-12.htm.
The haplotype association analysis has been proposed to capture the collective behavior of sets of variants by testing the association of each set, instead of individual variants, with the disease. Such an analysis typically involves a list of unphased multiple-locus genotypes with potentially sparse frequencies in cases and controls. It starts with inferring haplotypes from genotypes, followed by a haplotype co-classification and marginal screening for disease-associated haplotypes. Unfortunately, phasing uncertainty may have strong effects on the haplotype co-classification and therefore on the accuracy of predicting risk haplotypes. Here, to address the issue, we propose an alternative approach: in Stage 1, we select potential risk genotypes instead of co-classifying the inferred haplotypes; in Stage 2, we infer risk haplotypes from the genotypes selected in the previous stage. The performance of the proposed procedure is assessed by simulation studies and a real data analysis. Compared to the existing multiple Z-test procedure, we find that the power of genome-wide association studies can be increased by using the proposed procedure.
Zhang, J. (2015). On Nonparametric Feature Filters in Electromagnetic Imaging. Journal of Statistical Planning and Inference [Online] 164:39-53. Available at: http://dx.doi.org/10.1016/j.jspi.2015.03.004.
Estimation of sparse time-varying coefficients on the basis of time-dependent observations is one of the most challenging problems in statistics. Our study was mainly motivated by magnetoencephalographic neuroimaging, where we want to identify neural activities using magnetoencephalographic sensor measurements taken outside the brain. The problem is ill-posed, since the observed magnetic field could result from an infinite number of possible neuronal sources. The so-called minimum-variance beamformer is one of the data-adaptive nonparametric feature filters proposed in the literature to address this problem. In this paper, we propose a method of sure feature filtering for a high-dimensional time-varying coefficient model. The new method assumes that the correlation structure of the sensor measurements can be well represented by a set of non-orthogonal variance-covariance components. We develop a theory on the sure screening property of the proposed filters and on when the beamformer-based location estimators are consistent or inconsistent with the true locations. We also derive lower and upper bounds for the mean filtering errors of the proposed method. The new theory is further supported by simulations and a real data analysis.
Zhang, J. and Liu, C. (2015). On Linearly Constrained Minimum Variance Beamforming. Journal of Machine Learning Research [Online] 16. Available at: https://dl.acm.org/doi/10.5555/2789272.2886818.
Beamforming is a widely used technique for source localization in signal processing and neuroimaging. A number of vector-beamformers have been introduced in the literature to localize neuronal activity using magnetoencephalography (MEG) data. However, existing theoretical analyses of these beamformers have been limited to simple cases, where no more than two sources are allowed in the associated model and the theoretical sensor covariance is assumed known. The information about the effects of the MEG spatial and temporal dimensions on the consistency of vector-beamforming is incomplete. In the present study, we consider a class of vector-beamformers defined by thresholding the sensor covariance matrix, which includes the standard vector-beamformer as a special case. A general asymptotic theory is developed for these vector-beamformers, which shows the extent to which the MEG spatial and temporal dimensions affect the estimation of the neuronal activity index. The performance of the proposed beamformers is assessed by simulation studies; superior performance is obtained when the signal-to-noise ratio is low. We apply the proposed procedure to real MEG datasets derived from five sessions of a human face-perception experiment, finding several highly active areas in the brain. A good agreement between these findings and the known neurophysiology of the MEG response to human face perception is shown.
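The standard LCMV vector-beamformer that these thresholded variants generalise computes weights w = C^{-1} l / (l' C^{-1} l) for a lead-field vector l and sensor covariance C. The hard-thresholding step and all names below are illustrative assumptions, not the paper's exact estimator:

```python
import numpy as np

def lcmv_weights(C, l, threshold=0.0):
    """LCMV beamformer weights w = C^{-1} l / (l' C^{-1} l); covariance
    entries below `threshold` in magnitude are zeroed first (an
    illustrative stand-in for covariance thresholding)."""
    Ct = np.where(np.abs(C) >= threshold, C, 0.0)
    Ci_l = np.linalg.solve(Ct, l)    # C^{-1} l without forming the inverse
    return Ci_l / (l @ Ci_l)

rng = np.random.default_rng(0)
B = rng.standard_normal((200, 5))    # 200 time points, 5 sensors
C = B.T @ B / 200                    # sample sensor covariance
l = rng.standard_normal(5)           # lead-field vector of one voxel
w = lcmv_weights(C, l)
print(bool(abs(w @ l - 1.0) < 1e-10))   # unit-gain constraint w'l = 1 holds
```

By construction w passes the source's own signal with unit gain while minimising the output variance w'Cw, which is why poorly estimated covariance entries (motivating the thresholding) degrade its performance.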
Ali, F. and Zhang, J. (2015). Search for Risk Haplotype Segments with GWAS Data by Use of Finite Mixture Models. Statistics and Its Interface [Online] 9:267-280. Available at: http://dx.doi.org/10.4310/SII.2016.v9.n3.a2.
The region-based association analysis has been proposed to capture the collective behavior of sets of variants by testing the association of each set, instead of individual variants, with the disease. Such an analysis typically involves a list of unphased multiple-locus genotypes with potentially sparse frequencies in cases and controls. To tackle the problem of the sparse distribution, a two-stage approach was proposed in the literature: in the first stage, haplotypes are computationally inferred from genotypes, followed by a haplotype co-classification; in the second stage, the association analysis is performed on the inferred haplotype groups. If a haplotype is unevenly distributed between the case and control samples, it is labeled as a risk haplotype. Unfortunately, the in-silico reconstruction of haplotypes might produce a proportion of false haplotypes that hamper the detection of rare but true haplotypes. Here, to address the issue, we propose an alternative approach: in Stage 1, we cluster genotypes instead of inferred haplotypes and estimate the risk genotypes based on a finite mixture model; in Stage 2, we infer risk haplotypes from the risk genotypes identified in Stage 1. To estimate the finite mixture model, we propose an EM algorithm with a novel data partition-based initialization. The performance of the proposed procedure is assessed by simulation studies and a real data analysis. Compared to the existing multiple Z-test procedure, we find that the power of genome-wide association studies can be increased by using the proposed procedure.
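The kind of EM fit described above can be sketched for a two-component one-dimensional Gaussian mixture; the split-the-sorted-sample initialisation below is a toy stand-in for the paper's data partition-based initialisation, and all values are illustrative:

```python
import math

def em_two_gaussians(xs, iters=100):
    """EM for a two-component 1-D Gaussian mixture. Initialisation: split
    the sorted sample in half and use each half's mean/variance as
    starting values (a toy data-partition scheme)."""
    xs = sorted(xs)
    half = len(xs) // 2
    parts = [xs[:half], xs[half:]]
    mu = [sum(p) / len(p) for p in parts]
    var = [sum((x - m) ** 2 for x in p) / len(p) for p, m in zip(parts, mu)]
    pi = [0.5, 0.5]

    def pdf(x, m, v):
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    for _ in range(iters):
        # E-step: responsibilities of each component for each point
        r = []
        for x in xs:
            w = [pi[k] * pdf(x, mu[k], var[k]) for k in range(2)]
            s = sum(w)
            r.append([wk / s for wk in w])
        # M-step: update mixing weights, means, and variances
        n = [sum(ri[k] for ri in r) for k in range(2)]
        pi = [nk / len(xs) for nk in n]
        mu = [sum(ri[k] * x for ri, x in zip(r, xs)) / n[k] for k in range(2)]
        var = [sum(ri[k] * (x - mu[k]) ** 2 for ri, x in zip(r, xs)) / n[k]
               for k in range(2)]
    return pi, mu, var

data = [0.1, -0.2, 0.05, 0.3, -0.1, 9.8, 10.2, 10.0, 9.9, 10.1]
pi, mu, var = em_two_gaussians(data)
print([round(m, 2) for m in sorted(mu)])  # -> [0.03, 10.0]
```

With well-separated clusters the responsibilities become near-hard assignments and the fitted means converge to the cluster means; a poor initialisation, by contrast, can trap EM in an inferior local maximum, which is the motivation for careful partition-based starting values.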
Zhang, J., Liu, C. and Green, G. (2014). Source Localization with MEG Data: A Beamforming Approach Based on Covariance Thresholding. Biometrics [Online] 70:121-131. Available at: http://dx.doi.org/10.1111/biom.12123.
Reconstructing neural activities using non-invasive sensor arrays outside the brain is an ill-posed inverse problem, since the observed sensor measurements could result from an infinite number of possible neuronal sources. The sensor covariance-based beamformer mapping represents a popular and simple solution to the above problem. In this article, we propose a family of beamformers by using covariance thresholding. A general theory is developed on how their spatial and temporal dimensions determine their performance. Conditions are provided for the convergence rate of the associated beamformer estimation. The implications of the theory are illustrated by simulations and a real data analysis.
Zhang, J. (2013). Epistatic Clustering: A Model-Based Approach for Identifying Links Between Clusters. Journal of the American Statistical Association [Online] 108:1366-1384. Available at: http://dx.doi.org/10.1080/01621459.2013.835661.
Most clustering methods assume that the data can be represented by mutually exclusive clusters, although this assumption may not hold in practice. For example, in gene expression microarray studies, investigators have often found that a gene can play multiple functions in a cell and may, therefore, belong to more than one cluster simultaneously, and that gene clusters can be linked to each other in certain pathways. This article examines the effect of the above assumption on the likelihood of finding latent clusters using theoretical calculations, simulation studies for which the epistatic structures were known in advance, and real data analyses. To explore potential links between clusters, we introduce an epistatic mixture model which extends the Gaussian mixture by including epistatic terms. A generalized expectation-maximization (EM) algorithm is developed to compute the related maximum likelihood estimators. The Bayesian information criterion is then used to determine the order of the proposed model. A bootstrap test is proposed for testing whether the epistatic mixture model is a significantly better fit to the data than a standard mixture model in which each data point belongs to one cluster. The asymptotic properties of the proposed estimators are also investigated when the number of analysis units is large. The results demonstrate that the epistatic links between clusters do have a serious effect on the accuracy of clustering and that our epistatic approach can substantially reduce such an effect and improve the fit.
Zhang, J. (2012). Generalized plaid models. Neurocomputing [Online] 79:95-104. Available at: http://dx.doi.org/10.1016/j.neucom.2011.10.011.
The problem of two-way clustering has attracted considerable attention in diverse research areas such as functional genomics, text mining, and market research, where people want to simultaneously cluster the rows and columns of a data matrix. In this paper, we propose a family of generalized plaid models for two-way clustering, where the layer estimation is regularized by the Bayesian Information Criterion (BIC). The new models broaden the scope of ordinary plaid models by specifying the variance function to make the models adaptive to the entire distribution of the error term. A formal test is provided for finding significant layers. A Metropolis algorithm is also developed to calculate the maximum likelihood estimators of the unknown parameters in the proposed models. Three simulation studies and applications to two real datasets are reported, which demonstrate that our procedure is promising.
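For readers unfamiliar with plaid models: the fitted values of the ordinary plaid model of Lazzeroni and Owen, which the paper generalises, take the form Y_ij = mu0 + sum_k (mu_k + alpha_ik + beta_jk) * rho_ik * kappa_jk, where rho and kappa are 0-1 row and column memberships of layer k. A minimal sketch with illustrative numbers:

```python
def plaid_fit(n_rows, n_cols, mu0, layers):
    """Fitted values of an ordinary plaid model:
    Y_ij = mu0 + sum_k (mu_k + alpha_ik + beta_jk) * rho_ik * kappa_jk,
    with rho/kappa the 0-1 row/column memberships of each layer."""
    Y = [[mu0 for _ in range(n_cols)] for _ in range(n_rows)]
    for mu_k, alpha, beta, rho, kappa in layers:
        for i in range(n_rows):
            for j in range(n_cols):
                Y[i][j] += (mu_k + alpha[i] + beta[j]) * rho[i] * kappa[j]
    return Y

# One layer covering rows {0, 1} and columns {1, 2} of a 3x3 matrix:
layer = (2.0, [0.5, -0.5, 0.0], [0.0, 0.1, -0.1], [1, 1, 0], [0, 1, 1])
Y = plaid_fit(3, 3, 1.0, [layer])
print(Y[0][1])  # inside the layer: 1.0 + (2.0 + 0.5 + 0.1)
print(Y[2][0])  # outside every layer: background mu0 = 1.0
```

Layers may overlap, so a cell can receive contributions from several biclusters; the generalized plaid models of the paper additionally let the error variance depend on the mean.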
Zhang, J. and Liang, F. (2010). Robust Clustering Using Exponential Power Mixtures. Biometrics [Online] 66:1078-1086. Available at: http://dx.doi.org/10.1111/j.1541-0420.2010.01389.x.
Clustering is a widely used method for extracting useful information from gene expression data, where unknown correlation structures in genes are believed to persist even after normalization. Such correlation structures pose a great challenge to conventional clustering methods, such as the Gaussian mixture (GM) model, k-means (KM), and partitioning around medoids (PAM), which are not robust against general dependence within data. Here we use the exponential power mixture model to increase the robustness of clustering against general dependence and nonnormality of the data. An expectation–conditional maximization algorithm is developed to calculate the maximum likelihood estimators (MLEs) of the unknown parameters in these mixtures. The Bayesian information criterion is then employed to determine the number of components of the mixture. The MLEs are shown to be consistent under sparse dependence. Our numerical results indicate that the proposed procedure outperforms GM, KM, and PAM when there are strong correlations or non-Gaussian components in the data.
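The exponential power (generalised normal) density at the core of such a mixture, and the E-step responsibilities it induces, can be written down directly; the mixture parameters below are illustrative, not taken from the paper:

```python
import math

def ep_pdf(x, mu, alpha, beta):
    """Exponential power (generalised normal) density
    f(x) = beta / (2*alpha*Gamma(1/beta)) * exp(-(|x - mu|/alpha)**beta);
    beta = 2 recovers a Gaussian, beta = 1 a Laplace."""
    return (beta / (2 * alpha * math.gamma(1 / beta))
            * math.exp(-((abs(x - mu) / alpha) ** beta)))

def responsibilities(x, comps):
    """E-step responsibilities for a mixture of EP components,
    comps = [(weight, mu, alpha, beta), ...] (illustrative sketch)."""
    w = [p * ep_pdf(x, m, a, b) for p, m, a, b in comps]
    s = sum(w)
    return [wk / s for wk in w]

mix = [(0.5, 0.0, 1.0, 2.0),   # Gaussian-like component at 0
       (0.5, 5.0, 1.0, 1.0)]   # heavier-tailed Laplace-like component at 5
r = responsibilities(0.2, mix)
print(abs(sum(r) - 1.0) < 1e-12, r[0] > r[1])
```

Varying beta between 1 and 2 (and beyond) trades tail weight against peakedness, which is what makes the mixture robust to non-Gaussian components.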
Zhang, J. (2010). A Bayesian model for biclustering with applications. Journal of the Royal Statistical Society: Series C (Applied Statistics) [Online] 59:635-656. Available at: http://dx.doi.org/10.1111/j.1467-9876.2010.00716.x.
The paper proposes a Bayesian method for biclustering with applications to gene microarray studies, where we want to cluster genes and experimental conditions simultaneously. We begin by embedding bicluster analysis into the framework of a plaid model with random effects. The corresponding likelihood is then regularized by hierarchical priors in each layer. The resulting posterior, which is asymptotically equivalent to a penalized likelihood, can attenuate the effect of high dimensionality on cluster predictions. We provide an empirical Bayes algorithm for sampling posteriors, in which we estimate the cluster memberships of all genes and samples by maximizing an explicit marginal posterior of these memberships. The new algorithm makes the estimation of the Bayesian plaid model computationally feasible and efficient. The performance of our procedure is evaluated on both simulated and real microarray gene expression data sets. The numerical results show that our proposal substantially outperforms the original plaid model in terms of misclassification rates across a range of scenarios. Applying our method to two yeast gene expression data sets, we identify several new biclusters which show the enrichment of known annotations of yeast genes.
Zhang, J. (2009). Learning Bayesian networks for discrete data. Computational Statistics and Data Analysis [Online] 53:865-876. Available at: http://dx.doi.org/10.1016/j.csda.2008.10.007.
Bayesian networks have received much attention in the recent literature. In this article, we propose an approach to learning Bayesian networks using the stochastic approximation Monte Carlo (SAMC) algorithm. Our approach has two attractive features. Firstly, it possesses a self-adjusting mechanism and thus essentially avoids the local-trap problem suffered by conventional MCMC simulation-based approaches to learning Bayesian networks. Secondly, it falls into the class of dynamic importance sampling algorithms; the network features can be inferred by dynamically weighted averaging of the samples generated in the learning process, and the resulting estimates can have much lower variation than single-model-based estimates. The numerical results indicate that our approach can mix much faster over the space of Bayesian networks than the conventional MCMC simulation-based approaches.
Zhang, J. and Liang, F. (2008). Estimating the false discovery rate using the stochastic approximation algorithm. Biometrika [Online] 95:961-977. Available at: http://dx.doi.org/10.1093/biomet/asn036.
Testing of multiple hypotheses involves statistics that are strongly dependent in some applications, but most work on this subject is based on the assumption of independence. We propose a new method for estimating the false discovery rate of multiple hypothesis tests, in which the density of test scores is estimated parametrically by minimizing the Kullback–Leibler distance between the unknown density and its estimator using the stochastic approximation algorithm, and the false discovery rate is estimated using the ensemble averaging method. Our method is applicable under general dependence between test statistics. Numerical comparisons between our method and several competitors, conducted on simulated and real data examples, show that our method achieves more accurate control of the false discovery rate in almost all scenarios.
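For intuition, a textbook plug-in FDR estimate for two-sided z-tests is easy to write down. This is a standard estimator shown only for comparison, not the paper's stochastic-approximation method, and the z-scores are made up:

```python
import math

def phi_upper(t):
    """Upper tail of the standard normal, P(Z > t)."""
    return 0.5 * math.erfc(t / math.sqrt(2))

def estimated_fdr(zs, t, pi0=1.0):
    """Plug-in FDR estimate for two-sided z-tests at threshold t:
    FDR(t) ~= pi0 * m * 2*P(Z > t) / #{|z_i| >= t}.
    Expected nulls rejected over observed rejections, capped at 1."""
    m = len(zs)
    rejected = sum(1 for z in zs if abs(z) >= t)
    if rejected == 0:
        return 0.0
    return min(1.0, pi0 * m * 2 * phi_upper(t) / rejected)

zs = [0.1, -0.5, 1.0, 3.5, 4.0, -3.8]   # hypothetical test scores
print(round(estimated_fdr(zs, 3.0), 4))  # -> 0.0054
```

Estimators of this form implicitly assume the null scores behave like independent standard normals; the paper's contribution is precisely to drop that independence assumption by fitting the score density itself.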
Zhang, J., Regieli, J., Schipper, M., Entius, M., Liang, F., Koerselman, J., Ruven, H., van der Graaf, Y., Grobbee, D. and Doevendans, P. (2008). Inflammatory Gene Haplotype-Interaction Networks Involved in Coronary Collateral Formation. Human Heredity [Online] 66:252-264. Available at: http://dx.doi.org/10.1159/000143407.
Objectives: Formation of collateral circulation is an endogenous response to atherosclerosis and a natural escape mechanism by re-routing blood. Inflammatory response-related genes underlie the formation of coronary collaterals. We explored the genetic basis of collateral formation in man, postulating interaction networks between functional Single Nucleotide Polymorphisms (SNPs) in these inflammatory gene candidates. Methods: The contribution of 41 genes, as well as the interactions among them, was examined in a cohort of 226 coronary artery disease patients genotyped for 54 candidate SNPs. Patients were classified according to the extent of collateral circulation. Stepwise logistic regression analysis and a haplotype entropy procedure were applied to search for haplotype interactions among all 54 polymorphisms. Multiple testing was addressed by using the false discovery rate (FDR) method. Results: The population comprised 84 patients with and 142 without visible collaterals. Among the 41 genes, 16 pairs of SNPs were implicated in the development of collaterals with an FDR of 0.19. Nine SNPs were found to potentially have main effects on collateral formation. Two sets of coupling haplotypes that predispose to collateral formation were suggested. Conclusions: These findings suggest that collateral formation may arise from interactions between several SNPs in inflammatory response-related genes, which may represent targets in future studies of collateral formation. This may enhance the development of strategies for risk stratification and therapeutic stimulation of arteriogenesis.
van Greevenbroek, M., Zhang, J., van der Kallen, C., Schiffers, P., Feskens, E. and de Bruin, T. (2008). Effects of interacting networks of cardiovascular risk genes on the risk of type 2 diabetes mellitus (the CODAM study). BMC Medical Genetics [Online] 9:36. Available at: http://dx.doi.org/10.1186/1471-2350-9-36.
Background: Genetic dissection of complex diseases requires innovative approaches for the identification of disease-predisposing genes. A well-known example of a human complex disease with a strong genetic component is Type 2 Diabetes Mellitus (T2DM). Methods: We genotyped normal-glucose-tolerant subjects (NGT; n = 54), subjects with an impaired glucose metabolism (IGM; n = 111) and T2DM subjects (n = 142), in an assay (designed by Roche Molecular Systems) for the detection of 68 polymorphisms in 36 cardiovascular risk genes. Using single-locus logistic regression and the so-called haplotype entropy, we explored the possibility (1) that common pathways underlie the development of T2DM and cardiovascular disease, which would imply enrichment of cardiovascular risk polymorphisms in "pre-diabetic" (IGM) and diabetic (T2DM) populations; and (2) that gene-gene interactions are relevant for the effects of risk polymorphisms. Results: In single-locus analyses, we showed suggestive association with disturbed glucose metabolism (i.e., subjects who were either IGM or had T2DM), or with T2DM only. Moreover, in the haplotype entropy analysis, we identified a total of 14 pairs of polymorphisms (with a false discovery rate of 0.125) that may confer risk of disturbed glucose metabolism, or of T2DM only, as members of interacting networks of genes. We substantiated gene-gene interactions by showing that these interacting networks can indeed identify potential "disease-predisposing allele-combinations". Conclusion: Gene-gene interactions of cardiovascular risk polymorphisms can be detected in prediabetes and T2DM, supporting the hypothesis that common pathways may underlie the development of T2DM and cardiovascular disease. Thus, a specific set of risk polymorphisms, when simultaneously present, increases the risk of disease and hence is indeed relevant in the transfer of risk.
Zhang, J. and Liang, F. (2008). Convergence of Stochastic approximation algorithm under irregular conditions. Statistica Neerlandica [Online] 62:393-403. Available at: http://dx.doi.org/10.1111/j.1467-9574.2008.00397.x.
We consider a class of stochastic approximation (SA) algorithms for solving a system of estimating equations. The standard condition for the convergence of the SA algorithms is that the estimating functions are locally Lipschitz continuous. Here, we show that this condition can be relaxed to the extent that the estimating functions are bounded and continuous almost everywhere. As a consequence, the use of the SA algorithm can be extended to some problems with irregular estimating functions. Our theoretical results are illustrated by solving an estimation problem for exponential power mixture models.
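A minimal Robbins-Monro iteration for solving an estimating equation E[h(theta, X)] = 0 looks as follows; this is a generic sketch of the SA scheme, not the paper's specific algorithm, and the toy equation E[X - theta] = 0 (whose root is the mean of X) is purely illustrative:

```python
import random

def robbins_monro(h, draw, theta0, n=10000):
    """Robbins-Monro stochastic approximation for E[h(theta, X)] = 0:
    theta_{k+1} = theta_k + gamma_k * h(theta_k, X_k), gamma_k = 1/k.
    Note that h need only be evaluated at noisy draws of X."""
    theta = theta0
    for k in range(1, n + 1):
        theta += h(theta, draw()) / k
    return theta

random.seed(1)
# Toy estimating equation E[X - theta] = 0 for X ~ N(2, 1): root is 2.
est = robbins_monro(lambda th, x: x - th, lambda: random.gauss(2.0, 1.0), 0.0)
print(abs(est - 2.0) < 0.1)
```

With h(theta, x) = x - theta and step sizes 1/k the iterate is exactly the running sample mean; the paper's point is that convergence still holds for much rougher h, bounded and continuous only almost everywhere.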
Ahmad, N., Zhang, J., Brown, P., James, D., Birch, J., Racher, A. and Smales, C. (2006). On the statistical analysis of the GS-NS0 cell proteome: Imputation, clustering and variability testing. Biochimica Et Biophysica Acta-Proteins and Proteomics [Online] 1764:1179-1187. Available at: http://dx.doi.org/10.1016/j.bbapap.2006.05.002.
We have undertaken two-dimensional gel electrophoresis proteomic profiling on a series of cell lines with different recombinant antibody production rates. Due to the nature of gel-based experiments not all protein spots are detected across all samples in an experiment, and hence datasets are invariably incomplete. New approaches are therefore required for the analysis of such graduated datasets. We approached this problem in two ways. Firstly, we applied a missing value imputation technique to calculate missing data points. Secondly, we combined a singular value decomposition based hierarchical clustering with the expression variability test to identify protein spots whose expression correlates with increased antibody production. The results have shown that while imputation of missing data was a useful method to improve the statistical analysis of such data sets, this was of limited use in differentiating between the samples investigated, and highlighted a small number of candidate proteins for further investigation.
Fan, J. and Zhang, J. (2004). Sieve empirical likelihood ratio tests for nonparametric functions. Annals of Statistics 32:1858-1907.
Generalized likelihood ratio statistics were proposed in Fan, Zhang and Zhang [Ann. Statist. 29 (2001) 153-193] as a generally applicable method for testing nonparametric hypotheses about nonparametric functions. The likelihood ratio statistics are constructed based on the assumption that the distributions of stochastic errors belong to a certain parametric family. We extend their work to the case where the error distribution is completely unspecified via newly proposed sieve empirical likelihood ratio (SELR) tests. The approach is also applied to test conditional estimating equations on the distributions of stochastic errors. It is shown that the proposed SELR statistics asymptotically follow rescaled chi-squared distributions, with the scale constants and the degrees of freedom being independent of the nuisance parameters. This demonstrates that the Wilks phenomenon observed in Fan, Zhang and Zhang continues to hold under more relaxed models and for a larger class of techniques. The asymptotic power of the proposed test is also derived, and it achieves the optimal rate for nonparametric hypothesis testing. The proposed approach has two advantages over the generalized likelihood ratio method: it requires one only to specify some conditional estimating equations rather than the entire distribution of the stochastic error, and the procedure adapts automatically to the unknown error distribution, including heteroscedasticity. A simulation study is conducted to evaluate the proposed procedure empirically.
Zhang, J., Liang, F., Dassen, W., Doevendans, P. and de Gunst, M. (2003). Search for haplotype interactions that influence susceptibility to type 1 diabetes, through use of unphased genotype data. American Journal of Human Genetics 73:1385-1401.
Type 1 diabetes is a T-cell-mediated chronic disease characterized by the autoimmune destruction of pancreatic insulin-producing beta cells and complete insulin deficiency. It is the result of a complex interrelation of genetic and environmental factors, most of which have yet to be identified. Simultaneous identification of these genetic factors, through use of unphased genotype data, has received increasing attention in the past few years. Several approaches have been described, such as the modified transmission/disequilibrium test procedure, the conditional extended transmission/disequilibrium test, and the stepwise logistic-regression procedure. These approaches are limited either by being restricted to family data or by ignoring so-called "haplotype interactions" between alleles. To overcome these limitations, the present study provides a general method to identify, on the basis of unphased genotype data, the haplotype blocks that interact to define the risk for a complex disease. The principle underpinning the proposal is minimal entropy. The performance of our procedure is illustrated for both simulated and real data. In particular, for a set of Dutch type 1 diabetes data, our procedure suggests some novel evidence of interactions between and within haplotype blocks across chromosomes 1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 15, 16, 17, 19, and 21. The results demonstrate that, by considering interactions between potential disease haplotype blocks, we may succeed in identifying disease-predisposing genetic variants that might otherwise have remained undetected.
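The minimal-entropy principle can be illustrated with the empirical Shannon entropy of a haplotype distribution: partitions that concentrate probability mass have lower entropy. This is a toy computation on made-up haplotype labels, not the paper's full search procedure:

```python
import math
from collections import Counter

def empirical_entropy(haplotypes):
    """Shannon entropy (in nats) of the empirical haplotype distribution,
    H = -sum_h p_h * log(p_h). Lower entropy means a more concentrated,
    hence more informative, distribution."""
    counts = Counter(haplotypes)
    n = len(haplotypes)
    return sum(-(c / n) * math.log(c / n) for c in counts.values())

# A uniform distribution over four two-locus haplotypes attains log(4);
# a fully concentrated sample attains (essentially) zero.
print(round(empirical_entropy(["AB", "Ab", "aB", "ab"]), 4))  # -> 1.3863
print(round(abs(empirical_entropy(["AB"] * 4)), 4))           # -> 0.0
```

Searching for the block combination that minimises such an entropy-like criterion is what drives the haplotype-interaction identification described above.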
Zhang, J. and Gijbels, I. (2003). Sieve empirical likelihood and extensions of the generalized least squares. Scandinavian Journal of Statistics [Online] 30:1-24. Available at: http://dx.doi.org/10.1111/1467-9469.t01-1-00315.
The empirical likelihood cannot be used directly sometimes when an infinite dimensional parameter of interest is involved. To overcome this difficulty, the sieve empirical likelihoods are introduced in this paper. Based on the sieve empirical likelihoods, a unified procedure is developed for estimation of constrained parametric or non-parametric regression models with unspecified error distributions. It shows some interesting connections with certain extensions of the generalized least squares approach. A general asymptotic theory is provided. In the parametric regression setting it is shown that under certain regularity conditions the proposed estimators are asymptotically efficient even if the restriction functions are discontinuous. In the non-parametric regression setting the convergence rate of the maximum estimator based on the sieve empirical likelihood is given. In both settings, it is shown that the estimator is adaptive for the inhomogeneity of conditional error distributions with respect to predictor, especially for heteroscedasticity.
Fan, J., Zhang, C. and Zhang, J. (2001). Generalized likelihood ratio statistics and Wilks phenomenon. Annals of Statistics 29:153-193.
Likelihood ratio theory has had tremendous success in parametric inference, due to the fundamental theory of Wilks. Yet, there is no generally applicable approach for nonparametric inference based on function estimation. Maximum likelihood ratio test statistics in general may not exist in the nonparametric function estimation setting. Even if they exist, they are hard to find and cannot be optimal, as shown in this paper. We introduce the generalized likelihood statistics to overcome the drawbacks of nonparametric maximum likelihood ratio statistics. A new Wilks phenomenon is unveiled. We demonstrate that a class of the generalized likelihood statistics based on some appropriate nonparametric estimators are asymptotically distribution-free and follow chi-squared distributions under null hypotheses for a number of useful hypotheses and a variety of useful models, including Gaussian white noise models, nonparametric regression models, varying coefficient models and generalized varying coefficient models. We further demonstrate that generalized likelihood ratio statistics are asymptotically optimal in the sense that they achieve the optimal rates of convergence given by Ingster. They can even be adaptively optimal in the sense of Spokoiny by using a simple choice of adaptive smoothing parameter. Our work indicates that the generalized likelihood ratio statistics are indeed general and powerful for nonparametric testing problems based on function estimation.