
Professor Jian Zhang

Professor of Statistics
Head of the Statistics Group

About

Professor Zhang is Head of the Statistics Group. His research interests are diverse and include non-parametric and high-dimensional statistics; further details are given below.

Research interests

  • Non-parametric statistics and high-dimensional statistics
  • Bioinformatics and computational biology
  • Statistical genetics
  • Neuroimaging methods
  • Bayesian modelling

Supervision

Publications

Article

  • Zhang, J. (2019). Forward Beamforming and Inferring Functional Connectivity with MEG Data. TAB.
    Conventional beamformers that reconstruct the cerebral origin of brain activity measured outside the head via electro- and magnetoencephalography (EEG/MEG) suffer from depth bias and smearing of nearby sources.
    Here, to meet these methodological challenges, we propose a depth-invariant forward beamformer for magnetoencephalography (MEG) data. Based on the new proposal, we further develop a two-step approach for inferring functional connectivity in the brain.
    The proposed methodology has several attractive features: it is invariant to source depth in the brain, it nulls the smearing of nearby sources, and it allows for time-varying source orientations. We illustrate the new approach with MEG data derived from a face-perception experiment, revealing patterns of functional connectivity for face perception. We identify a set of brain regions whose responses and connectivity vary significantly when the stimuli alternate between faces and scrambled faces.
    By simulation studies, we show that the proposed forward beamformer can outperform forward methods based on conventional beamformers in terms of localization bias.
  • Ding, H., Lu, Z., Zhang, J. and Zhang, R. (2018). Semi-functional partial linear quantile regression. Statistics & Probability Letters [Online] 142:92-101. Available at: https://doi.org/10.1016/j.spl.2018.07.007.
    The semi-functional partial linear model is a flexible model in which a scalar response is related to both a functional covariate and scalar covariates. We propose a quantile estimation of this model as an alternative to the least squares approach. We also extend the proposed method to a kNN quantile method. Under some regularity conditions, we establish the asymptotic normality of the quantile estimators of the regression coefficients. We also derive the rates of convergence of the nonparametric function estimator. The finite-sample performance of our estimation is compared with the least squares approach via a Monte Carlo simulation study. The simulation results indicate that our method is much more robust than the least squares method. A real data example on spectrometric data is used to illustrate that our model and approach are promising.
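    As a toy illustration of the quantile-estimation idea (not the semi-functional partial linear estimator of the paper), the sketch below fits a purely linear quantile regression by minimising the Koenker-Bassett check loss; the function names and the choice of optimiser are illustrative assumptions.

    import numpy as np
    from scipy.optimize import minimize

    def check_loss(u, tau):
        # Koenker-Bassett check function: rho_tau(u) = u * (tau - 1{u < 0}).
        return np.sum(u * (tau - (u < 0)))

    def quantile_fit(X, y, tau=0.5):
        # Estimate linear tau-th quantile regression coefficients by direct
        # minimisation of the check loss (a crude stand-in for the paper's
        # semi-functional partial linear quantile estimator).
        beta0 = np.zeros(X.shape[1])
        res = minimize(lambda b: check_loss(y - X @ b, tau), beta0, method="Nelder-Mead")
        return res.x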
  • Ding, H., Zhang, R. and Zhang, J. (2018). Quantile Estimation for a Hybrid Model of Functional and Varying Coefficient Regressions. Journal of Statistical Planning and Inference [Online] 196:1-18. Available at: https://doi.org/10.1016/j.jspi.2017.10.005.
    We consider a hybrid of functional and varying-coefficient regression models for the analysis of mixed functional data. We propose a quantile estimation of this hybrid model as an alternative to the least squares approach. Under regularity conditions, we establish the asymptotic normality of the proposed estimator. We show that the estimated slope function can attain the minimax convergence rate, as in functional linear regression. A Monte Carlo simulation study and a real data application suggest that the proposed estimation is promising.
  • Ali, F. and Zhang, J. (2017). Mixture Model-Based Association Analysis with Case-Control Data in Genome Wide Association Studies. Statistical Applications in Genetics and Molecular Biology [Online] 16. Available at: http://dx.doi.org/10.1515/sagmb-2016-0022.
    Multilocus haplotype analysis of candidate variants with genome wide association studies (GWAS) data may provide evidence of association with disease, even when the individual loci themselves do not. Unfortunately, when a large number of candidate variants are investigated, identifying risk haplotypes can be very difficult. To meet the challenge, a number of approaches have been put forward in recent years. However, most of them are not directly linked to the disease-penetrances of haplotypes and thus may not be efficient. To fill this gap, we propose a mixture model-based approach for detecting risk haplotypes. Under the mixture model, haplotypes are clustered directly according to their estimated disease penetrances. A theoretical justification of the above model is provided. Furthermore, we introduce a hypothesis test for haplotype inheritance patterns which underpin this model. The performance of the proposed approach is evaluated by simulations and real data analysis. The simulation results show that the proposed approach outperforms an existing multiple testing method in terms of average specificity and sensitivity. We apply the proposed approach to analyzing two datasets on coronary artery disease and hypertension in the Wellcome Trust Case Control Consortium, identifying many more disease associated haplotype blocks than does the existing method.
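    A minimal sketch of the clustering idea, assuming haplotype case/total copy counts as input: a two-component binomial-mixture EM that separates haplotypes with high and low case proportions. It is an illustrative simplification, not the estimator or the inheritance-pattern test developed in the paper.

    import numpy as np
    from scipy.stats import binom

    def binom_mixture_em(k, n, n_iter=200):
        # Two-component binomial-mixture EM.
        # k: case counts per haplotype, n: total counts per haplotype (1-D arrays).
        w, p1, p2 = 0.5, 0.3, 0.7          # crude, illustrative starting values
        for _ in range(n_iter):
            # E-step: responsibility of the higher case-rate ("risk") component.
            f1 = binom.pmf(k, n, p1)
            f2 = binom.pmf(k, n, p2)
            r = w * f2 / ((1 - w) * f1 + w * f2)
            # M-step: update the mixing weight and the two case proportions.
            w = r.mean()
            p2 = np.sum(r * k) / np.sum(r * n)
            p1 = np.sum((1 - r) * k) / np.sum((1 - r) * n)
        return w, p1, p2, r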
  • Zhang, J. (2016). Screening and Clustering of Sparse Regressions with Finite Non-Gaussian Mixtures. Biometrics [Online] 73:540-550. Available at: http://dx.doi.org/10.1111/biom.12585.
    This article proposes a method to address the problem that can arise when covariates in a regression setting are not Gaussian, which may give rise to approximately mixture-distributed errors, or when a true mixture of regressions produced the data. The method begins with non-Gaussian mixture-based marginal variable screening, followed by fitting a full but relatively smaller mixture regression model to the selected data with the help of a new penalization scheme. Under certain regularity conditions, the new screening procedure is shown to possess a sure screening property even when the population is heterogeneous. We further prove that there exists an elbow point in the associated scree plot which results in a consistent estimator of the set of active covariates in the model. By simulations, we demonstrate that the new procedure can substantially improve the performance of existing procedures in the context of variable screening and data clustering. By applying the proposed procedure to motif data analysis in molecular biology, we demonstrate that the new method holds promise in practice.
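    For intuition, the fragment below performs plain correlation-based marginal screening, keeping the covariates most associated with the response; the paper replaces this marginal statistic with a non-Gaussian mixture-based one and adds a penalized mixture-regression step, so this is only a simplified stand-in.

    import numpy as np

    def marginal_screen(X, y, top_k):
        # Rank covariates by absolute marginal correlation with the response
        # and retain the top_k of them.
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        scores = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
        return np.argsort(scores)[::-1][:top_k]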
  • Zhang, J. and Su, L. (2016). Temporal Autocorrelation-Based Beamforming with MEG Neuroimaging Data. Journal of the American Statistical Association [Online] 110:1375-1388. Available at: http://dx.doi.org/10.1080/01621459.2015.1054488.
    Characterizing the brain source activity using Magnetoencephalography (MEG) requires solving an ill-posed inverse problem.
    Most source reconstruction procedures are performed in terms of power comparison. However, in the presence of voxel-specific noises, a direct power analysis can be misleading due to power distortion, as suggested by our multiple-trial MEG study on a face-perception experiment. To tackle the issue, we propose a temporal autocorrelation-based method for the above analysis. The new method improves the face-perception analysis and identifies several differences between neuronal responses to face and scrambled-face stimuli. Through simulated and real data analyses, we demonstrate that, compared to the existing methods, the new proposal is more robust to voxel-specific noises without compromising its accuracy in source localization. We further establish consistency for estimating the proposed index when the number of sensors and the number of time instants are sufficiently large. In particular, we show that the proposed procedure can focus better on true sources than its predecessors in terms of the peak segregation coefficient.
  • Ali, F. and Zhang, J. (2015). Screening tests for Disease Risk Haplotype Segments in Genome by Use of Permutation. Journal of Systems Science and Mathematical Sciences [Online] 35:1402-1417. Available at: http://en.cnki.com.cn/Journal_en/A-A003-STYS-2015-12.htm.
    The haplotype association analysis has been proposed to capture the collective behavior of sets of variants by testing the association of each set, instead of individual variants, with the disease. Such an analysis typically involves a list of unphased multiple-locus genotypes with potentially sparse frequencies in cases and controls. It starts with inferring haplotypes from genotypes, followed by a haplotype co-classification and marginal screening for disease-associated haplotypes. Unfortunately, phasing uncertainty may have a strong effect on the haplotype co-classification and therefore on the accuracy of predicting risk haplotypes. Here, to address the issue, we propose an alternative approach: In Stage 1, we select potential risk genotypes instead of co-classifying the inferred haplotypes. In Stage 2, we infer risk haplotypes from the genotypes selected in the previous stage. The performance of the proposed procedure is assessed by simulation studies and a real data analysis. Compared to the existing multiple Z-test procedure, we find that the power of genome-wide association studies can be increased by using the proposed procedure.
  • Zhang, J. (2015). On Nonparametric Feature Filters in Electromagnetic Imaging. Journal of Statistical Planning and Inference [Online] 164:39-53. Available at: http://dx.doi.org/10.1016/j.jspi.2015.03.004.
    Estimation of sparse time-varying coefficients on the basis of time-dependent observations is one of the most challenging problems in statistics. Our study was mainly motivated by magnetoencephalographic neuroimaging, where we want to identify neural activities using magnetoencephalographic sensor measurements taken outside the brain. The problem is ill-posed since the observed magnetic field could result from an infinite number of possible neuronal sources. The so-called minimum-variance beamformer is one of the data-adaptive nonparametric feature filters proposed in the literature to address this problem. In this paper, we propose a method of sure feature filtering for a high-dimensional time-varying coefficient model. The new method assumes that the correlation structure of the sensor measurements can be well represented by a set of non-orthogonal
    variance-covariance components. We develop a theory on the sure screening property of the proposed filters and on when the beamformer-based location estimators are consistent or inconsistent with the true ones. We also derive the lower and upper bounds for the mean filtering errors of the proposed method. The new theory is further supported by simulations and a real data analysis.
  • Zhang, J. and Liu, C. (2015). On Linearly Constrained Minimum Variance Beamforming. Journal of Machine Learning Research [Online] 16. Available at: https://dl.acm.org/doi/10.5555/2789272.2886818.
    Beamforming is a widely used technique for source localization in signal processing and neuroimaging. A number of vector-beamformers have been introduced in the literature to localize neuronal activity using magnetoencephalography (MEG) data. However, the existing theoretical analyses of these beamformers have been limited to simple cases, where no more than two sources are allowed in the associated model and the theoretical sensor covariance is assumed known. Information about the effects of the MEG spatial and temporal dimensions on the consistency of vector-beamforming is incomplete. In the present study, we consider a class of vector-beamformers defined by thresholding the sensor covariance matrix, which includes the standard vector-beamformer as a special case. A general asymptotic theory is developed for these vector-beamformers, showing the extent to which the MEG spatial and temporal dimensions affect the estimation of the neuronal activity index. The performance of the proposed beamformers is assessed by simulation studies; they show superior performance when the signal-to-noise ratio is low.
    We apply the proposed procedure to real MEG datasets derived from five sessions of a human face-perception experiment, finding several highly active areas in the brain. A good agreement between these findings and the known neurophysiology of the MEG response to human face perception is shown.
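    The following sketch shows a scalar minimum-variance (LCMV-type) beamformer in which the sample sensor covariance is hard-thresholded before inversion, in the spirit of the class of beamformers studied here; the array shapes, the ridge term and the function names are illustrative assumptions rather than the authors' implementation.

    import numpy as np

    def threshold_cov(R, tau):
        # Hard-threshold the off-diagonal entries of a sample covariance matrix.
        R_t = R.copy()
        off_diag = ~np.eye(R.shape[0], dtype=bool)
        R_t[off_diag & (np.abs(R_t) < tau)] = 0.0
        return R_t

    def lcmv_power_map(data, leadfields, tau=0.0, ridge=1e-12):
        # data: (n_sensors, n_times) MEG recording; leadfields: (n_voxels, n_sensors).
        # Returns the minimum-variance output power 1 / (l' R^{-1} l) at each voxel.
        R = threshold_cov(np.cov(data), tau) + ridge * np.eye(data.shape[0])
        R_inv = np.linalg.inv(R)
        return np.array([1.0 / (l @ R_inv @ l) for l in leadfields])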
  • Ali, F. and Zhang, J. (2015). Search for Risk Haplotype Segments with GWAS Data by Use of Finite Mixture Models. Statistics and Its Interface [Online] 9:267-280. Available at: http://dx.doi.org/10.4310/SII.2016.v9.n3.a2.
    The region-based association analysis has been proposed to capture the
    collective behavior of sets of variants by testing the association of each set instead of individual variants with the disease. Such an analysis typically
    involves a list of unphased multiple-locus genotypes with
    potentially sparse frequencies in cases and controls.
    To tackle the problem of the sparse distribution, a two-stage approach was proposed in the literature: In the first stage, haplotypes are computationally inferred from genotypes, followed by a haplotype co-classification. In the second stage, the association analysis is performed on the inferred haplotype groups. If a haplotype is unevenly distributed between the case and control samples, this
    haplotype is labeled as a risk haplotype. Unfortunately, the in-silico reconstruction of haplotypes might produce a proportion of
    false haplotypes which hamper the detection of rare but true
    haplotypes. Here, to address the issue, we propose an alternative approach: In Stage 1, we cluster genotypes instead of inferred haplotypes and estimate the
    risk genotypes based on a finite mixture model. In Stage 2, we infer risk haplotypes from risk genotypes inferred from the
    previous stage.
    To estimate the finite mixture model, we propose an EM algorithm with a novel data partition-based initialization.
    The performance of the proposed procedure is assessed by
    simulation studies and a real data analysis. Compared to the existing
    multiple Z-test procedure, we find that the power of genome-wide association studies can be increased by using the proposed procedure.
  • Zhang, J., Liu, C. and Green, G. (2014). Source Localization with MEG Data: A Beamforming Approach Based on Covariance Thresholding. Biometrics [Online] 70:121-131. Available at: http://dx.doi.org/10.1111/biom.12123.
    Reconstructing neural activities using non-invasive sensor arrays outside the brain is an ill-posed inverse problem
    since the observed sensor measurements could result from an infinite number of possible neuronal sources. The sensor
    covariance-based beamformer mapping represents a popular and simple solution to the above problem. In this article, we
    propose a family of beamformers by using covariance thresholding. A general theory is developed on how their spatial and
    temporal dimensions determine their performance. Conditions are provided for the convergence rate of the associated beamformer
    estimation. The implications of the theory are illustrated by simulations and a real data analysis.
  • Zhang, J. (2013). Epistatic Clustering: A Model-Based Approach for Identifying Links Between Clusters. Journal of the American Statistical Association [Online] 108:1366-1384. Available at: http://dx.doi.org/10.1080/01621459.2013.835661.
    Most clustering methods assume that the data can be represented by mutually exclusive clusters, although this assumption may not be the
    case in practice. For example, in gene expression microarray studies, investigators have often found that a gene can play multiple functions
    in a cell and may, therefore, belong to more than one cluster simultaneously, and that gene clusters can be linked to each other in certain
    pathways. This article examines the effect of the above assumption on the likelihood of finding latent clusters using theoretical calculations
    and simulation studies, for which the epistatic structures were known in advance, and on real data analyses. To explore potential links
    between clusters, we introduce an epistatic mixture model which extends the Gaussian mixture by including epistatic terms. A generalized
    expectation-maximization (EM) algorithm is developed to compute the related maximum likelihood estimators. The Bayesian information
    criterion is then used to determine the order of the proposed model. A bootstrap test is proposed for testing whether the epistatic mixture
    model is a significantly better fit to the data than a standard mixture model in which each data point belongs to one cluster. The asymptotic
    properties of the proposed estimators are also investigated when the number of analysis units is large. The results demonstrate that the
    epistatic links between clusters do have a serious effect on the accuracy of clustering and that our epistatic approach can substantially reduce
    such an effect and improve the fit.
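    As a small illustration of the order-selection step, the snippet below uses the Bayesian information criterion to choose the number of components of an ordinary Gaussian mixture via scikit-learn; the paper applies the same criterion to its epistatic mixture model fitted by a generalized EM algorithm, so this is only a generic analogue.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def select_order_by_bic(X, max_k=6, seed=0):
        # Fit Gaussian mixtures with 1..max_k components and return the
        # number of components that minimises BIC, together with that fit.
        fits = [GaussianMixture(n_components=k, random_state=seed).fit(X)
                for k in range(1, max_k + 1)]
        bics = np.array([m.bic(X) for m in fits])
        best = int(np.argmin(bics))
        return best + 1, fits[best]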
  • Zhang, J. (2012). Generalized plaid models. Neurocomputing [Online] 79:95-104. Available at: http://dx.doi.org/10.1016/j.neucom.2011.10.011.
    The problem of two-way clustering has attracted considerable attention in diverse research areas such as functional genomics, text mining, and market research, where people want to simultaneously cluster rows and columns of a data matrix. In this paper, we propose a family of generalized plaid models for two-way clustering, where the layer estimation is regularized by Bayesian Information Criterion (BIC).
    The new models have broadened the scope of ordinary plaid models by
    specifying the variance function to make the models adaptive to the entire distribution of the error term. A formal test is provided for finding significant layers. A Metropolis algorithm is also developed to calculate the maximum likelihood estimators of unknown parameters in the proposed models. Three simulation studies and the applications to two real datasets are reported, which demonstrate that our procedure is promising.
  • Zhang, J. and Liang, F. (2010). Robust Clustering Using Exponential Power Mixtures. Biometrics [Online] 66:1078-1086. Available at: http://dx.doi.org/10.1111/j.1541-0420.2010.01389.x.
    Clustering is a widely used method in extracting useful information from gene expression data, where unknown
    correlation structures in genes are believed to persist even after normalization. Such correlation structures pose a great
    challenge to conventional clustering methods, such as the Gaussian mixture (GM) model, k-means (KM), and partitioning
    around medoids (PAM), which are not robust against general dependence within data. Here we use the exponential
    power mixture model to increase the robustness of clustering against general dependence and nonnormality of the data. An
    expectation–conditional maximization algorithm is developed to calculate the maximum likelihood estimators (MLEs) of the
    unknown parameters in these mixtures. The Bayesian information criterion is then employed to determine the numbers of
    components of the mixture. The MLEs are shown to be consistent under sparse dependence. Our numerical results indicate
    that the proposed procedure outperforms GM, KM, and PAM when there are strong correlations or non-Gaussian components
    in the data.
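    A compact sketch of the robust-mixture idea for one-dimensional data, using Laplace components (the exponential power density with shape parameter 1) so that the conditional M-steps reduce to a weighted median and a weighted mean absolute deviation; this is an illustrative special case under made-up settings, not the general expectation-conditional maximization algorithm of the paper.

    import numpy as np
    from scipy.stats import laplace

    def laplace_mixture_em(x, K=2, n_iter=100, seed=0):
        # EM for a K-component Laplace mixture; x is a 1-D numpy array.
        rng = np.random.default_rng(seed)
        mu = rng.choice(x, size=K, replace=False).astype(float)
        b = np.full(K, x.std() + 1e-8)
        w = np.full(K, 1.0 / K)
        order = np.argsort(x)
        for _ in range(n_iter):
            # E-step: responsibilities of each component for each observation.
            dens = np.stack([w[k] * laplace.pdf(x, loc=mu[k], scale=b[k]) for k in range(K)])
            r = dens / dens.sum(axis=0)
            # CM-steps: weighted median for location, weighted mean absolute
            # deviation for scale, mean responsibility for the mixing weight.
            for k in range(K):
                cum = np.cumsum(r[k][order])
                mu[k] = x[order][np.searchsorted(cum, cum[-1] / 2.0)]
                b[k] = np.sum(r[k] * np.abs(x - mu[k])) / r[k].sum() + 1e-8
            w = r.mean(axis=1)
        return w, mu, b, r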
  • Zhang, J. (2010). A Bayesian model for biclustering with applications. Journal of the Royal Statistical Society: Series C (Applied Statistics) [Online] 59:635-656. Available at: http://dx.doi.org/10.1111/j.1467-9876.2010.00716.x.
    The paper proposes a Bayesian method for biclustering with applications to gene
    microarray studies, where we want to cluster genes and experimental conditions simultaneously.
    We begin by embedding bicluster analysis into the framework of a plaid model with random
    effects. The corresponding likelihood is then regularized by the hierarchical priors in each layer.
    The resulting posterior, which is asymptotically equivalent to a penalized likelihood, can attenuate
    the effect of high dimensionality on cluster predictions. We provide an empirical Bayes
    algorithm for sampling posteriors, in which we estimate the cluster memberships of all genes
    and samples by maximizing an explicit marginal posterior of these memberships. The new algorithm
    makes the estimation of the Bayesian plaid model computationally feasible and efficient.
    The performance of our procedure is evaluated on both simulated and real microarray gene
    expression data sets. The numerical results show that our proposal substantially outperforms
    the original plaid model in terms of misclassification rates across a range of scenarios. Applying
    our method to two yeast gene expression data sets, we identify several new biclusters which
    show the enrichment of known annotations of yeast genes.
  • Zhang, J. (2009). Learning Bayesian networks for discrete data. Computational Statistics and Data Analysis [Online] 53:865-876. Available at: http://dx.doi.org/10.1016/j.csda.2008.10.007.
    Bayesian networks have received much attention in the recent literature. In this article,
    we propose an approach to learn Bayesian networks using the stochastic approximation
    Monte Carlo (SAMC) algorithm. Our approach has two attractive features. Firstly, it possesses a self-adjusting mechanism and thus essentially avoids the local-trap problem suffered by conventional MCMC simulation-based approaches to learning Bayesian networks. Secondly, it falls into the class of dynamic importance sampling algorithms; the network features can be inferred by dynamically weighted averaging of the samples generated in the
    learning process, and the resulting estimates can have much lower variation than the single
    model-based estimates. The numerical results indicate that our approach can mix much
    faster over the space of Bayesian networks than the conventional MCMC simulation-based
    approaches.
  • Zhang, J. and Liang, F. (2008). Estimating the false discovery rate using the stochastic approximation algorithm. Biometrika [Online] 95:961-977. Available at: http://dx.doi.org/10.1093/biomet/asn036.
    Testing of multiple hypotheses involves statistics that are strongly dependent in some applications,
    but most work on this subject is based on the assumption of independence. We propose
    a new method for estimating the false discovery rate of multiple hypothesis tests, in which the
    density of test scores is estimated parametrically by minimizing the Kullback–Leibler distance
    between the unknown density and its estimator using the stochastic approximation algorithm,
    and the false discovery rate is estimated using the ensemble averaging method. Our method is
    applicable under general dependence between test statistics. Numerical comparisons between our
    method and several competitors, conducted on simulated and real data examples, show that our
    method achieves more accurate control of the false discovery rate in almost all scenarios.
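    For orientation, the fragment below computes a simple Storey-type plug-in estimate of the false discovery rate at a fixed p-value threshold; it is deliberately much cruder than the stochastic approximation and ensemble-averaging estimator proposed in the paper and takes no account of dependence.

    import numpy as np

    def plugin_fdr(pvals, t, pi0=1.0):
        # Plug-in FDR estimate at threshold t: expected number of false
        # discoveries (pi0 * m * t) over the number of observed discoveries.
        pvals = np.asarray(pvals)
        m = pvals.size
        discoveries = max(int(np.sum(pvals <= t)), 1)
        return min(pi0 * m * t / discoveries, 1.0)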
  • Zhang, J., Regieli, J., Schipper, M., Entius, M., Liang, F., Koerselman, J., Ruven, H., van der Graaf, Y., Grobbee, D. and Doevendans, P. (2008). Inflammatory Gene Haplotype-Interaction Networks Involved in Coronary Collateral Formation. Human Heredity [Online] 66:252-264. Available at: http://dx.doi.org/10.1159/000143407.
    Objectives: Formation of collateral circulation is an endogenous response to atherosclerosis and a natural escape mechanism that re-routes blood. Inflammatory response-related genes underlie the formation of coronary collaterals. We explored the genetic basis of collateral formation in man, postulating interaction networks between functional Single Nucleotide Polymorphisms (SNPs) in these inflammatory gene candidates. Methods: The contribution of 41 genes, as well as the interactions among them, was examined in a cohort of 226 coronary artery disease patients genotyped for 54 candidate SNPs. Patients were classified according to the extent of collateral circulation. Stepwise logistic regression analysis and a haplotype entropy procedure were applied to search for haplotype interactions among all 54 polymorphisms. Multiple testing was addressed by using the false discovery rate (FDR) method. Results: The population comprised 84 patients with and 142 without visible collaterals. Among the 41 genes, 16 pairs of SNPs were implicated in the development of collaterals with an FDR of 0.19. Nine SNPs were found to potentially have main effects on collateral formation. Two sets of coupling haplotypes that predispose to collateral formation were suggested. Conclusions: These findings suggest that collateral formation may arise from interactions between several SNPs in inflammatory response-related genes, which may represent targets in future studies of collateral formation. This may aid the development of strategies for risk stratification and the therapeutic stimulation of arteriogenesis.
  • van Greevenbroek, M., Zhang, J., van der Kallen, C., Schiffers, P., Feskens, E. and de Bruin, T. (2008). Effects of interacting networks of cardiovascular risk genes on the risk of type 2 diabetes mellitus (the CODAM study). BMC Medical Genetics [Online] 9:36. Available at: http://dx.doi.org/10.1186/1471-2350-9-36.
    Background: Genetic dissection of complex diseases requires innovative approaches for identification of disease-predisposing genes. A well-known example of a human complex disease with a strong genetic component is Type 2 Diabetes Mellitus (T2DM). Methods: We genotyped normal-glucose-tolerant subjects (NGT; n = 54), subjects with an impaired glucose metabolism (IGM; n = 111) and T2DM (n = 142) subjects, in an assay (designed by Roche Molecular Systems) for detection of 68 polymorphisms in 36 cardiovascular risk genes. Using the single-locus logistic regression and the so-called haplotype entropy, we explored the possibility (1) that common pathways underlie development of T2DM and cardiovascular disease, which would imply enrichment of cardiovascular risk polymorphisms in "pre-diabetic" (IGM) and diabetic (T2DM) populations, and (2) that gene-gene interactions are relevant for the effects of risk polymorphisms. Results: In single-locus analyses, we found suggestive associations with disturbed glucose metabolism (i.e. subjects who were either IGM or had T2DM), or with T2DM only. Moreover, in the haplotype entropy analysis, we identified a total of 14 pairs of polymorphisms (with a false discovery rate of 0.125) that may confer risk of disturbed glucose metabolism, or T2DM only, as members of interacting networks of genes. We substantiated gene-gene interactions by showing that these interacting networks can indeed identify potential "disease-predisposing allele-combinations". Conclusion: Gene-gene interactions of cardiovascular risk polymorphisms can be detected in prediabetes and T2DM, supporting the hypothesis that common pathways may underlie development of T2DM and cardiovascular disease. Thus, a specific set of risk polymorphisms, when simultaneously present, increases the risk of disease and hence is indeed relevant in the transfer of risk.
  • Zhang, J. and Liang, F. (2008). Convergence of Stochastic approximation algorithm under irregular conditions. Statistica Neerlandica [Online] 62:393-403. Available at: http://dx.doi.org/10.1111/j.1467-9574.2008.00397.x.
    We consider a class of stochastic approximation (SA) algorithms for
    solving a system of estimating equations. The standard condition for
    the convergence of the SA algorithms is that the estimating functions
    are locally Lipschitz continuous. Here, we show that this condition can
    be relaxed to the extent that the estimating functions are bounded
    and continuous almost everywhere. As a consequence, the use of the
    SA algorithm can be extended to some problems with irregular estimating
    functions. Our theoretical results are illustrated by solving an
    estimation problem for exponential power mixture models.
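    A minimal Robbins-Monro sketch of the stochastic approximation recursion for solving an estimating equation from noisy evaluations; the step-size rule and the toy mean-estimation example are illustrative assumptions, not the settings studied in the paper.

    import numpy as np

    def robbins_monro(g_noisy, theta0, n_iter=5000, a=1.0):
        # Stochastic approximation: theta <- theta - (a / n) * g_noisy(theta),
        # where g_noisy is an unbiased noisy evaluation of the estimating function.
        theta = float(theta0)
        for n in range(1, n_iter + 1):
            theta -= (a / n) * g_noisy(theta)
        return theta

    # Toy example: solve E[theta - X] = 0 with X ~ N(2, 1), i.e. recover the mean 2.
    rng = np.random.default_rng(0)
    estimate = robbins_monro(lambda th: th - rng.normal(2.0, 1.0), theta0=0.0)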
  • Ahmad, N., Zhang, J., Brown, P., James, D., Birch, J., Racher, A. and Smales, C. (2006). On the statistical analysis of the GS-NS0 cell proteome: Imputation, clustering and variability testing. Biochimica Et Biophysica Acta-Proteins and Proteomics [Online] 1764:1179-1187. Available at: http://dx.doi.org/10.1016/j.bbapap.2006.05.002.
    We have undertaken two-dimensional gel electrophoresis proteomic profiling on a series of cell lines with different recombinant antibody production rates. Due to the nature of gel-based experiments not all protein spots are detected across all samples in an experiment, and hence datasets are invariably incomplete. New approaches are therefore required for the analysis of such graduated datasets. We approached this problem in two ways. Firstly, we applied a missing value imputation technique to calculate missing data points. Secondly, we combined a singular value decomposition based hierarchical clustering with the expression variability test to identify protein spots whose expression correlates with increased antibody production. The results have shown that while imputation of missing data was a useful method to improve the statistical analysis of such data sets, this was of limited use in differentiating between the samples investigated, and highlighted a small number of candidate proteins for further investigation.
  • Fan, J. and Zhang, J. (2004). Sieve empirical likelihood ratio tests for nonparametric functions. Annals of Statistics 32:1858-1907.
    Generalized likelihood ratio statistics have been proposed in Fan, Zhang and Zhang [Ann. Statist. 29 (2001) 153-193] as a generally applicable method for testing nonparametric hypotheses about nonparametric functions. The likelihood ratio statistics are constructed based on the assumption that the distributions of stochastic errors are in a certain parametric family. We extend their work to the case where the error distribution is completely unspecified via newly proposed sieve empirical likelihood ratio (SELR) tests. The approach is also applied to test conditional estimating equations on the distributions of stochastic errors. It is shown that the proposed SELR statistics asymptotically follow rescaled chi-squared distributions, with the scale constants and the degrees of freedom being independent of the nuisance parameters. This demonstrates that the Wilks phenomenon observed in Fan, Zhang and Zhang [Ann. Statist. 29 (2001) 153-193] continues to hold under more relaxed models and a larger class of techniques. The asymptotic power of the proposed test is also derived, which achieves the optimal rate for nonparametric hypothesis testing. The proposed approach has two advantages over the generalized likelihood ratio method: it requires one to specify only some conditional estimating equations rather than the entire distribution of the stochastic error, and the procedure adapts automatically to the unknown error distribution, including heteroscedasticity. A simulation study is conducted to evaluate our proposed procedure empirically.
  • Zhang, J., Liang, F., Dassen, W., Doevendans, P. and de Gunst, M. (2003). Search for haplotype interactions that influence susceptibility to type 1 diabetes, through use of unphased genotype data. American Journal of Human Genetics [Online] 73:1385-1401.
    Type 1 diabetes is a T-cell-mediated chronic disease characterized by the autoimmune destruction of pancreatic insulin-producing beta cells and complete insulin deficiency. It is the result of a complex interrelation of genetic and environmental factors, most of which have yet to be identified. Simultaneous identification of these genetic factors, through use of unphased genotype data, has received increasing attention in the past few years. Several approaches have been described, such as the modified transmission/disequilibrium test procedure, the conditional extended transmission/disequilibrium test, and the stepwise logistic-regression procedure. These approaches are limited either by being restricted to family data or by ignoring so-called "haplotype interactions" between alleles. To overcome this limit, the present study provides a general method to identify, on the basis of unphased genotype data, the haplotype blocks that interact to define the risk for a complex disease. The principle underpinning the proposal is minimal entropy. The performance of our procedure is illustrated for both simulated and real data. In particular, for a set of Dutch type 1 diabetes data, our procedure suggests some novel evidence of the interactions between and within haplotype blocks that are across chromosomes 1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 15, 16, 17, 19, and 21. The results demonstrate that, by considering interactions between potential disease haplotype blocks, we may succeed in identifying disease-predisposing genetic variants that might otherwise have remained undetected.
  • Zhang, J. and Gijbels, I. (2003). Sieve empirical likelihood and extensions of the generalized least squares. Scandinavian Journal of Statistics [Online] 30:1-24. Available at: http://dx.doi.org/10.1111/1467-9469.t01-1-00315.
    The empirical likelihood cannot be used directly sometimes when an infinite dimensional parameter of interest is involved. To overcome this difficulty, the sieve empirical likelihoods are introduced in this paper. Based on the sieve empirical likelihoods, a unified procedure is developed for estimation of constrained parametric or non-parametric regression models with unspecified error distributions. It shows some interesting connections with certain extensions of the generalized least squares approach. A general asymptotic theory is provided. In the parametric regression setting it is shown that under certain regularity conditions the proposed estimators are asymptotically efficient even if the restriction functions are discontinuous. In the non-parametric regression setting the convergence rate of the maximum estimator based on the sieve empirical likelihood is given. In both settings, it is shown that the estimator is adaptive for the inhomogeneity of conditional error distributions with respect to predictor, especially for heteroscedasticity.
  • Fan, J., Zhang, C. and Zhang, J. (2001). Generalized likelihood ratio statistics and Wilks phenomenon. Annals of Statistics 29:153-193.
    Likelihood ratio theory has had tremendous success in parametric inference, due to the fundamental theory of Wilks. Yet, there is no generally applicable approach for nonparametric inferences based on function estimation. Maximum likelihood ratio test statistics in general may not exist in the nonparametric function estimation setting. Even if they exist, they are hard to find and cannot be optimal, as shown in this paper. We introduce the generalized likelihood statistics to overcome the drawbacks of nonparametric maximum likelihood ratio statistics. A new Wilks phenomenon is unveiled. We demonstrate that a class of the generalized likelihood statistics based on some appropriate nonparametric estimators is asymptotically distribution free and follows chi-squared distributions under the null for a number of useful hypotheses and a variety of useful models, including Gaussian white noise models, nonparametric regression models, varying coefficient models and generalized varying coefficient models. We further demonstrate that generalized likelihood ratio statistics are asymptotically optimal in the sense that they achieve optimal rates of convergence given by Ingster. They can even be adaptively optimal in the sense of Spokoiny by using a simple choice of adaptive smoothing parameter. Our work indicates that the generalized likelihood ratio statistics are indeed general and powerful for nonparametric testing problems based on function estimation.

Book section

  • Oftadeh, E. and Zhang, J. (2019). Bayesian Mixture Models with Weight-Dependent Component Priors for Bayesian Clustering. In: The Festschrift in Honour of Professor Kai-Tai Fang’s 80th Birthday. Springer, pp. 1-15.
    In the conventional Bayesian mixture models, independent priors are often assigned to weights and component parameters. This may cause bias in the estimation of missing group memberships due to the domination of these priors for some components when there is large variation across component weights. To tackle this issue, we propose weight-dependent priors for component parameters. To implement the proposal, we develop a simple coordinate-wise updating algorithm for finding the empirical Bayes estimator of the allocation (labelling) vector of observations. We conduct a simulation study to show that the new method can outperform the existing approaches in terms of the adjusted Rand index.
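    The adjusted Rand index used for this comparison can be computed as in the fragment below; the toy labellings are made-up illustrations, not data from the paper.

    from sklearn.metrics import adjusted_rand_score

    # The index equals 1 for labellings that agree up to a permutation of the
    # cluster labels and is near 0 for random labellings.
    true_labels = [0, 0, 1, 1, 2, 2]
    estimated_labels = [1, 1, 0, 0, 2, 2]
    print(adjusted_rand_score(true_labels, estimated_labels))  # prints 1.0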
  • Alston, M., Johnson, C. and Robinson, G. (2003). Colour merging for the visualization of biomolecular sequence data. In: Banissi, E., Borner, K., Chen, C., Clapworthy, G., Maple, C., Lobben, A., Moore, C. J., Roberts, J. C., Ursyn, A. and Zhang, J. eds. Proceedings on Seventh International Conference on Information Visualization, 2003. IEEE, pp. 169-175. Available at: http://dx.doi.org/10.1109/IV.2003.1217975.
    We introduce a novel technique for the visualization of data at various levels of detail. This is based on a colour-based representation of the data, where "high level" views of the data are obtained by merging colours together to obtain a summary-colour which represents a number of data-points. This is applied to the problem of visualizing biomolecular sequence data and picking out features in such data at various scales.
  • Banissi, E., Borner, K., Chen, C., Clapworthy, G., Maple, C., Lobben, A., Moore, C.J., Roberts, J.C., Ursyn, A. and Zhang, J. eds. (2003). Seventh International Conference on Information Visualization (IV 03). In: International Conference on Information Visualization (IV 2003). IEEE Computer Society Press,U.S. Available at: http://www.cs.kent.ac.uk/pubs/2003/2465.

Edited book

  • Banissi, E., Borner, K., Chen, C., Dastbaz, M., Clapworthy, G., Faiola, A., Izquierdo, E., Maple, C., Roberts, J.C., Moore, C.J., Ursyn, A. and Zhang, J. eds. (2004). Eighth International Conference on Information Visualisation, 2004. IV 2004. IEEE.

Thesis

  • Oftadeh, E. (2017). Complex Modelling of Multi-Outcome Data With Applications to Cancer Biology.
    In applied scientific areas such as economics, finance, biology, and medicine, it is often required to find the relationship between a set of independent variables (predictors) and a set of response variables (i.e., outcomes of an experiment). If we model individual outcomes separately, we potentially miss information about the correlation among outcomes. Therefore, it is desirable to model these outcomes simultaneously by multivariate linear regressions.
    With the advent of high-throughput technology, there is an enormous amount of high-dimensional multivariate regression data being generated at an extraordinary speed. However, only a small proportion of them are informative. This high dimensionality has imposed a challenge on modern statistics. In this work, we propose methods and algorithms for modelling high-dimensional multivariate regression data. The contributions of this thesis are as follows.

    Firstly, we propose two variable screening techniques to reduce the high dimension of predictors. One is a beamforming-based screening method based on a signal-to-noise ratio (SNR) statistic. The second approach is a mixture-based screening in which the screening is conducted through the so-called likelihood fusion.

    Secondly, we propose a variable selection method called principal variable analysis (PVA). In PVA we take into account the correlation between response variables in the process of variable selection. We compare PVA with some well-known variable selection methods by simulation studies, showing that PVA can substantially enhance the selection accuracy.

    Thirdly, we develop a method for simultaneous clustering and variable selection using the likelihood fusion. We demonstrate the features of the proposed method through simulation studies.

    Fourthly, we study a Bayesian clustering problem through the mixture of normal distributions where we propose mixing-proportion dependent priors for component parameters.

    Finally, we apply the proposed methods to cancer drug data. These data contain expression levels of 13321 genes across 42 cell lines and the responses of these cell lines to 131 drugs, recorded as fifty percent inhibitory concentration (IC50) values. We identify 37 genes which are important for predicting IC50 values. We find that although the expressions of these genes are weakly correlated, they are highly correlated in terms of their regression coefficients. We also identify a regression coefficient-based network between genes. We also show that 34 out of the 37 selected genes have played certain roles in at least one type of cancer.
    Moreover, by applying the likelihood fusion model to the real data, we classify the drugs into five groups.
  • Ali, F. (2015). Statistical Methods For Detecting Genetic Risk Factors of a Disease With Applications to Genome-Wide Association Studies.
    This thesis aims to develop various statistical methods for analysing the data derived from genome wide association studies (GWAS).
    The GWAS often involves genotyping individual human genetic variation, using high-throughput genome-wide single nucleotide polymorphism (SNP) arrays, in thousands of individuals and testing for association between those variants and a given disease under the assumption of common disease/common variant.
    Although GWAS have identified many potential genetic factors in the genome that affect the risks to complex
    diseases, there is still much of the genetic heritability that remains unexplained. The power of
    detecting new genetic risk variants can be improved by considering multiple genetic variants simultaneously with novel statistical methods.
    Improving the analysis of the GWAS data has received much attention from statisticians and other scientific researchers over the past decade.

    There are several challenges arising in analysing the GWAS data. First, determining the risk SNPs might be difficult due to non-random correlation between SNPs that can inflate type I and II errors in statistical inference. When a group of SNPs are considered together in the context of haplotypes/genotypes, the distribution of the haplotypes/genotypes is sparse, which makes it difficult to detect risk haplotypes/genotypes in terms of disease penetrance.

    In this work, we proposed four new methods to identify risk haplotypes/genotypes based on their frequency differences between cases and controls. To evaluate the performance of our methods, we simulated datasets under a wide range of scenarios according to both retrospective and prospective designs.

    In the first method, we first reconstruct haplotypes by using unphased genotypes, followed by clustering and thresholding the inferred haplotypes into risk and non-risk groups with a two-component binomial-mixture model. In this method, the parameters are estimated by a modified Expectation-Maximization algorithm in which the maximisation step is replaced by posterior sampling of the component parameters. We also elucidated the relationships between risk and non-risk haplotypes under different modes of inheritance and genotypic relative risk.

    In the second method, we fitted a three-component mixture model to genotype data directly, followed by an odds-ratio thresholding.

    In the third method, we combined the existing haplotype reconstruction software PHASE and permutation method to infer risk haplotypes.

    In the fourth method, we proposed a new way to score the genotypes by clustering and combined it with a logistic regression approach to infer risk haplotypes.

    The simulation studies showed that the first three methods outperformed the multiple testing method of Zhu (2010) in terms of average specificity and sensitivity (AVSS) in all scenarios considered. The logistic regression-based method also outperformed the standard logistic regression method.

    We applied our methods to two GWAS datasets on coronary artery disease (CAD) and hypertension (HT), detecting several new risk haplotypes and recovering a number of the existing disease-associated genetic variants in the literature.

Forthcoming

  • Zhang, J. and Li, J. (2019). Factorized Estimation of High-Dimensional Nonparametric Covariance Models. Annals of Statistics [Online]. Available at: https://www.imstat.org/aos/.
    Estimation of a covariate-dependent conditional covariance matrix in a high-dimensional space poses a challenge to contemporary statistical research. The existing kernel estimators may not be locally adaptive due to using a single bandwidth to explore the smoothness of all entries of the target matrix function. Moreover, the corresponding theory holds only for i.i.d. samples, although in most applications the samples are dependent. In this paper, we propose a novel estimation scheme to overcome these obstacles by using techniques of factorization, thresholding and optimal shrinkage. Under certain regularity conditions, we show that the proposed estimator is consistent for the underlying matrix even when the sample is dependent. We conduct a set of simulation studies to show that the proposed estimator significantly outperforms its competitors. We apply the proposed procedure to the analysis of an asset return dataset, identifying a number of interesting volatility and co-volatility patterns across different time periods.
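    As a generic point of reference for the shrinkage ingredient (not the factorized, covariate-dependent estimator proposed in the paper), the snippet below applies Ledoit-Wolf optimal linear shrinkage to a sample covariance in a p > n setting; the data dimensions are illustrative.

    import numpy as np
    from sklearn.covariance import LedoitWolf

    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 200))            # n = 50 observations, p = 200 variables
    cov_shrunk = LedoitWolf().fit(X).covariance_  # well-conditioned shrinkage estimate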
  • Ding, H., Zhang, J. and Zhang, R. (2018). Nonparametric Variable Screening for Multivariate Additive Models. TBD.
    We propose a novel approach to nonparametric variable screening for sparse multivariate additive
    models with random effects, which includes two stages. In Stage 1, each nonparametric component is approximated by a linear combination of spline basis functions. Under this approximation, the above screening problem can be treated as selecting block-matrices of regression coefficients for a multivariate regression model. In Stage 2, a series of filtering operations are conducted by projections of the multiple response observations into the covariate space; each filter is tailored to a particular covariate and resistant to interferences originating from other covariates and from background
    noises. The filtering is further improved by sequentially nulling significant covariates detected in the previous steps. An asymptotic theory on the selection consistency has been established under some regularity conditions. By simulations, the proposed procedure is shown to outperform the existing procedures in terms of sensitivity and specificity over a wide range of scenarios. We apply the proposed approach to the integrative analysis of the anti-cancer drug data, identifying a few biomarkers that potentially influence the concentration of drugs in cancer cell lines.
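    The Stage 1 spline approximation can be pictured with the fragment below, which builds a cubic B-spline basis matrix so that each nonparametric component is represented by a linear combination of its columns; the knot placement and function name are illustrative assumptions, not the paper's implementation.

    import numpy as np
    from scipy.interpolate import BSpline

    def bspline_basis(x, interior_knots, degree=3):
        # Evaluate a B-spline basis at the points x; column j is the j-th basis
        # function, so a nonparametric component f(x) is approximated by
        # bspline_basis(x, knots) @ coefficients.
        t = np.concatenate(([x.min()] * (degree + 1), interior_knots, [x.max()] * (degree + 1)))
        n_basis = len(t) - degree - 1
        return np.column_stack([BSpline(t, np.eye(n_basis)[j], degree)(x)
                                for j in range(n_basis)])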
  • Zhang, J. and Oftadeh, E. (2016). Multivariate Variable Selection through Use of Null-Beamforming: Principal Variable Analysis. TBD.
    This article extends the idea of principal component analysis to multivariate variable selection. The basic premise behind the proposal is to scan through a predictor variable space with a series of filters called null-beamformers; each is
    tailored to a particular region in the space and resistant to interference effects originating from
    other regions. This gives rise to a predictive power map for predictor selection. The new approach attempts to explore the maximum amount of variation in the data with a small number of principal variables. Applying the proposal to simulated data and real cancer drug data, we show that it outperforms the existing methods in terms of sensitivity and specificity.