2004 NESS CONTRIBUTED ABSTRACTS

Alvarez, Enrique - Estimation in stationary Markov renewal processes, with application to earthquake forecasting in Turkey

Choi, Jai Won - A Bayesian Alternative to the Chi-Squared Test of Association in a Two-Way Categorical Table with Intra-Class Correlation

Das, Sonali - Investigating the Effects of Working Conditions on Health Care Quality via Structural Equation Models

Demidenko, Eugene - Clustered Poisson regression or why I don't like GEE

Durairajan, T.M. - Some Information Bounds and Asymptotic Variances

Erhardt, Erik - Bayesian Simultaneous Intervals for Small Area Estimation: An Application to Mapping Mortality Rates in U.S. Health Service Areas

Haughton, Dominique - How to measure age, race and gender effects in ticketing for speeding in Massachusetts

Huang, Lan - Modeling Repeated Binary Responses and Time-Dependent Missing Covariates with Application to a Tree with a Two-Year Periodicity in Flowering Intensity

Jensen, Shane - Prediction of Co-Regulated Genes using Motif Discovery and Clustering

Lai, Yinglei - Statistical methods for identifying differential gene-gene interaction patterns

Levine, Michael - ESTIMATING VARIANCE-COVARIANCE STRUCTURE OF THE NONPARAMETRIC REGRESSION DATA WITH TIME SERIES ERRORS

Li, Lingling - A COMPARISON OF GOODNESS OF FIT TESTS FOR THE LOGISTIC GEE MODEL

Liu, Junfeng - A Statistical Model for Multiple High-throughput Protein-Protein Interaction Assay Assessments

Liu, Zhaohui - Bayesian inference on stochastic volatility under hidden semi-Markov models

L'moudden, Ahmed - TEST OF INDEPENDENCE BASED ON KENDALL'S PROCESS: TABULATE THE PERCENTILES OF CRAMER-VON MISES STATISTICS BY THE STURM-LIOUVILLE APPROACH

Smith, Robert - GLOBAL HUMAN DEVELOPMENT: EXPLAINING ITS REGIONAL VARIATIONS*

Song, Chang Hong - Zero-inflated Poisson Regression Models

Song, Seongho - Hierarchical models with migration, mutation, and drift: implications for genetic inference

Subramanian, Sundar - ASYMPTOTICALLY EFFICIENT ESTIMATION OF A SURVIVAL FUNCTION IN THE MISSING CENSORING INDICATOR MODEL

Wang, Steve - Statistical Challenges in the Analysis of Mass Extinctions

Wilbur, Jayson - A Two-Stage Nearest-Neighbor Classifier with Application to Microbial Source Tracking

Yu, Yaming - Imputing Missing Data by Monotone Blocks

Zhao, Yifang - Statistical Methods for Discovering Differentially Expressed Genes in Replicated Microarray Experiments

Zhou, Qing - Equi-Energy Sampler With Applications to Mixture Model Simulation and Density of States Calculation














































































Estimation in stationary Markov renewal processes, with application to earthquake forecasting in Turkey


Enrique E. Alvarez

Department of Statistics
University of Connecticut

ealvarez@merlot.stat.uconn.edu


Consider a process in which different events occur, with random inter-occurrence times. In Markov renewal processes, the sequence of events is a Markov chain and the waiting-time distributions depend only on the types of the last and the next event. Suppose that the state space is finite and that the process started far in the past, achieving stationarity. Weibull distributions are proposed for the waiting times, and their parameters are estimated jointly with the transition probabilities by maximum likelihood when one or several realizations of the process are observed over finite windows. The model is illustrated with data on earthquakes of three levels of severity that occurred in Turkey during the 20th century.
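To make the joint likelihood concrete, here is a minimal sketch of the log-likelihood of one fully observed path, conditional on its first state. It is a simplification and not the estimator from the talk: it ignores the stationarity and finite-window edge effects mentioned above, and all names (`weibull_logpdf`, `markov_renewal_loglik`, `trans_prob`, `shape`, `scale`) are hypothetical.

```python
import math


def weibull_logpdf(t, shape, scale):
    """Log density of a Weibull(shape, scale) distribution at t > 0."""
    z = t / scale
    return math.log(shape / scale) + (shape - 1.0) * math.log(z) - z ** shape


def markov_renewal_loglik(states, waits, trans_prob, shape, scale):
    """Log-likelihood of a fully observed path, conditional on states[0].

    states: visited event types s_0, ..., s_n (integers)
    waits: waits[k] is the time between event k and event k + 1
    trans_prob[i][j]: probability that type i is followed by type j
    shape[i][j], scale[i][j]: Weibull parameters of the i -> j waiting time
    (assumption: window edge effects are ignored in this sketch)
    """
    total = 0.0
    for k, t in enumerate(waits):
        i, j = states[k], states[k + 1]
        total += math.log(trans_prob[i][j])
        total += weibull_logpdf(t, shape[i][j], scale[i][j])
    return total
```

Maximizing this function over `trans_prob`, `shape`, and `scale` with a general-purpose optimizer would give a rough version of the joint fit; a careful treatment would add the contributions of the censored first and last intervals in each window.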


Return to the Top of This Page

Go to the Next Abstract

Return to the 2004 New England Statistics Symposium Home Page

Return to the Harvard University Statistics Department Home Page








































A Bayesian Alternative to the Chi-Squared Test of Association in a Two-Way Categorical Table with Intra-Class Correlation


Balgobin Nandram* and Jai Won Choi**

Worcester Polytechnic Institute* and National Center for Health Statistics**

balnan@wpi.edu* and jwc7@cdc.gov**


It is straightforward to analyze data from a single multinomial table. Specifically, for the analysis of a two-way categorical table, the common chi-squared test of independence between the two variables and maximum likelihood estimators are readily available. When the counts in the two-way categorical table are formed from familial data (clusters of correlated data), the common chi-squared test no longer applies. We note that there are several approximate adjustments to the common chi-squared test. However, our main contribution is the construction and analysis of a Bayesian model which removes all analytical approximations. This is an extension of a standard multinomial-Dirichlet model to include the intra-class correlation associated with the individuals within a cluster. This intra-class correlation varies with the size of the cluster, but we assume that it is the same for all clusters of the same size for the same variable. We use Markov chain Monte Carlo methods to fit our model, and to make posterior inference about the intra-class correlations and the cell probabilities. We use data from the National Health Interview Survey to show how our alternative test performs and to obtain the posterior density of the cell probabilities. Also, using Monte Carlo integration with a binomial importance function, we obtain the Bayes factor for a test of no association.
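For reference, the common chi-squared statistic that the Bayesian model is an alternative to can be sketched as follows. This is only the textbook Pearson statistic for independent observations, not the authors' method, and the function name `chi_squared_statistic` is hypothetical. It assumes every row and column total is positive.

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for a two-way table.

    table: list of rows of observed counts (assumed to come from
    independent observations, which is exactly what fails for clustered data).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof
```

With positive intra-class correlation the counts behave as if the sample were smaller than it is, which is why this statistic needs adjustment or replacement for familial data.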










































Investigating the Effects of Working Conditions on Health Care Quality via Structural Equation Models


Sonali Das*, Ming-Hui Chen*, Nicholas Warren+, and Dipak Dey*

*Department of Statistics
University of Connecticut

+Assistant Professor of Medicine/Ergonomics Coordinator,
University of Connecticut Health Center

*sonali@stat.uconn.edu
*mhchen@stat.uconn.edu
+warren@nso.uchc.edu
*dey@stat.uconn.edu


The issue of health care quality has recently received significant attention both within and outside the medical community. Working conditions in health care settings, including job health and safety, have direct and indirect impacts on health care quality. Much previous research has tended to focus on a single level or group of factors that affect health care quality. In this work, based on a large national survey, we identify a web of factors influencing employees' perceptions of the organization and outcomes of employee behavior, via variables such as satisfaction, stress, turnover intention, and perception of quality. We discuss different structural equation models (SEMs) based on path analysis consisting of latent and indicator (manifest) variables. The aim is to model the relationships so as to obtain the "best" predicted covariance structure. We discuss some results, their implications, and modifications to the models.










































Clustered Poisson regression or why I don't like GEE


Eugene Demidenko

Dartmouth College, NH

eugened@dartmouth.edu


Statistical properties of a Poisson regression with random intercepts are studied. This is a perfect model for comparing quasi-likelihood estimation, such as GEE, with maximum likelihood because both use the same model up to a factor. In other words, the marginal model (GEE) and conditional model (random effect) are the same for Poisson regression with random intercepts. Five methods of estimation for clustered Poisson regression are considered: standard Poisson regression (naïve), fixed-intercept Poisson model, GEE, Exact GEE (EGEE), and MLE. The beauty of the Poisson model is that the exact covariance matrix can be computed in closed form; the resulting method is called EGEE. All five methods produce consistent estimates of the regression coefficients but have different efficiency in large samples. We derive the asymptotic covariance matrix for each method for any distribution of the random intercept. We compare the methods analytically and test them via simulations. The five split into two groups: naïve Poisson and GEE, and the rest. Although the compound symmetry structure seems natural for the random-intercept model, this working correlation structure never coincides with the true one. Consequently, GEE loses much efficiency and becomes no more efficient than naïve Poisson regression, which ignores cluster correlation. On the other hand, the fixed-intercept approach, EGEE, and MLE are very close, and are the same if the data are balanced.
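The "same model up to a factor" remark can be made concrete with a short derivation. The notation below (u_i for the random intercept of cluster i, x_ij for the covariates, beta for the slopes, tau for the overdispersion factor) is assumed for illustration and is not taken from the talk. The sketch also assumes that u_i is independent of the covariates and that e^{u_i} has a finite second moment.

```latex
\begin{align*}
y_{ij} \mid u_i &\sim \mathrm{Poisson}\!\left(e^{u_i + x_{ij}^{\top}\beta}\right), \\
\mu_{ij} = E(y_{ij}) &= E\!\left(e^{u_i}\right) e^{x_{ij}^{\top}\beta}
  = e^{\beta_0^{*} + x_{ij}^{\top}\beta},
  \qquad \beta_0^{*} = \log E\!\left(e^{u_i}\right), \\
\operatorname{Var}(y_{ij}) &= \mu_{ij} + \tau \mu_{ij}^{2},
  \qquad \operatorname{Cov}(y_{ij}, y_{ik}) = \tau \mu_{ij}\mu_{ik} \quad (j \neq k),
  \qquad \tau = \frac{\operatorname{Var}\!\left(e^{u_i}\right)}{\left[E\!\left(e^{u_i}\right)\right]^{2}}.
\end{align*}
```

Under these assumptions the marginal and conditional slopes coincide and only the intercept shifts. The implied within-cluster correlation depends on the means, so it is generally not constant across pairs when covariates vary within a cluster.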










































Some Information Bounds and Asymptotic Variances


T.M. Durairajan

Department of Statistics
Loyola College
Chennai, India

tmdurairajan2001@yahoo.co.uk


In large-sample methods for inference, there are several examples of asymptotic distributions of statistics whose asymptotic variances are not related to Fisher information. In this paper, we define various information measures and obtain different information bounds, which are the inverses of asymptotic variances in different situations. We also relate these information bounds to the estimation of parameters of interest in the presence of nuisance parameters in finite samples.










































Bayesian Simultaneous Intervals for Small Area Estimation: An Application to Mapping Mortality Rates in U.S. Health Service Areas


Erik Barry Erhardt

Graduate Student, WPI

erike@WPI.EDU


It is customary when presenting a choropleth map of rates or counts to present only the estimates (mean or mode) of the parameters of interest. While this technique illustrates spatial variation, it ignores the variation inherent in the estimates. We describe an approach to present variability in choropleth maps by constructing 100(1-alpha)% simultaneous intervals. The result provides three maps (estimate with two bands).

We propose two methods to construct simultaneous intervals from the optimal individual highest posterior density (HPD) intervals to ensure joint simultaneous coverage of 100(1-alpha)%.

Both methods exhibit the main feature of multiplying the lower bound and dividing the upper bound of the individual HPD intervals by parameters 0
For illustrative purposes we apply our methods to chronic obstructive pulmonary disease (COPD) mortality rates from 1988--92, subset White Males age group 65 and older, for the continental United States consisting of 798 Health Service Areas (HSA).










































How to measure age, race and gender effects in ticketing for speeding in Massachusetts


Dominique Haughton
Bentley College

Phong Nguyen
Bentley College and General Statistics Office, Hanoi

dhaughton@bentley.edu


The claim is sometimes made, as most recently in the Boston Globe (7/20/2003), that "race, sex and age drive ticketing" by police on Massachusetts roads. We use a database of speeding tickets and warnings obtained by the Globe to build a logistic model of who gets ticketed and who only gets warned. The model involves a rather complicated nonlinear function of speed and speed over the speed limit, as well as some interactions, identified with the help of MARS (Multivariate Adaptive Regression Splines). In order to discuss the importance of the race, gender, and age effects relative to the speed effects in the model, we propose a graphical method, since dividing the coefficients by the standard deviation of a variable, as suggested in the literature, is infeasible in a complicated model such as ours. In addition to the speed effects, we find a strong Hispanic effect, some age effects, and a moderate gender effect.










































Modeling Repeated Binary Responses and Time-Dependent Missing Covariates with Application to a Tree with a Two-Year Periodicity in Flowering Intensity


Lan Huang*, Ming-Hui Chen*, Paul R. Neal+, and Gregory J. Anderson+

Department of Statistics* and Ecology and Evolutionary Biology+
University of Connecticut

lan@merlot.stat.uconn.edu


In this paper, we develop a novel modeling strategy for analyzing data with repeated binary responses over time as well as with time-dependent missing covariates. We use the generalized linear mixed model (GLMM) for the repeated binary responses. We then propose a joint model for time-dependent missing covariates using information from other sources. The proposed methodology is well motivated by a real application, namely, a study of Tilia americana (American basswood), a tree with a two-year periodicity in flowering intensity. The data consist of an index of flowering intensity collected from 1974 to 2002. The proposed methodology will be used to identify factors such as defoliation by gypsy moths (Lymantria dispar) and weather conditions that may disrupt the cyclical pattern of flowering.










































Prediction of Co-Regulated Genes using Motif Discovery and Clustering


Shane Jensen

Harvard University

jensen@stat.harvard.edu


Genes are often regulated in living cells by proteins called transcription factors (TFs) that bind directly to short segments of DNA in close proximity to certain target genes. These short segments have a conserved appearance, which is called a motif. Statistical methods for motif discovery are briefly reviewed. We propose a Bayesian hierarchical clustering model for the common structure shared by a set of discovered motifs. This clustering model is implemented using a Gibbs sampling strategy, and several approaches to analyzing the clustering results are discussed. Techniques for motif discovery and motif clustering are used in combination to predict co-regulated genes in the bacterium Bacillus subtilis. Sequences from several closely related species were used to discover motifs conserved by evolution, and these conserved motifs were then used to cluster genes together into putative co-regulated groups. These predicted clusters are validated and examined in detail using several external measures of cell regulation.










































Statistical methods for identifying differential gene-gene interaction patterns


Yinglei Lai, Baolin Wu, Liang Chen and Hongyu Zhao

Center for Statistical Genomics and Proteomics
Department of Epidemiology and Public Health
Yale University School of Medicine

yl335@email.med.yale.edu


To understand cancer mechanisms, it is important to explore molecular changes in cellular processes from the normal state to the cancerous state. In this study, we address statistical methods for identifying differential gene-gene interaction patterns in different cell states. For efficient pattern recognition, we extend the traditional F-statistic and obtain an Expected Conditional F-statistic, which systematically integrates statistical information about differences in locations and correlations. We also propose a statistical method for data transformation to eliminate the outlier problem. Our approach is applied to a microarray gene expression data set from a prostate cancer study.

For a gene of interest, our method can select other genes that have differential gene-gene interaction patterns with this gene in different cell states. The 10 most frequently selected genes include hepsin, GSTP1, and AMACR. These 3 genes were recently proposed to be associated with prostate cancer. However, it is difficult to identify GSTP1 and AMACR by searching for differentially expressed genes. Using the tumor suppressor genes PTEN, RB1, and TP53, we identify 7 genes that also include hepsin, GSTP1, and AMACR. We show that genes associated with cancer may have differential gene-gene interaction patterns in different cell states. Our statistical approach is capable of discovering such patterns.










































ESTIMATING VARIANCE-COVARIANCE STRUCTURE OF THE NONPARAMETRIC REGRESSION DATA WITH TIME SERIES ERRORS


Christian Dahl and Michael Levine
Purdue University

dahlc@mgmt.purdue.edu, mlevins@stat.purdue.edu


Univariate variance estimation in the context of nonparametric regression is by now a fairly extensively researched topic. Up until now, most of the work has been concerned with general properties of the resulting variance estimators, such as asymptotic minimaxity. Potential applications have not been extensively considered (as an exception, we want to mention the article "Efficient estimation of conditional variance functions in stochastic regression" by J. Fan and Q. Yao, first published in Biometrika in 1998).

We introduce a model where the data are heteroscedastic with a time-series-based variance-covariance structure.

Here e_i = Φ e_{i-1} + ε_i is a stationary AR(1) time series. The variance function f(x) is defined on [0,1] and satisfies very unobtrusive smoothness requirements. This model can be used to describe an exchange rate (with the extraneous covariate being, for example, the current interest rate) and is easily generalized to include the trend function g(x). Our main goal is to construct consistent estimators of f(x) and Φ that are easy to compute and possess good asymptotic properties. We introduce both estimators and then discuss their asymptotic properties and convergence rates.










































A COMPARISON OF GOODNESS OF FIT TESTS FOR THE LOGISTIC GEE MODEL


Lingling Li, B.S.
Center for Biostatistics in AIDS Research
Department of Biostatistics
Harvard School of Public Health
lingling@sdac.harvard.edu

Scott Evans, Ph.D.
Center for Biostatistics in AIDS Research
Department of Biostatistics
Harvard School of Public Health
evans@sdac.harvard.edu


Generalized Estimating Equations have become a popular regression method for analyzing clustered binary data. Statistics to assess the goodness of fit of the fitted models have recently been developed including: statistics based on residuals; a statistic using groups based on ranked estimated probabilities; statistics based on covariate partitioning; and a classification statistic. However, evaluations and comparisons of these methods are limited. We discuss these methods and develop two additional statistics to evaluate goodness of fit. We evaluate the performance of each of the statistics with respect to Type I error rates and power in a simulation study. Guidance is provided regarding appropriate use of the statistics under various scenarios.










































A Statistical Model for Multiple High-throughput Protein-Protein Interaction Assay Assessments


Junfeng Liu, Hongyu Zhao

Yale Center for Statistical Genomics and Proteomics
Division of Biostatistics

Department of Epidemiology and Public Health
Yale School of Medicine

junfeng.liu@yale.edu, hongyu.zhao@yale.edu


For a few well-known high-throughput yeast two-hybrid assays conducted in independent labs around the world, this article develops efficient and reliable algorithms to estimate the crucial true positive rate, false positive (negative) rate, coverage rate, and reliability. Given arbitrary assay designs, EM algorithms are developed to obtain mode estimates with (or without) a true positive (negative) gold standard data set. A possibly improved model of the association structure is proposed and tested on mock data sets. Finally, we make comparisons with other assessment approaches in the current biological literature.










































Bayesian inference on stochastic volatility under hidden semi-Markov models


Zhaohui Liu

Department of Statistics
University of Connecticut

Zhaohui.Liu@UConn.edu


In this paper we discuss stochastic volatility (SV) models with hidden semi-Markov regime switches. The models analyzed include a univariate SV model for financial return series and a bivariate model for both returns and transaction volumes. In addition to the AR structure of the volatility process, it is assumed that the volatility follows a semi-Markov regime-switching process. With this new modeling approach, the duration time in each state can follow a different distribution. Therefore the mean and variance of the volatility can vary across states, and the time that the volatility process spends in each state can differ. These features considerably enhance the modeling capacity and better describe an underlying volatility process that might be influenced by different economic forces.

Statistical inference will be carried out via MCMC. In the Bayesian inference framework, predictions can be easily computed. In addition, we will discuss model selection by the marginal likelihood method, the pseudo-Bayes factor, the prediction-based L-measure, and DIC.










































TEST OF INDEPENDENCE BASED ON KENDALL'S PROCESS: TABULATE THE PERCENTILES OF CRAMER-VON MISES STATISTICS BY THE STURM-LIOUVILLE APPROACH


Kilani Ghoudi
United Arab Emirates University
ghoudi@uaeu.ac.ae

L'moudden Ahmed
Université de Sherbrooke
Québec, Canada
lmoudden@dmi.usherb.ca

Jean Vaillancourt
Université du Québec en Outaouais
Canada
vaillancourt@uqo.ca


Let Z_1, ..., Z_n be n ≥ 2 independent copies of a vector Z = (Z^(1), Z^(2)) with distribution function H(z). To test whether H(z) has independent components, we use the Cramer-von Mises statistic based on the limit of Kendall's process, whose covariance function is explicitly known. Genest, Quessy and Rémillard (2002) calculated the percentiles of this statistic by simulation. We propose a very effective numerical approach to calculating these critical values using the covariance function and the Sturm-Liouville differential equation.










































GLOBAL HUMAN DEVELOPMENT: EXPLAINING ITS REGIONAL VARIATIONS*


Robert B. Smith
Social Structural Research Inc.
Cambridge, MA

rsmithphd @ aol.com


The United Nations Development Program (UNDP) ranks countries annually on its human development index (HDI), which combines a country's measures of longevity, literacy, and per capita income. This paper applies hierarchical modeling to quantify the factors that predict a country's HDI rank, explain the variability between regions, σ²_R, and explain the variability between countries within a region, σ²_c. It assesses the effects of nine civilizations: African, Buddhist, Hindu, Japanese, Latin, Moslem, Orthodox, Sinic, and Western. Civilization strongly predicts a country's rank on the HDI, but it does not provide the strongest causal explanation of the variability in the HDI quantified by σ²_R and σ²_c. Among the covariates studied here, present-day slavery (debt bondage, forced labor, chattel slavery, and prostitution) and the lack of political freedom explain much of the variability that is between regions, and corruption explains much of the variability among countries within a region. Additionally, countries with high rates of conflict and social unrest and debt have significantly worse positions on the HDI. Civilizations are best viewed as pointers to underlying social mechanisms like women's education that more directly determine development; its advance may enhance development.



*Author's Note: With contributions by Kevin Bales, who provided several of his measures for analysis, and by Irina Koltoniuc, who helped prepare the analytic database. Helen Fein underscored the importance of slavery and Philip Gibbs of the SAS Institute clarified some of the nuances of PROC MIXED. Andy Baker, Stanley Guterman, and Sreemoti Mukerjee-Roy critiqued earlier drafts. The views expressed here are the author's and do not necessarily reflect the opinions or policies of the United Nations Development Program or any other organization.










































Zero-inflated Poisson Regression Models


Lynn Kuo, Henry R. Kranzler, and Chang Hong Song

Department of Statistics
University of Connecticut

changhon@merlot.stat.uconn.edu


This paper applies the zero-inflated Poisson regression model with random effects to evaluate the effectiveness of the drug naltrexone for the treatment of problem drinkers. Subjects were randomly assigned to four groups: daily placebo, targeted placebo (i.e., on a reduced schedule), daily naltrexone, and targeted naltrexone. Data were collected on alcohol consumption using structured nightly diaries. The outcome variable, the number of drinks per day, has excess zeros when fitted with a Poisson regression model. Therefore, we developed a longitudinal zero-inflated Poisson regression model to evaluate the treatment effect. The results indicate that the daily naltrexone treatment is the most effective, followed by targeted naltrexone and targeted placebo. The results also indicate that women drink much less than men, and that over time subjects tend to reduce their alcohol consumption. Daily data collection is receiving increased attention in clinical trials; this statistical approach provides a new method of analysis for these data.
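The excess-zeros idea can be illustrated with the basic zero-inflated Poisson (ZIP) distribution. The sketch below covers only the simplest case, with one rate and one inflation probability shared by all observations; it has no covariates and no random effects, so it is not the longitudinal model of the talk. The names `zip_logpmf`, `zip_loglik`, `lam`, and `pi` are hypothetical, and it assumes `lam > 0` and `0 <= pi < 1`.

```python
import math


def zip_logpmf(y, lam, pi):
    """Log pmf of a zero-inflated Poisson count.

    With probability pi the count is a structural zero; otherwise it is
    drawn from a Poisson distribution with mean lam.
    """
    if y == 0:
        # A zero can be structural or an ordinary Poisson zero.
        return math.log(pi + (1.0 - pi) * math.exp(-lam))
    poisson_log = -lam + y * math.log(lam) - math.lgamma(y + 1)
    return math.log(1.0 - pi) + poisson_log


def zip_loglik(counts, lam, pi):
    """Log-likelihood of independent ZIP counts (no covariates, no random effects)."""
    return sum(zip_logpmf(y, lam, pi) for y in counts)
```

In a regression version, `lam` and `pi` would typically be linked to covariates such as treatment group and day, and subject-level random effects would account for the repeated daily measurements.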










































Hierarchical models with migration, mutation, and drift: implications for genetic inference


Seongho Song
Department of Statistics,
University of Connecticut
seongho@stat.uconn.edu

Dipak K. Dey
Department of Statistics,
University of Connecticut
dey@stat.uconn.edu

Kent E. Holsinger
Department of Ecology and Evolutionary Biology,
University of Connecticut
kent@darwin.eeb.uconn.edu


Hierarchical structure arises naturally in genetic models. Individuals belonging to the same population are more similar to one another than are those belonging to different populations. Using properties of moment stationarity, we develop exact expressions for the mean and covariance of allele frequencies at a single locus for the 2-level hierarchical structure subject to drift, mutation, and migration. For arbitrary mutation and migration matrices, we generalize previous results to a multilevel hierarchical model. Consequently, we have closed-form expressions for the mean and covariance of allele frequencies in Wright's finite-island model with constant hierarchical migration and several mutations. It turns out that the correlation among populations, the correlation among subpopulations, and the correlation between populations and subpopulations all vanish as the population and subpopulation sizes grow large. We also discuss some implications of our results based on Wright's F-statistics as measures of population structure.

Keywords: F-Statistics; Finite-Island Model; Genetic Drift; Hierarchical Population Structure; Migration; Mutation










































ASYMPTOTICALLY EFFICIENT ESTIMATION OF A SURVIVAL FUNCTION IN THE MISSING CENSORING INDICATOR MODEL


Sundar Subramanian
Department of Mathematics and Statistics
University of Maine
Orono

subraman@germain.umemat.maine.edu


We describe a new estimator of a survival function in the random censorship model when the censoring indicator is missing at random for some study subjects. The proposed approach appeals to a known representation for the survival function, expressible as a smooth functional of a certain conditional probability and the cumulative hazard function of the observed minimum. Well-known estimators are substituted into this representation leading to a simple estimator of the survival function. The new estimator, whose asymptotic variance reduces to that of the Kaplan--Meier estimator when all the censoring indicators are observed, is asymptotically efficient.
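As a point of reference for the complete-data case, the Kaplan-Meier estimator mentioned above can be sketched as follows. This is the standard product-limit estimator and requires every censoring indicator to be observed; it is not the new estimator of the talk, and the function name `kaplan_meier` is hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times: observed minimum of lifetime and censoring time for each subject
    events: 1 if the failure was observed, 0 if the subject was censored
    Returns a list of (t, S(t)) pairs at each distinct observed failure time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects sharing the same observed time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve
```

When some indicators are missing, the `events` vector is only partly known, which is the situation the abstract addresses.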










































Statistical Challenges in the Analysis of Mass Extinctions


Steve C. Wang
Department of Mathematics and Statistics
Swarthmore College

scwang@swarthmore.edu


Much of our knowledge of the history of life comes from the fossil record. However, the fossil record is notoriously incomplete; in fact, usually more data are missing than are observed. This incompleteness presents interesting challenges for paleontologists and statisticians. Here we describe approaches for modeling the incompleteness of the fossil record in the context of mass extinctions. These extinctions - such as the end-Cretaceous event in which the dinosaurs perished - have profoundly shaped the course of life on earth. To infer the causes of mass extinctions, it is important to estimate the times of extinction of the species involved. For instance, how can we determine if a set of species went extinct simultaneously or gradually? If they went extinct simultaneously, how can we estimate their common time of extinction? If they went extinct gradually, how long did the extinctions last? We will discuss methods for answering such questions that take into account the incompleteness of the fossil record.










































Statistical Challenges in the Analysis of Mass Extinctions


Steve C. Wang
Department of Mathematics and Statistics
Swarthmore College

scwang@swarthmore.edu


Much of our knowledge of the history of life comes from the fossil record. However, the fossil record is notoriously incomplete; in fact, usually more data are missing than are observed. This incompleteness presents interesting challenges for paleontologists and statisticians. Here we describe approaches for modeling the incompleteness of the fossil record in the context of mass extinctions. These extinctions - such as the end-Cretaceous event in which the dinosaurs perished - have profoundly shaped the course of life on earth. To infer the causes of mass extinctions, it is important to estimate the times of extinction of the species involved. For instance, how can we determine if a set of species went extinct simultaneously or gradually? If they went extinct simultaneously, how can we estimate their common time of extinction? If they went extinct gradually, how long did the extinctions last? We will discuss methods for answering such questions that take into account the incompleteness of the fossil record.


Return to the Top of This Page

Go to the Next Abstract

Return to the 2004 New England Statistics Symposium Home Page

Return to the Harvard University Statistics Department Home Page








































A Two-Stage Nearest-Neighbor Classifier with Application to Microbial Source Tracking


Jayson D. Wilbur

Department of Mathematical Sciences
Worcester Polytechnic Institute

jwilbur@wpi.edu


In general, nearest-neighbor methods classify an object based on the group membership of the training observations within a certain neighborhood of the object in question. These methods share both the advantages and the disadvantages of other methods for distribution-free inference. In this talk, a two-stage nearest-neighbor classifier is proposed that attempts to exploit the advantages of the (single-stage) nearest-neighbor classifier while reducing the extent to which the classifier is overfit to the training data. The present work is motivated by the problem of microbial source tracking, which attempts to trace the source of bacterial pathogens in water resources using genetic fingerprints. Applications of the proposed methodology to real and simulated data will be presented as time permits.
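
For readers unfamiliar with the baseline, here is a minimal sketch of the ordinary single-stage k-nearest-neighbor rule described in the first sentence. It does not implement the two-stage classifier proposed in the talk. The function name `knn_classify`, the Euclidean distance, and the toy labels are all assumptions made for illustration.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    Euclidean distance is an assumption of this sketch; ties in the vote
    go to the label encountered first among the nearest neighbors.
    """
    if not 1 <= k <= len(train_points):
        raise ValueError("k must be between 1 and the number of training points")
    # Indices of training points, sorted from nearest to farthest.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical two-dimensional "fingerprints" from two source groups.
    points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
    labels = ["human", "human", "human", "livestock", "livestock", "livestock"]
    print(knn_classify(points, labels, (0.5, 0.5), k=3))
```

Small values of k follow the training data closely, which is the overfitting tendency that the abstract says the two-stage proposal aims to reduce.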


Imputing Missing Data by Monotone Blocks


Yaming Yu
Harvard University

yu@stat.harvard.edu


The presence of missing data hinders analyses that may be performed by many different users. Multiple imputation (MI), proposed by Rubin (1976), is an effective methodology for handling the problem. Current state-of-the-art procedures for imputing missing data typically fit fully Bayesian models that assume some joint probability distribution for the underlying complete data. Although principled, joint modeling may not capture important relations among the variables. On the other hand, when the missing data pattern is monotone, we may impute missing data variable by variable in a sequential fashion. This method is principled and flexible; however, it applies only to missing data that conform to a monotone pattern.
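
To make the notion of a monotone pattern concrete, the following sketch checks whether the variables of a data set can be ordered so that, whenever a variable is missing for a unit, all later variables are missing for that unit as well. The function name `monotone_order` and the use of `None` to mark missing values are assumptions of this illustration, which is not part of the proposed method.

```python
def monotone_order(rows):
    """Return a column ordering under which the missingness pattern is
    monotone, or None if no such ordering exists.

    rows: list of equal-length lists, with None marking a missing value.
    """
    if not rows:
        return []
    n_cols = len(rows[0])
    # If a monotone ordering exists, the sets of units missing each
    # variable are nested, so sorting by missing count recovers it.
    missing_count = [sum(row[j] is None for row in rows) for j in range(n_cols)]
    order = sorted(range(n_cols), key=lambda j: missing_count[j])
    for row in rows:
        seen_missing = False
        for j in order:
            if row[j] is None:
                seen_missing = True
            elif seen_missing:
                # An observed value after a missing one breaks monotonicity.
                return None
    return order


if __name__ == "__main__":
    data = [[None, 1, 2], [None, 1, None], [3, 1, 2]]
    print(monotone_order(data))
```

When an ordering is returned, each variable can in principle be imputed in turn from the variables that precede it in that ordering, which is the sequential scheme the paragraph describes.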

We propose a new method, imputation by monotone blocks (IMB), to impute missing data for public-use databases. Here a set of conditional models is specified, and the missing data are iteratively imputed and re-imputed based on these conditional models. At each step of the imputation, a monotone block of missing data is updated. We investigate the frequency properties of this method (bias, interval length, coverage probability for complete-data statistics, etc.) by simulation and derive guidelines on good update strategies for use in practice.


Statistical Methods for Discovering Differentially Expressed Genes in Replicated Microarray Experiments


Lynn Kuo, Fang Yu, and Yifang Zhao
University of Connecticut

lynn@stat.uconn.edu
fangyu@stat.uconn.edu
yifang@stat.uconn.edu


Several statistical methods are available for selecting differentially expressed genes from replicated microarray data. These methods include the Benjamini and Hochberg method for multiple comparisons, significance analysis of microarrays (SAM) by Tusher, Tibshirani, and Chu, the Bayesian t method by Baldi and Long, and the empirical Bayes semiparametric method by Newton. This paper reviews and compares these four methods. ROC curves are constructed to compare them based on three simulated data sets, and we will discuss the results. The advantages of using the empirical Bayes semiparametric method are further illustrated with a real data set from Affymetrix microarrays.
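
Of the four methods, the Benjamini and Hochberg procedure is the simplest to state in code. The sketch below is a plain implementation of the standard step-up rule applied to a list of per-gene p-values. The function name `benjamini_hochberg` and the example p-values are hypothetical, and the sketch says nothing about how the p-values themselves were obtained in the paper.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans marking which hypotheses are rejected
    by the Benjamini-Hochberg step-up procedure at level `fdr`.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Track the largest rank whose p-value is below rank * fdr / m.
        if p_values[i] <= rank * fdr / m:
            cutoff_rank = rank
    rejected = [False] * m
    for i in order[:cutoff_rank]:
        rejected[i] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical p-values for ten genes.
    p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(p, fdr=0.05))
```

Because the rule is step-up, a gene whose own p-value exceeds its threshold is still selected if a gene with a larger p-value meets its threshold.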


Equi-Energy Sampler With Applications to Mixture Model Simulation and Density of States Calculation


Qing Zhou
(Joint work with Samuel Kou and Wing H. Wong)

Harvard University

zhou@stat.harvard.edu


We introduce the Equi-Energy sampler (EE sampler), which is constructed based on a temperature ladder and a series of truncated energy functions. The sample space of the target distribution is partitioned into N ranges according to energy level, and a population of N distributions, each associated with a temperature and a truncated energy function, is simultaneously sampled. The samples generated from each distribution are stored separately in their corresponding energy ranges. One crucial step in the EE sampler is the Equi-Energy jump, which allows the sample in a given chain to jump to a configuration in a neighboring chain with a similar energy level. The EE jump greatly improves the mixing of the system and decreases the sample autocorrelations. We will illustrate the power of the EE sampler through the simulation of multimodal distributions and show by examples that our method can be used efficiently to calculate the density of states and construct estimates for a range of problems.
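
In symbols, the construction can be sketched as follows. The notation is assumed for illustration and the indexing may differ from that used in the talk: $h$ is the energy function, so that the target is $\pi(x) \propto \exp\{-h(x)\}$; $H_1 < \dots < H_N$ are energy cutoffs with $H_1 \le \min_x h(x)$; and $1 = T_1 < \dots < T_N$ is the temperature ladder.

```latex
h_i(x) = \max\{h(x),\, H_i\}, \qquad
\pi_i(x) \propto \exp\{-h_i(x)/T_i\}, \qquad i = 1, \dots, N,

\alpha(x \to y) = \min\left\{ 1,\;
  \frac{\pi_i(y)\,\pi_{i+1}(x)}{\pi_i(x)\,\pi_{i+1}(y)} \right\}
```

Here $\alpha(x \to y)$ is the acceptance probability of an Equi-Energy jump in chain $i$ from its current state $x$ to a state $y$ drawn uniformly from the stored samples of chain $i+1$ that fall in the same energy range as $x$. Under these assumptions $\pi_1$ coincides with the target, while the higher-index chains are flatter and easier to traverse.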

