2004 NESS PROGRAM

9:00 - 9:30 a.m.  REGISTRATION - Coffee and Refreshments (Arcade, 1st Floor Science Center Lobby)

9:30 - 10:00 a.m.  WELCOME (Hall E, Science Center Basement Level)
Donald B. Rubin, Chairman
John L. Loeb Professor of Statistics

MORNING INVITED SESSION
10:00 - 11:30 a.m.  STATISTICS AT GENZYME (Hall E, Science Center Basement Level)
Utilizing historical patients in a clinical trial that became open-label
Samantha Cook
James MacDougall
Elizabeth Stuart

MORNING CONTRIBUTED SESSIONS
STATISTICAL APPLICATIONS (Room B09, Science Center Basement Level)
10:00 - 10:20 a.m.  How to measure age, race and gender effects in ticketing for speeding in Massachusetts
Dominique Haughton
10:20 - 10:40 a.m.  GLOBAL HUMAN DEVELOPMENT: EXPLAINING ITS REGIONAL VARIATIONS*
Robert B. Smith
10:40 - 11:00 a.m.  Statistical Challenges in the Analysis of Mass Extinctions
Steve C. Wang
11:00 - 11:20 a.m.  Zero-inflated Poisson Regression Models
Chang Hong Song
11:20 - 11:30 a.m.  Open Floor Discussion

STATISTICAL MODELING (Room 110, 1st Floor Science Center)
10:00 - 10:20 a.m.  Investigating the Effects of Working Conditions on Health Care Quality via Structural Equation Models
Sonali Das
10:20 - 10:40 a.m.  Estimation in stationary Markov renewal processes, with application to earthquake forecasting in Turkey
Enrique E. Alvarez
10:40 - 11:00 a.m.  Modeling Repeated Binary Responses and Time-Dependent Missing Covariates with Application to a Tree with a Two-Year Periodicity in Flowering Intensity
Lan Huang
11:00 - 11:20 a.m.  Hierarchical models with migration, mutation, and drift: implications for genetic inference
Seongho Song
11:20 - 11:30 a.m.  Open Floor Discussion

REGRESSION, TESTS & CLASSIFICATION (Room 112, 1st Floor Science Center)
10:00 - 10:20 a.m.  ESTIMATING VARIANCE-COVARIANCE STRUCTURE OF THE NONPARAMETRIC REGRESSION DATA WITH TIME SERIES ERRORS
Michael Levine
10:20 - 10:40 a.m.  A COMPARISON OF GOODNESS OF FIT TESTS FOR THE LOGISTIC GEE MODEL
Lingling Li
10:40 - 11:00 a.m.  Clustered Poisson regression or why I don't like GEE
Eugene Demidenko
11:00 - 11:20 a.m.  A Two-Stage Nearest-Neighbor Classifier with Application to Microbial Source Tracking
Jayson D. Wilbur
11:20 - 11:30 a.m.  Open Floor Discussion

11:30 - 11:45 a.m.  BREAK - Coffee and Refreshments (Arcade, 1st Floor Science Center Lobby)

MORNING KEYNOTE PRESENTATION
11:45 a.m. - 12:45 p.m.  GEORGE W. COBB (Hall E, Science Center Basement Level)
Is the Mathematical Statistics Course in a Vegetative State?

12:45 - 2:00 p.m.  LUNCH (Greenhouse Cafe, 1st Floor Science Center)

AFTERNOON KEYNOTE PRESENTATION
2:00 - 3:00 p.m.  ANDREW W. LO (Hall E, Science Center Basement Level)
Temporal Averaging and Nonstationarities in Financial Markets

3:00 - 3:15 p.m.  BREAK - Coffee and Refreshments (Arcade, 1st Floor Science Center Lobby)

AFTERNOON INVITED SESSION
3:15 - 4:45 p.m.  FINANCE AND STATISTICS (Hall E, Science Center Basement Level)
Some Remarks about the Methodology and the Mythology of Financial Markets
Andrew Lyasoff
Market Efficiency, Bayes' Rule and the Recency Bias
Steve Jordan
Shifting paradigms: on the robustness of economic models to heavy-tailedness assumptions
Rustam Ibragimov

AFTERNOON CONTRIBUTED SESSIONS
STATISTICS IN BIOLOGY (Room B09, Science Center Basement Level)
3:15 - 3:35 p.m.  A Statistical Model for Multiple High-throughput Protein-Protein Interaction Assay Assessments
Junfeng Liu
3:35 - 3:55 p.m.  Statistical Methods for Discovering Differentially Expressed Genes in Replicated Microarray Experiments
Yifang Zhao
3:55 - 4:15 p.m.  Prediction of Co-Regulated Genes using Motif Discovery and Clustering
Shane Jensen
4:15 - 4:35 p.m.  Statistical methods for identifying differential gene-gene interaction patterns
Yinglei Lai
4:35 - 4:45 p.m.  Open Floor Discussion

BAYESIAN STATISTICS & STATISTICAL COMPUTING (Room 110, 1st Floor Science Center)
3:15 - 3:35 p.m.  Equi-Energy Sampler With Applications to Mixture Model Simulation and Density of States Calculation
Qing Zhou
3:35 - 3:55 p.m.  Bayesian inference on stochastic volatility under hidden semi-Markov models
Zhaohui Liu
3:55 - 4:15 p.m.  Imputing Missing Data by Monotone Blocks
Yaming Yu
4:15 - 4:35 p.m.  A Bayesian Alternative to the Chi-Squared Test of Association in a Two-Way Categorical Table with Intra-Class Correlation
Jai Won Choi
4:35 - 4:45 p.m.  Open Floor Discussion

MATHEMATICAL STATISTICS (Room 112, 1st Floor Science Center)
3:15 - 3:35 p.m.  ASYMPTOTICALLY EFFICIENT ESTIMATION OF A SURVIVAL FUNCTION IN THE MISSING CENSORING INDICATOR MODEL
Sundar Subramanian
3:35 - 3:55 p.m.  Bayesian Simultaneous Intervals for Small Area Estimation: An Application to Mapping Mortality Rates in U.S. Health Service Areas
Erik Barry Erhardt
3:55 - 4:15 p.m.  Some Information Bounds and Asymptotic Variances
T.M. Durairajan
4:15 - 4:35 p.m.  TEST OF INDEPENDENCE BASED ON KENDALL'S PROCESS: TABULATING THE PERCENTILES OF CRAMÉR-VON MISES STATISTICS BY THE STURM-LIOUVILLE APPROACH
Ahmed L'moudden
4:35 - 4:45 p.m.  Open Floor Discussion

4:45 - 6:30 p.m.  RECEPTION AND INFORMAL PARTY (Harvard University Statistics Department, 7th Floor)

7:00 p.m.  Dinner at Yenching Restaurant (1326 Mass. Ave., Cambridge)

ABSTRACTS

Estimation in stationary Markov renewal processes, with application to earthquake forecasting in Turkey


Enrique E. Alvarez

Department of Statistics
University of Connecticut

ealvarez@merlot.stat.uconn.edu


Consider a process in which different types of events occur, with random inter-occurrence times. In a Markov renewal process, the sequence of events is a Markov chain and the waiting-time distributions depend only on the types of the last and the next event. Suppose that the state space is finite and that the process started far in the past, so that it has reached stationarity. Weibull distributions are proposed for the waiting times, and their parameters are estimated jointly with the transition probabilities by maximum likelihood when one or several realizations of the process are observed over finite windows. The model is illustrated with data on earthquakes of three severity levels that occurred in Turkey during the 20th century.
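
As a sketch of the kind of likelihood involved (generic notation, not taken from the talk): if the embedded chain visits states $J_0, J_1, \ldots, J_m$ with waiting times $X_1, \ldots, X_m$ inside the observation window, the complete-data likelihood factors as
\[
L(P, \theta) \;=\; \prod_{k=1}^{m} p_{J_{k-1} J_k}\, f_{J_{k-1} J_k}\!\left(X_k; \theta_{J_{k-1} J_k}\right),
\]
where $p_{ij}$ are the transition probabilities and $f_{ij}(\cdot\,;\theta_{ij})$ is a Weibull density for the waiting time between an event of type $i$ and one of type $j$; waiting times censored by the edges of the window would contribute survival terms instead.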

A Bayesian Alternative to the Chi-Squared Test of Association in a Two-Way Categorical Table with Intra-Class Correlation


Balgobin Nandram* and Jai Won Choi**

Worcester Polytechnic Institute* and National Center for Health Statistics**

balnan@wpi.edu* and jwc7@cdc.gov**


It is straightforward to analyze data from a single multinomial table. Specifically, for the analysis of a two-way categorical table, the common chi-squared test of independence between the two variables and the maximum likelihood estimators are readily available. When the counts in the two-way categorical table are formed from familial data (clusters of correlated data), however, the common chi-squared test no longer applies. There are several approximate adjustments to the common chi-squared test, but our main contribution is the construction and analysis of a Bayesian model that removes all analytical approximations. This is an extension of the standard multinomial-Dirichlet model to include the intra-class correlation associated with the individuals within a cluster. This intra-class correlation varies with the size of the cluster, but we assume that it is the same for all clusters of the same size for the same variable. We use Markov chain Monte Carlo methods to fit our model and to make posterior inference about the association between the two variables.
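
For orientation, a minimal version of the kind of extension involved (generic notation, not the authors' exact model): if the cell-probability vector of a cluster of size $n$ is drawn as $p \sim \mathrm{Dirichlet}(\mu_1/\tau, \ldots, \mu_K/\tau)$ and the counts as $y \mid p \sim \mathrm{Multinomial}(n, p)$, then
\[
E(y_k) = n\mu_k, \qquad \mathrm{Var}(y_k) = n\mu_k(1-\mu_k)\big[1 + (n-1)\rho\big], \qquad \rho = \frac{\tau}{1+\tau},
\]
so the intra-class correlation $\rho$ inflates the multinomial variance by the factor $1 + (n-1)\rho$, which is what invalidates the usual chi-squared calibration.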

Investigating the Effects of Working Conditions on Health Care Quality via Structural Equation Models


Sonali Das*, Ming-Hui Chen*, Nicholas Warren+, and Dipak Dey*

*Department of Statistics
University of Connecticut

+Assistant Professor of Medicine/Ergonomics Coordinator,
University of Connecticut Health Center

*sonali@stat.uconn.edu
*mhchen@stat.uconn.edu
+warren@nso.uchc.edu
*dey@stat.uconn.edu


The issue of health care quality has recently received significant attention both within and outside the medical community. Working conditions in health care settings, including job health and safety, have direct and indirect impacts on health care quality. Much previous research has focused on a single level or a single group of factors affecting health care quality. In this work, based on a large national survey, we identify a web of factors influencing employees' perceptions of the organization and the outcomes of employee behavior, via variables such as satisfaction, stress, turnover intention, and perception of quality. We discuss different structural equation models (SEMs) based on path analysis and consisting of latent and indicator (manifest) variables. The aim is to model the relationships so as to obtain the "best" predicted covariance structure. We discuss some results, their implications, and modifications to the models.
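
For readers unfamiliar with SEMs, the generic form (standard LISREL-style notation, not the authors' specific paths) couples a measurement model with a structural model:
\[
y = \Lambda_y \eta + \varepsilon, \qquad x = \Lambda_x \xi + \delta, \qquad \eta = B\eta + \Gamma\xi + \zeta,
\]
where $\eta$ and $\xi$ are latent endogenous and exogenous variables, $y$ and $x$ are their manifest indicators, and the fitted path coefficients in $B$ and $\Gamma$ imply a model covariance structure that is compared with the sample covariance matrix.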

Clustered Poisson regression or why I don't like GEE


Eugene Demidenko

Dartmouth College, NH

eugened@dartmouth.edu


Statistical properties of Poisson regression with random intercepts are studied. This is a perfect model for comparing quasi-likelihood estimation, such as GEE, with maximum likelihood, because both use the same model up to a factor. In other words, the marginal model (GEE) and the conditional model (random effects) are the same for Poisson regression with random intercepts. Five methods of estimation for clustered Poisson regression are considered: standard Poisson regression (naïve), the fixed-intercept Poisson model, GEE, exact GEE (EGEE), and MLE. The beauty of the Poisson model is that the exact covariance matrix can be computed in closed form; the estimator that uses it is called EGEE. All five methods produce consistent estimates of the regression coefficients but have different efficiency in large samples. We derive the asymptotic covariance matrix of each method for an arbitrary distribution of the random intercept, compare the methods analytically, and test them via simulations. The five methods split into two groups: naïve Poisson & GEE, and the rest. Although the compound symmetry structure seems natural for the random-intercept model, this working correlation structure never coincides with the true one. Consequently, GEE loses much efficiency and turns out to be no more efficient than naïve Poisson regression, which ignores the cluster correlation. On the other hand, the fixed-intercept approach, EGEE, and MLE are very close, and coincide when the data are balanced.
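
To see the "same model up to a factor" point concretely (generic notation, mine rather than the speaker's): if
\[
y_{ij} \mid u_i \sim \mathrm{Poisson}\!\left(e^{x_{ij}'\beta + u_i}\right)
\]
with random intercepts $u_i$, then the marginal mean is $E\,y_{ij} = E\!\left(e^{u_i}\right) e^{x_{ij}'\beta}$, so the marginal (GEE) mean model differs from the conditional one only through the constant factor $E(e^{u_i})$, which is absorbed into the intercept; the slope parameters targeted by the two approaches are therefore the same.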

Some Information Bounds and Asymptotic Variances


T.M. Durairajan

Department of Statistics
Loyola College
Chennai, India

tmdurairajan2001@yahoo.co.uk


In large-sample methods for inference, there are several examples of asymptotic distributions of statistics whose asymptotic variances are not related to the Fisher information. In this paper, we define various information measures and obtain different information bounds that are the inverses of asymptotic variances in different situations. We also relate these information bounds to the estimation of parameters of interest in the presence of nuisance parameters in finite samples.
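
The prototype of such a bound is the Cramér-Rao inequality (stated here for orientation): for an unbiased estimator $\hat\theta$ of a scalar parameter $\theta$ under the usual regularity conditions,
\[
\mathrm{Var}(\hat\theta) \;\ge\; \frac{1}{I(\theta)}, \qquad I(\theta) = E\!\left[\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)^{\!2}\right],
\]
and an estimator is asymptotically efficient when its asymptotic variance attains the relevant bound; the talk concerns settings where that bound is not the Fisher information itself.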

Bayesian Simultaneous Intervals for Small Area Estimation: An Application to Mapping Mortality Rates in U.S. Health Service Areas


Erik Barry Erhardt

Graduate Student, WPI

erike@WPI.EDU


It is customary when presenting a choropleth map of rates or counts to present only the estimates (mean or mode) of the parameters of interest. While this technique illustrates spatial variation, it ignores the variation inherent in the estimates. We describe an approach to present variability in choropleth maps by constructing 100(1-alpha)% simultaneous intervals. The result provides three maps (estimate with two bands).

We propose two methods to construct simultaneous intervals from the optimal individual highest posterior density (HPD) intervals to ensure joint simultaneous coverage of 100(1-alpha)%.

Both methods share the main feature of multiplying the lower bound and dividing the upper bound of each individual HPD interval by a parameter lying strictly between 0 and 1, chosen so that the intervals attain the desired joint coverage.

For illustrative purposes we apply our methods to chronic obstructive pulmonary disease (COPD) mortality rates from 1988-92, for the subset of White males aged 65 and older, over the continental United States, which consists of 798 Health Service Areas (HSAs).
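
In symbols, the construction above reads as follows (a sketch under the stated features; the calibration details are the authors'): if $(L_i, U_i)$ is the individual $100(1-\alpha)\%$ HPD interval for the rate $\theta_i$ in area $i$, the simultaneous intervals take the form $(\gamma L_i,\, U_i/\gamma)$ with $0 < \gamma < 1$ tuned, e.g. from posterior draws, so that
\[
\Pr\!\left(\gamma L_i \le \theta_i \le U_i/\gamma,\;\; i = 1, \ldots, 798 \;\middle|\; \text{data}\right) = 1-\alpha .
\]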

How to measure age, race and gender effects in ticketing for speeding in Massachusetts


Dominique Haughton
Bentley College

Phong Nguyen
Bentley College and General Statistics Office, Hanoi

dhaughton@bentley.edu


The claim is sometimes made, most recently in the Boston Globe (7/20/2003), that "race, sex and age drive ticketing" by police on Massachusetts roads. We use a database of speeding tickets and warnings obtained by the Globe to build a logistic model of who gets ticketed and who only gets warned. The model involves a rather complicated nonlinear function of speed and of speed over the speed limit, as well as some interactions, identified with the help of MARS (Multivariate Adaptive Regression Splines). To discuss the importance of the race, gender, and age effects relative to the speed effects in the model, we propose a graphical method, since dividing the coefficients by the standard deviation of a variable, as suggested in the literature, is unfeasible in a model as complicated as ours. In addition to the speed effects, we find a strong Hispanic effect, some age effects, and a moderate gender effect.
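
Schematically, the fitted model is of the form (my notation, not the authors'):
\[
\mathrm{logit}\,\Pr(\text{ticket} \mid \text{stopped}) \;=\; s(v,\, v - \text{limit}) \;+\; \beta_{\text{race}} + \beta_{\text{gender}} + f(\text{age}) \;+\; \text{interactions},
\]
where $s(\cdot,\cdot)$ is the nonlinear speed surface built from MARS-selected spline terms; because the demographic effects enter additively on the logit scale while the speed effects do not, their relative importance is more naturally displayed graphically than through standardized coefficients.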

Modeling Repeated Binary Responses and Time-Dependent Missing Covariates with Application to a Tree with a Two-Year Periodicity in Flowering Intensity


Lan Huang*, Ming-Hui Chen*, Paul R. Neal+, and Gregory J. Anderson+

Department of Statistics* and Ecology and Evolutionary Biology+
University of Connecticut

lan@merlot.stat.uconn.edu


In this paper, we develop a novel modeling strategy for analyzing data with repeated binary responses over time as well as with time-dependent missing covariates. We use the generalized linear mixed model (GLMM) for the repeated binary responses. We then propose a joint model for time-dependent missing covariates using information from other sources. The proposed methodology is well motivated by a real application, namely, a study of Tilia americana (American basswood), a tree with a two-year periodicity in flowering intensity. The data consist of an index of flowering intensity collected from 1974 to 2002. The proposed methodology will be used to identify factors such as defoliation by gypsy moths (Lymantria dispar) and weather conditions that may disrupt the cyclical pattern of flowering.
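
The generic form of the response model (standard GLMM notation, not necessarily the authors' exact specification): for binary flowering indicators $y_{it}$ (unit $i$, year $t$),
\[
\mathrm{logit}\,\Pr(y_{it} = 1 \mid b_i) = x_{it}'\beta + b_i, \qquad b_i \sim N(0, \sigma^2),
\]
with the time-dependent covariates $x_{it}$ (defoliation, weather) only partially observed, so that a joint model for the covariate process, informed by other data sources, is fit alongside the GLMM.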

Prediction of Co-Regulated Genes using Motif Discovery and Clustering


Shane Jensen

Harvard University

jensen@stat.harvard.edu


Genes are often regulated in living cells by proteins called transcription factors (TFs) that bind directly to short segments of DNA in close proximity to certain target genes. These short segments have a conserved appearance, which is called a motif. Statistical methods for motif discovery are briefly reviewed. We propose a Bayesian hierarchical clustering model for the common structure shared by a set of discovered motifs. The clustering model is implemented with a Gibbs sampling strategy, and several approaches to analyzing the clustering results are discussed. Techniques for motif discovery and motif clustering are used in combination to predict co-regulated genes in the bacterium Bacillus subtilis. Sequences from several closely related species were used to discover motifs conserved by evolution, and these conserved motifs were then used to cluster genes into putative co-regulated groups. These predicted clusters are validated and examined in detail using several external measures of cell regulation.

Statistical methods for identifying differential gene-gene interaction patterns


Yinglei Lai, Baolin Wu, Liang Chen and Hongyu Zhao

Center for Statistical Genomics and Proteomics
Department of Epidemiology and Public Health
Yale University School of Medicine

yl335@email.med.yale.edu


To understand cancer mechanisms, it is important to explore molecular changes in cellular processes from the normal state to the cancerous state. In this study, we address statistical methods for identifying differential gene-gene interaction patterns in different cell states. For efficient pattern recognition, we extend the traditional F-statistic and obtain an expected conditional F-statistic, which systematically integrates statistical information about differences in locations and correlations. We also propose a statistical method for transforming the data to mitigate the outlier problem. Our approach is applied to a microarray gene expression data set from a prostate cancer study.

For a gene of interest, our method can select other genes that have differential gene-gene interaction patterns with that gene in different cell states. Among the 10 most frequently selected genes are hepsin, GSTP1, and AMACR, three genes recently proposed to be associated with prostate cancer; GSTP1 and AMACR, however, are difficult to identify by screening for differentially expressed genes. Using the tumor suppressor genes PTEN, RB1, and TP53, we identify 7 genes that again include hepsin, GSTP1, and AMACR. We show that genes associated with cancer may have differential gene-gene interaction patterns in different cell states, and that our statistical approach is capable of discovering such patterns.

ESTIMATING VARIANCE-COVARIANCE STRUCTURE OF THE NONPARAMETRIC REGRESSION DATA WITH TIME SERIES ERRORS


Christian Dahl and Michael Levine
Purdue University

dahlc@mgmt.purdue.edu, mlevins@stat.purdue.edu


Univariate variance estimation in the context of nonparametric regression is by now a fairly extensively researched topic. Up until now, most of the work has been concerned with general properties of the resulting variance estimators, such as asymptotic minimaxity. Potential applications were not extensively considered (as an exception, we mention the article "Efficient estimation of conditional variance functions in stochastic regression" by J. Fan and Q. Yao, published in Biometrika in 1998).

We introduce a model in which the data are heteroscedastic with a time-series-based variance-covariance structure; a form consistent with the description that follows is $Y_i = \sqrt{f(x_i)}\, e_i$. Here $e_i = \phi\, e_{i-1} + \eta_i$ is a stationary AR(1) time series. The variance function $f(x)$ is defined on $[0,1]$ and satisfies very mild smoothness requirements. This model can be used to describe an exchange rate (with the extraneous covariate being, for example, the current interest rate) and is easily generalized to include a trend function $g(x)$. Our main goal is to construct consistent estimators of $f(x)$ and $\phi$ that are easy to compute and possess good asymptotic properties. We introduce both estimators and then discuss their asymptotic properties and convergence rates.

A COMPARISON OF GOODNESS OF FIT TESTS FOR THE LOGISTIC GEE MODEL


Lingling Li, B.S.
Center for Biostatistics in AIDS Research
Department of Biostatistics
Harvard School of Public Health
lingling@sdac.harvard.edu

Scott Evans, Ph.D.
Center for Biostatistics in AIDS Research
Department of Biostatistics
Harvard School of Public Health
evans@sdac.harvard.edu


Generalized estimating equations (GEE) have become a popular regression method for analyzing clustered binary data. Statistics for assessing the goodness of fit of the fitted models have recently been developed, including statistics based on residuals, a statistic using groups based on ranked estimated probabilities, statistics based on covariate partitioning, and a classification statistic. However, evaluations and comparisons of these methods are limited. We discuss these methods and develop two additional statistics for evaluating goodness of fit. We evaluate the performance of each of the statistics with respect to Type I error rates and power in a simulation study, and provide guidance on the appropriate use of the statistics under various scenarios.
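
For orientation, the independent-data prototype of the "ranked estimated probabilities" idea is the Hosmer-Lemeshow statistic: observations are sorted by fitted probability into $G$ groups (often deciles), and
\[
\hat{C} \;=\; \sum_{g=1}^{G} \frac{\left(O_g - n_g \bar{\pi}_g\right)^2}{n_g \bar{\pi}_g \left(1 - \bar{\pi}_g\right)},
\]
where $O_g$, $n_g$, and $\bar{\pi}_g$ are the observed event count, the group size, and the mean fitted probability in group $g$; the GEE versions studied in the talk must additionally account for within-cluster correlation when calibrating the reference distribution.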

A Statistical Model for Multiple High-throughput Protein-Protein Interaction Assay Assessments


Junfeng Liu, Hongyu Zhao

Yale Center for Statistical Genomics and Proteomics
Division of Biostatistics

Department of Epidemiology and Public Health
Yale School of Medicine

junfeng.liu@yale.edu, hongyu.zhao@yale.edu


For several well-known high-throughput yeast two-hybrid assays conducted in independent labs around the world, this article develops efficient and reliable algorithms for estimating the crucial operating characteristics: the true positive rate, the false positive (negative) rate, the coverage rate, and the reliability. Given arbitrary assay designs, EM algorithms are developed to obtain mode estimates with (and without) a gold-standard data set of true positives (negatives). Possible improvements to the modeling of the association structure are proposed and tested on mock data sets. Finally, we make comparisons with other assessment approaches in the current biological literature.
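
A minimal sketch of this kind of latent-class calculation (my notation; the authors' model is richer): let $T \in \{0,1\}$ indicate whether a protein pair truly interacts, with $\Pr(T=1)=\pi$, and let $O_k$ indicate whether assay $k$ reports the pair. With per-assay true and false positive rates $s_k$ and $f_k$ and conditional independence given $T$,
\[
\Pr(O_1,\ldots,O_K) = \pi \prod_{k} s_k^{O_k} (1-s_k)^{1-O_k} \;+\; (1-\pi) \prod_{k} f_k^{O_k} (1-f_k)^{1-O_k},
\]
and an EM algorithm alternates between computing the posterior probability of $T=1$ for each pair (E-step) and updating $\pi$, $s_k$, and $f_k$ (M-step); a gold-standard subset simply fixes $T$ for those pairs.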

Bayesian inference on stochastic volatility under hidden semi-Markov models


Zhaohui Liu

Department of Statistics
University of Connecticut

Zhaohui.Liu@UConn.edu


In this paper we discuss stochastic volatility models with hidden semi-Markov regime switches. The models analyzed include univariate SV model for financial return series, and bivariate model with both the return and transaction volumes. In addition to the AR structure of the volatility process, it is assumed that the volatility follows a semi-Markov regime switch process. With this new modeling approach, the duration time at each state will take different distributions. Therefore the mean and variance of the volatility at different states could vary, and the duration times the whole volatility process spent on each state will be different. These facts will considerably enhance the modeling capacity and better describe the underlying volatility process that might be influenced by different economic forces.

Statistical inference will be carried out via the MCMC technique. In Bayesian inference framework, prediction can be easily computed. In addition, we will also discuss model selection by marginal likelihood method, pseudo-Bayes factor, prediction based L-measure, and DIC.
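
One common parameterization of such a model (generic, not necessarily the author's exact specification): with returns $y_t$, log-volatility $h_t$, and a hidden state sequence $s_t$,
\[
y_t = e^{h_t/2}\,\epsilon_t, \qquad h_t = \mu_{s_t} + \phi\left(h_{t-1} - \mu_{s_{t-1}}\right) + \sigma_\eta\, \eta_t,
\]
with $\epsilon_t$ and $\eta_t$ independent standard normals; the semi-Markov feature replaces the geometric sojourn times of a hidden Markov chain with state-specific duration distributions governing how long $s_t$ stays in each regime.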

TEST OF INDEPENDENCE BASED ON KENDALL'S PROCESS: TABULATING THE PERCENTILES OF CRAMÉR-VON MISES STATISTICS BY THE STURM-LIOUVILLE APPROACH


Kilani Ghoudi
United Arab Emirates University
ghoudi@uaeu.ac.ae

L'moudden Ahmed
Université de Sherbrooke
Québec, Canada
lmoudden@dmi.usherb.ca

Jean Vaillancourt
Université du Québec en Outaouais
Canada
vaillancourt@uqo.ca


Let $Z_1, \ldots, Z_n$ be $n \ge 2$ independent copies of a vector $Z = (Z^{(1)}, Z^{(2)})$ with distribution function $H(z)$. To test whether $H(z)$ has independent components, we use a Cramér-von Mises statistic based on the limit of Kendall's process, whose covariance function is explicitly known. Genest, Quessy and Rémillard (2002) calculated the percentiles of this statistic by simulation. We propose a very effective numerical approach to calculating these critical values, using the covariance function and the Sturm-Liouville differential equation.
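
For background (standard notation from the Kendall's-process literature; the authors' exact weighting may differ): let $V_i = H_n(Z_i)$ be the empirical probability-integral transforms and $K_n$ their empirical distribution function. Under independence of the two continuous components, the limiting Kendall distribution is $K_0(t) = t - t\log t$, and a Cramér-von Mises functional of the process is
\[
S_n = \int_0^1 \kappa_n(t)^2 \, dt, \qquad \kappa_n(t) = \sqrt{n}\,\big(K_n(t) - K_0(t)\big),
\]
whose limiting percentiles can be obtained from the eigenvalues of the covariance operator; this eigenvalue problem is where the Sturm-Liouville equation enters.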

GLOBAL HUMAN DEVELOPMENT: EXPLAINING ITS REGIONAL VARIATIONS*


Robert B. Smith
Social Structural Research Inc.
Cambridge, MA

rsmithphd@aol.com


The United Nations Development Program (UNDP) ranks countries annually on its human development index (HDI), which combines a country's measures of longevity, literacy, and per capita income. This paper applies hierarchical modeling to quantify the factors that predict a country's HDI rank, explain the variability between regions, $\sigma^2_R$, and explain the variability between countries within a region, $\sigma^2_C$. It assesses the effects of nine civilizations: African, Buddhist, Hindu, Japanese, Latin, Moslem, Orthodox, Sinic, and Western. Civilization strongly predicts a country's rank on the HDI, but it does not provide the strongest causal explanation of the variability in the HDI quantified by $\sigma^2_R$ and $\sigma^2_C$. Among the covariates studied here, present-day slavery (debt bondage, forced labor, chattel slavery, and prostitution) and the lack of political freedom explain much of the variability between regions, and corruption explains much of the variability among countries within a region. Additionally, countries with high rates of conflict and social unrest and high debt have significantly worse positions on the HDI. Civilizations are best viewed as pointers to underlying social mechanisms, like women's education, that more directly determine development; advancing those mechanisms may enhance development.
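
The variance components have the usual two-level interpretation (generic random-intercept form, not the author's full specification): with countries $c$ nested in regions $r$,
\[
\mathrm{HDI}_{cr} = \beta_0 + x_{cr}'\beta + u_r + e_{cr}, \qquad u_r \sim N(0, \sigma^2_R), \quad e_{cr} \sim N(0, \sigma^2_C),
\]
so a covariate "explains" between-region variability to the extent that adding it to $x_{cr}$ shrinks the estimate of $\sigma^2_R$, and within-region variability to the extent that it shrinks $\sigma^2_C$.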



*Author's Note: With contributions by Kevin Bales, who provided several of his measures for analysis, and by Irina Koltoniuc, who helped prepare the analytic database. Helen Fein underscored the importance of slavery and Philip Gibbs of the SAS Institute clarified some of the nuances of PROC MIXED. Andy Baker, Stanley Guterman, and Sreemoti Mukerjee-Roy critiqued earlier drafts. The views expressed here are the author's and do not necessarily reflect the opinions or policies of the United Nations Development Program or any other organization.

Zero-inflated Poisson Regression Models


Lynn Kuo, Henry R. Kranzler, and Chang Hong Song

Department of Statistics
University of Connecticut

changhon@merlot.stat.uconn.edu


This paper applies the zero-inflated Poisson regression model with random effects to evaluate the effectiveness of the drug naltrexone for the treatment of problem drinkers. Subjects were randomly assigned to four groups: daily placebo, targeted placebo (i.e., on a reduced schedule), daily naltrexone, and targeted naltrexone. Data were collected on alcohol consumption using structured nightly diaries. The outcome variable, the number of drinks per day, has excess zeros when fitted with a Poisson regression model, so we developed a longitudinal zero-inflated Poisson regression model to evaluate the treatment effect. The results indicate that the daily naltrexone treatment is the most effective, followed by targeted naltrexone and targeted placebo. The results also indicate that women drink much less than men, and that over time subjects tend to reduce their alcohol consumption. Daily data collection is receiving increased attention in clinical trials; this statistical approach provides a new method for analyzing such data.
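
The zero-inflated Poisson model mixes a point mass at zero with a Poisson count (standard form stated for orientation; the longitudinal version adds subject-level random effects to the linear predictors):
\[
\Pr(Y = 0) = \pi + (1-\pi)e^{-\lambda}, \qquad \Pr(Y = y) = (1-\pi)\frac{e^{-\lambda}\lambda^{y}}{y!}, \quad y = 1, 2, \ldots,
\]
with, e.g., $\mathrm{logit}\,\pi = z'\gamma$ for the probability of a structural zero (a no-drinking day) and $\log\lambda = x'\beta$ for the count intensity.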

Hierarchical models with migration, mutation, and drift: implications for genetic inference


Seongho Song
Department of Statistics,
University of Connecticut
seongho@stat.uconn.edu

Dipak K. Dey
Department of Statistics,
University of Connecticut
dey@stat.uconn.edu

Kent E. Holsinger
Department of Ecology and Evolutionary Biology,
University of Connecticut
kent@darwin.eeb.uconn.edu


Hierarchical structure arises naturally in genetic models: individuals belonging to the same population are more similar to one another than are those belonging to different populations. Using properties of moment stationarity, we develop exact expressions for the mean and covariance of allele frequencies at a single locus for a two-level hierarchical structure subject to drift, mutation, and migration. For arbitrary mutation and migration matrices, we generalize previous results to a multilevel hierarchical model. Consequently, we have closed-form expressions for the mean and covariance of allele frequencies in Wright's finite-island model with constant hierarchical migration and several mutation schemes. It turns out that the correlations among populations, among subpopulations, and between populations and subpopulations vanish as the population and subpopulation sizes become large. We also discuss some implications of our results for Wright's F-statistics as measures of population structure.
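
For readers less familiar with Wright's F-statistics, the standard hierarchical decomposition (a textbook identity, not a result of the talk) is
\[
1 - F_{IT} = (1 - F_{IS})(1 - F_{ST}),
\]
where $F_{IS}$ measures the correlation of alleles within individuals relative to their subpopulation, $F_{ST}$ the differentiation of subpopulations relative to the total population, and $F_{IT}$ the combined effect; the closed-form means and covariances derived in the talk yield such measures directly.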

Keywords: F-Statistics; Finite-Island Model; Genetic Drift; Hierarchical Population Structure; Migration; Mutation

ASYMPTOTICALLY EFFICIENT ESTIMATION OF A SURVIVAL FUNCTION IN THE MISSING CENSORING INDICATOR MODEL


Sundar Subramanian
Department of Mathematics and Statistics
University of Maine
Orono

subraman@germain.umemat.maine.edu


We describe a new estimator of a survival function in the random censorship model when the censoring indicator is missing at random for some study subjects. The proposed approach appeals to a known representation for the survival function, expressible as a smooth functional of a certain conditional probability and the cumulative hazard function of the observed minimum. Well-known estimators are substituted into this representation leading to a simple estimator of the survival function. The new estimator, whose asymptotic variance reduces to that of the Kaplan--Meier estimator when all the censoring indicators are observed, is asymptotically efficient.
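
The representation in question is of the following type (standard in this literature; regularity conditions omitted): writing $Z = \min(T, C)$ for the observed minimum, $\Lambda_Z$ for its cumulative hazard, and $p(s) = \Pr(\delta = 1 \mid Z = s)$ for the conditional probability that an observation at $s$ is uncensored,
\[
S(t) = \exp\!\left\{-\int_0^t p(s)\, d\Lambda_Z(s)\right\}
\]
in the continuous case, so substituting a nonparametric estimator of $p$ (fit from the subjects whose censoring indicators are observed) and the Nelson-Aalen estimator of $\Lambda_Z$ yields a simple plug-in estimator of $S$.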

Statistical Challenges in the Analysis of Mass Extinctions


Steve C. Wang
Department of Mathematics and Statistics
Swarthmore College

scwang@swarthmore.edu


Much of our knowledge of the history of life comes from the fossil record. However, the fossil record is notoriously incomplete; in fact, usually more data are missing than are observed. This incompleteness presents interesting challenges for paleontologists and statisticians. Here we describe approaches for modeling the incompleteness of the fossil record in the context of mass extinctions. These extinctions - such as the end-Cretaceous event in which the dinosaurs perished - have profoundly shaped the course of life on earth. To infer the causes of mass extinctions, it is important to estimate the times of extinction of the species involved. For instance, how can we determine if a set of species went extinct simultaneously or gradually? If they went extinct simultaneously, how can we estimate their common time of extinction? If they went extinct gradually, how long did the extinctions last? We will discuss methods for answering such questions that take into account the incompleteness of the fossil record.
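
A classical example of such a method, stated here for orientation under the strong assumption of uniformly and independently distributed fossil finds (the Strauss-Sadler approach; the talk's methods go beyond it): if a species' $N$ finds span an observed stratigraphic range $R$, a one-sided confidence bound at level $C$ extends the observed range endpoint by
\[
\Delta = R\left[(1 - C)^{-1/(N-1)} - 1\right],
\]
so sparser records (small $N$) force much longer range extensions, which is one precise sense in which incompleteness limits inference about extinction times.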

A Two-Stage Nearest-Neighbor Classifier with Application to Microbial Source Tracking


Jayson D. Wilbur

Department of Mathematical Sciences
Worcester Polytechnic Institute

jwilbur@wpi.edu


In general, nearest-neighbor methods classify an object based on the group membership of the training observations within a certain neighborhood of the object in question. These methods share both the advantages and the disadvantages of other methods for distribution-free inference. In this talk a two-stage nearest-neighbor classifier is proposed that attempts to exploit the advantages of the (single-stage) nearest-neighbor classifier while reducing the extent to which the classifier is overfit to the training data. The present work is motivated by the problem of microbial source tracking, which attempts to trace the source of bacterial pathogens in water resources using genetic fingerprints. Applications of the proposed methodology to real and simulated data will be presented as time permits.
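
As a toy illustration of the general idea (a Python sketch with entirely hypothetical structure; the speaker's classifier differs in its details), one can run a coarse nearest-neighbor vote first and then re-classify only among the shortlisted candidate sources:

    import numpy as np

    def knn_labels(X_train, y_train, x, k):
        """Return the labels of the k nearest training points to x."""
        d = np.linalg.norm(X_train - x, axis=1)
        return y_train[np.argsort(d)[:k]]

    def two_stage_knn(X_train, y_train, x, k1=15, k2=5, shortlist=2):
        """Stage 1: shortlist the most common classes among k1 neighbors.
        Stage 2: classify with k2 neighbors restricted to those classes."""
        labels, counts = np.unique(knn_labels(X_train, y_train, x, k1),
                                   return_counts=True)
        keep = labels[np.argsort(counts)[-shortlist:]]   # candidate sources
        mask = np.isin(y_train, keep)                    # restrict training set
        votes = knn_labels(X_train[mask], y_train[mask], x, k2)
        labels2, counts2 = np.unique(votes, return_counts=True)
        return labels2[np.argmax(counts2)]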

Imputing Missing Data by Monotone Blocks


Yaming Yu
Harvard University

yu@stat.harvard.edu


The presence of missing data hinders analyses that may be performed by many different users. Multiple imputation (MI), proposed by Rubin (1976), is an effective methodology for handling the problem. The current state-of-the-art procedures for imputing missing data typically fit fully Bayesian models that assume some joint probability distribution for the underlying complete data. Although principled, joint modeling may not capture important relations among the variables. On the other hand, when the missing-data pattern is monotone, we may impute missing data variable by variable in a sequential fashion. This method is principled and flexible; however, it applies only to missing data that conform to a monotone pattern.

We propose a new method, imputation by monotone blocks (IMB), to impute missing data for public-use databases. Here a set of conditional models is specified, and the missing data are iteratively imputed and re-imputed based on these conditional models; at each step of the imputation a monotone block of missing data is updated. We investigate the frequency properties of this method (bias, interval length, coverage probability for complete-data statistics, etc.) by simulation and derive guidelines on good update strategies for use in practice.
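
The monotone building block works as follows (a standard MI fact, stated for orientation): if the variables can be ordered $Y_1, \ldots, Y_p$ so that whenever $Y_j$ is missing for a case, $Y_{j+1}, \ldots, Y_p$ are missing too, the predictive distribution of the missing data factors into a sequence of one-variable models,
\[
f(Y_{\text{mis}} \mid Y_{\text{obs}}) = f(Y_2 \mid Y_1)\, f(Y_3 \mid Y_1, Y_2) \cdots f(Y_p \mid Y_1, \ldots, Y_{p-1}),
\]
each fit on the cases where its left-hand variable is observed; IMB applies this exact step repeatedly to monotone sub-blocks of a non-monotone pattern.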

Statistical Methods for Discovering Differentially Expressed Genes in Replicated Microarray Experiments


Lynn Kuo, Fang Yu, and Yifang Zhao
University of Connecticut

lynn@stat.uconn.edu
fangyu@stat.uconn.edu
yifang@stat.uconn.edu


Several statistical methods are available for selecting differentially expressed genes from replicated microarray data. These methods include the Benjamini and Hochberg method for multiple comparisons, significance analysis of microarrays (SAM) by Tusher, Tibshirani, and Chu, the Bayesian t method by Baldi and Long, and the empirical Bayes semiparametric method by Newton. This paper reviews and compares these four methods: ROC curves are constructed to compare them on three simulated data sets, and we discuss the results. The advantages of the empirical Bayes semiparametric method are further illustrated on a real data set from Affymetrix microarrays.

Equi-Energy Sampler With Applications to Mixture Model Simulation and Density of States Calculation


Qing Zhou
(Joint work with Samuel Kou and Wing H. Wong)

Harvard University

zhou@stat.harvard.edu


We introduce the equi-energy sampler (EE sampler), which is constructed from a temperature ladder and a series of truncated energy functions. The sample space of the target distribution is partitioned into N ranges according to energy level, and a population of N distributions, each associated with a temperature and a truncated energy function, is sampled simultaneously. The samples generated from each distribution are stored separately in their corresponding energy ranges. One crucial step in the EE sampler is the equi-energy jump, which allows the sample in a given chain to jump to a configuration from a neighboring chain with a similar energy level. The EE jump greatly improves the mixing of the system and decreases the sample autocorrelations. We illustrate the power of the EE sampler through the simulation of multimodal distributions and show by examples that the method can be used to calculate the density of states efficiently and to construct estimates for a range of problems.
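
The jump step is a Metropolis-Hastings move (sketched here in generic form; see the Kou-Zhou-Wong construction presented in the talk for the exact details): if chain $i$ with target $\pi_i$ is at $x$ and proposes a stored state $y$ from the neighboring chain $i+1$ lying in the same energy ring, the move is accepted with probability
\[
\min\left\{1,\; \frac{\pi_i(y)\,\pi_{i+1}(x)}{\pi_i(x)\,\pi_{i+1}(y)}\right\},
\]
which preserves the stationary distributions while letting the chain move between well-separated modes of similar energy.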

Shifting paradigms: on the robustness of economic models to heavy-tailedness assumptions


Rustam Ibragimov
Department of Economics
Yale University
rustam.ibragimov@yale.edu


The structure of many models in economics and finance depends on majorization properties of convolutions of distributions. In this paper, we analyze the robustness of these properties, and of the models based on them, to heavy-tailedness assumptions. We show, in particular, that majorization properties of linear combinations of log-concavely distributed signals are reversed for very long-tailed distributions. As applications of the results, we study the robustness of monotone consistency of the sample mean, value-at-risk analysis, and the model of demand-driven innovation and spatial competition, as well as that of optimal bundling strategies for a multiproduct monopolist under an arbitrary degree of complementarity or substitutability among the goods. The implications of the models remain valid for distributions that are not too heavy-tailed; however, their main properties are reversed in the very thick-tailed setting.

Market Efficiency, Bayes' Rule and the Recency Bias


Steve Jordan
Ph.D. Candidate in Financial Economics
Yale School of Management
steven.jordan@yale.edu


There is a large psychological and behavioral literature claiming that Bayes' rule is not a valid model of the human decision process. The strongest evidence proposed against Bayes' rule in financial markets is long-term overreaction (De Bondt and Thaler (1985)) and excess volatility (Shiller (1981)), and under a random walk model of market efficiency, simulations support these claims. I propose a random walk with noise as an intuitively appealing alternative model of market efficiency, and demonstrate within a Bayesian framework that both of these market anomalies are consistent with, and predicted by, a random walk with noise, resolving some long-standing debates. The intuition is simple. Under a random walk, prices must adjust fully and immediately to new information. This constraint binds price behavior and causes violations of many observed market phenomena. Under a random walk with noise, however, there is an extra degree of freedom, the signal-to-noise ratio, which relaxes this overly restrictive constraint of immediate and full reaction. My model allows me to be the first to empirically estimate the signal-to-noise ratio in the market. I find a significant level of noise, with the variance of the noise about 5 times that of the information.
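
A standard way to write such a model (a local-level state-space form, assumed here for illustration rather than quoted from the talk): the observed log price $p_t$ is a noisy measurement of an efficient log price $m_t$ that follows a random walk,
\[
p_t = m_t + \varepsilon_t, \qquad m_t = m_{t-1} + \eta_t, \qquad \varepsilon_t \sim N(0, \sigma_\varepsilon^2), \quad \eta_t \sim N(0, \sigma_\eta^2),
\]
with signal-to-noise ratio $q = \sigma_\eta^2 / \sigma_\varepsilon^2$; the reported finding of noise variance about 5 times the information variance corresponds to $q \approx 1/5$.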

Some Remarks about the Methodology and the Mythology of Financial Markets


Andrew Lyasoff
Department of Mathematics and Statistics
Boston University
alyasoff@math.bu.edu


Many aspects of modern finance stem from the belief that, generally, one can think of daily rates of return on stocks as independent samples from one and the same (stock-specific) probability law. I will briefly review some of the key statistical tests that have been used in the past to justify this claim and will show, with a concrete example, that the results from these tests are just as compatible with time series that exhibit self-regulatory behavior and are therefore very different from random walks. Another common belief among some practitioners is that, unless returns behave as random walks, there will be fortunes to be made on the stock market out of nothing. This is quite surprising, because the conditions under which markets do or do not allow arbitrage have been studied extensively over the last 20 years, and it has long been established that very general models for stock prices are arbitrage-free. I will briefly review the principles from which such conditions can be derived and will propose an alternative to the random walk model for stock returns.

Is the Mathematical Statistics Course in a Vegetative State?

George W. Cobb
Department of Mathematics and Statistics
Mount Holyoke College

GCobb@MtHolyoke.edu



The future of the statistics profession depends on success in recruiting students to our subject. There will always be a need for statistics, of course, but there is a growing risk that the work of statistics will be taken over by people who think of themselves not as statisticians but as molecular biologists, market researchers, computer scientists, and others who specialize in any of the applied fields that use statistics.

What all these applications of statistics have in common, the glue that holds our universe together, is a reliance on mathematical models of uncertainty. Even as statistics has expanded the breadth and value of its achievements in applied areas, the core of our subject remains mathematical, and our future depends on attracting mathematically talented students. At most undergraduate colleges, the burden of doing that attracting falls to the mathematical statistics course, a course that has changed very little in half a century, despite extraordinary changes in the practice of statistics.

In my talk I will offer some criticisms of the traditional math stat course, and suggest some remedies and alternatives.

Temporal Averaging and Nonstationarities in Financial Markets

Andrew W. Lo
MIT

alo@mit.edu


Whitney K. Newey
MIT

wnewey@mit.edu



A common practice among quantitative financial analysts for dealing with non-stationary time series is to apply an exponentially declining weighting scheme to the data so that more distant observations are given less weight than more recent observations. We show that this practice of temporal averaging is incorrect for all but linear estimators, implying that variances, covariances, betas, Sharpe ratios, Value-at-Risk, and many other common financial statistics are incorrectly estimated with exponentially weighted time series. We propose an alternative approach to temporal averaging that yields unbiased estimators under the null hypothesis of independently and identically distributed observations, and which has attractive asymptotic properties under several stationary and nonstationary alternative hypotheses. Using this approach, we derive temporally averaged estimators for all the usual financial statistics and apply them to recent historical stock market data to demonstrate their empirical properties.
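
The weighting scheme at issue has the familiar exponential form (a generic illustration, not the authors' estimator): with decay $0 < \lambda < 1$, a temporally averaged moment uses weights $w_k = (1-\lambda)\lambda^k$ on lags $k = 0, 1, 2, \ldots$, e.g.
\[
\hat{\sigma}^2_t = (1-\lambda) \sum_{k \ge 0} \lambda^k\, r_{t-k}^2 .
\]
The paper's point is that such weighting is valid only for estimators linear in the data, so statistics built nonlinearly from weighted moments (betas, Sharpe ratios, VaR quantiles) are mis-estimated; by Jensen's inequality, $E\,g(\hat{m}) \ne g(E\,\hat{m})$ for nonlinear $g$.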

Utilizing historical patients in a clinical trial that became open-label


Samantha Cook
Harvard University
cook@stat.harvard.edu

James MacDougall
Genzyme Corporation
James.MacDougall@genzyme.com

Elizabeth Stuart
Harvard University
stuart@stat.harvard.edu


In a recent FDA trial, the drug under investigation was made commercially available before the end of the trial. Patients in the trial therefore had the option of going off the trial protocol and obtaining the commercially available active therapy. When patients randomized to placebo switch to the commercially available therapy, they cease to be controls in the usual sense: all measurements taken after a placebo control switches to active therapy are treated as missing. We propose a method for imputing the placebo controls' missing outcomes, as if they had stayed on placebo. There are two key phases to this process.

The first phase selects the historical patients who look as if they could plausibly have been enrolled in a similar randomized trial at some point in their observation history. This selection involves two particular complications. The first is missing covariate data, which is dealt with by estimating propensity scores using a general location model. The second is the need to define a baseline for each of the historical patients.

The second phase, the imputation phase, involves first, fitting a Bayesian hierarchical regression model to data from untreated historical patients; second, incorporating information learned from the historical patients into a similar model for placebo controls; and third, using this model and observed on-protocol data to impute missing values for placebo controls who switched to active therapy. Once missing values have been multiply imputed, the completed data sets can be analyzed as planned, and their results combined in a straightforward way.
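
The "straightforward" combination of results is standard multiple-imputation practice (Rubin's rules, stated here for completeness): with point estimates $\hat{q}_1, \ldots, \hat{q}_m$ and within-imputation variances $W_1, \ldots, W_m$ from the $m$ completed data sets,
\[
\bar{q} = \frac{1}{m}\sum_{l=1}^m \hat{q}_l, \qquad T = \bar{W} + \left(1 + \frac{1}{m}\right) B,
\]
where $\bar{W}$ is the average within-imputation variance and $B$ is the between-imputation variance of the $\hat{q}_l$; inferences are then based on $\bar{q}$ with total variance $T$.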

This session will consist of three talks. Jim MacDougall of Genzyme will first present an overview of the trial. The second talk, by Elizabeth Stuart, describes the process of selecting the historical patients. In the third talk, Samantha Cook will describe the imputation model used to impute the missing placebo outcomes.
