| 2004 NESS PROGRAM | |
9:00 - 9:30 a.m. | REGISTRATION - Coffee and Refreshments | Arcade - 1st Floor Science Center Lobby |
9:30 - 10:00 a.m. | WELCOME | Hall E - Science Center Basement Level |
| Donald B. Rubin, Chairman, John L. Loeb Professor of Statistics | |
| MORNING CONTRIBUTED SESSIONS |
11:30 - 11:45 a.m. | BREAK - Coffee and Refreshments | Arcade - 1st Floor Science Center Lobby |
| MORNING KEYNOTE PRESENTATION |
12:45 - 2:00 p.m. | LUNCH | Greenhouse Cafe - 1st Floor Science Center |
| AFTERNOON KEYNOTE PRESENTATION |
3:00 - 3:15 p.m. | BREAK - Coffee and Refreshments | Arcade - 1st Floor Science Center Lobby |
| AFTERNOON INVITED SESSION | |
| AFTERNOON CONTRIBUTED SESSIONS |
4:45 - 6:30 p.m. | RECEPTION AND INFORMAL PARTY | Harvard University Statistics Department, 7th Floor |
Estimation in stationary Markov renewal processes, with application to earthquake forecasting in Turkey
Enrique E. Alvarez
Department of Statistics
University of Connecticut
ealvarez@merlot.stat.uconn.edu
Consider a process in which different events occur, with random inter-occurrence times. In Markov renewal processes, the sequence of events is a Markov chain and the waiting-time distributions depend only on the types of the last and the next event. Suppose that the state space is finite and that the process started far in the past, achieving stationarity. Weibull distributions are proposed for the waiting times and their parameters are estimated jointly with the transition probabilities through maximum likelihood, when one or several realizations of the process are observed over finite windows. The model is illustrated with data on earthquakes of three levels of severity that occurred in Turkey during the 20th century.
A Bayesian Alternative to the Chi-Squared Test of Association in a Two-Way Categorical Table with Intra-Class Correlation
Balgobin Nandram* and Jai Won Choi**
Worcester Polytechnic Institute* and National Center for Health Statistics**
balnan@wpi.edu* and jwc7@cdc.gov**
Investigating the Effects of Working Conditions on Health Care Quality via Structural Equation Models
Sonali Das*, Ming-Hui Chen*, Nicholas Warren+, and Dipak Dey*
*Department of Statistics
University of Connecticut
+Assistant Professor of Medicine/Ergonomics Coordinator,
University of Connecticut Health Center
*sonali@stat.uconn.edu
*mhchen@stat.uconn.edu
+warren@nso.uchc.edu
*dey@stat.uconn.edu
The issue of health care quality has recently received significant attention both within and outside the medical community. Working conditions in health care settings, including job health and safety, have direct and indirect impacts on health care quality. Much previous research has tended to focus on a single level or group of factors that affect health care quality. In this work, based on a large national survey, we identify a web of factors influencing employees' perception of the organization, and outcomes of employee behavior via variables such as satisfaction, stress, turnover intention and perception of quality. We discuss different structural equation models (SEMs) based on path analysis consisting of latent and indicator (manifest) variables. The aim here is to model the relationships to get the "best" predicted covariance structure. We discuss some results, their implications and modifications to the models.
Clustered Poisson regression or why I don't like GEE
Eugene Demidenko
Dartmouth College, NH
eugened@dartmouth.edu
Statistical properties of a Poisson regression with random intercepts are studied. This is a perfect model to compare quasi-likelihood estimation, such as GEE, with maximum likelihood because both use the same model up to a factor. In other words, the marginal model (GEE) and conditional model (random effect) are the same for Poisson regression with random intercepts. Five methods of estimation for clustered Poisson regression are considered: standard Poisson regression (naïve), fixed-intercept Poisson model, GEE, Exact GEE (EGEE), and MLE. The beauty of the Poisson model is that the exact covariance matrix can be computed in closed form; the resulting method is called EGEE. All five methods produce consistent estimates for regression coefficients but have different efficiency in large samples. We derive the asymptotic covariance matrix for each method for any distribution of the random intercept. We compare the methods analytically and test them via simulations. The five methods split into two groups: naïve Poisson and GEE, and the rest. Although the compound symmetry structure seems natural for the random-intercept model, this working correlation structure never coincides with the true one. Consequently, GEE loses much efficiency and becomes no more efficient than naïve Poisson regression, which ignores cluster correlation. On the other hand, the fixed-intercept approach, EGEE, and MLE are very close and coincide if the data are balanced.
Some Information Bounds and Asymptotic Variances
T.M. Durairajan
Department of Statistics
Loyola College
Chennai, India
tmdurairajan2001@yahoo.co.uk
In large sample methods for inference, there are several examples of asymptotic distributions of statistics whose asymptotic variances are not related to Fisher Information. In this paper, we define various information measures and obtain different information bounds which are the inverse of asymptotic variances in different situations. We also relate these information bounds to the estimation of parameters of interest in the presence of nuisance parameters in finite samples.
Bayesian Simultaneous Intervals for Small Area Estimation: An Application to Mapping Mortality Rates in U.S. Health Service Areas
Erik Barry Erhardt
Graduate Student, WPI
erike@WPI.EDU
It is customary when presenting a choropleth map of rates or counts to present only the estimates (mean or mode) of the parameters of interest. While this technique illustrates spatial variation, it ignores the variation inherent in the estimates. We describe an approach to present variability in choropleth maps by constructing 100(1-alpha)% simultaneous intervals. The result provides three maps (estimate with two bands).
We propose two methods to construct simultaneous intervals from the optimal individual highest posterior density (HPD) intervals to ensure joint simultaneous coverage of 100(1-alpha)%.
Both methods exhibit the main feature of multiplying the lower bound and dividing the upper bound of the individual HPD intervals by parameters 0
For illustrative purposes we apply our methods to chronic obstructive pulmonary disease (COPD) mortality rates from 1988-92 for White males aged 65 and older in the continental United States, which consists of 798 Health Service Areas (HSAs).
How to measure age, race and gender effects in ticketing for speeding in Massachusetts
Dominique Haughton
Bentley College
Phong Nguyen
Bentley College and General Statistics Office, Hanoi
dhaughton@bentley.edu
The claim is sometimes made, as most recently in the Boston Globe (7/20/2003), that "race, sex and age drive ticketing" by police on Massachusetts roads. We use a database of speeding tickets and warnings obtained by the Globe to build a logistic model of who gets ticketed and who only gets warned. The model involves a rather complicated nonlinear function of speed and speed over the speed limit as well as some interactions, identified with the help of MARS (Multivariate Adaptive Regression Splines). In order to discuss the importance of the race, gender and age effects relative to the speed effects in the model, we propose a graphical method, since dividing the coefficients by the standard deviation of a variable as suggested in the literature is infeasible in a complicated model such as ours. In addition to the speed effects, we find a strong Hispanic effect, some age effects, and a moderate gender effect.
Modeling Repeated Binary Responses and Time-Dependent Missing Covariates with Application to a Tree with a Two-Year Periodicity in Flowering Intensity
Lan Huang*, Ming-Hui Chen*, Paul R. Neal+, and Gregory J. Anderson+
Department of Statistics* and Ecology and Evolutionary Biology+
University of Connecticut
lan@merlot.stat.uconn.edu
In this paper, we develop a novel modeling strategy for analyzing data with repeated binary responses over time as well as with time-dependent missing covariates. We use the generalized linear mixed model (GLMM) for the repeated binary responses. We then propose a joint model for time-dependent missing covariates using information from other sources. The proposed methodology is well motivated by a real application, namely, a study of Tilia americana (American basswood), a tree with a two-year periodicity in flowering intensity. The data consist of an index of flowering intensity collected from 1974 to 2002. The proposed methodology will be used to identify factors such as defoliation by gypsy moths (Lymantria dispar) and weather conditions that may disrupt the cyclical pattern of flowering.
Prediction of Co-Regulated Genes using Motif Discovery and Clustering
Shane Jensen
Harvard University
jensen@stat.harvard.edu
Genes are often regulated in living cells by proteins called transcription factors (TFs) that bind directly to short segments of DNA in close proximity to certain target genes. These short segments have a conserved appearance, which is called a motif. Statistical methods for motif discovery are briefly reviewed. We propose a Bayesian hierarchical clustering model for the common structure between a set of discovered motifs. This clustering model is implemented using a Gibbs sampling strategy, and several approaches to analyzing the clustering results are discussed. Techniques for motif discovery and motif clustering are used in combination to predict co-regulated genes in the bacterium Bacillus subtilis. Sequences from several closely related species were used to discover motifs conserved by evolution, and these conserved motifs were then used to cluster genes together into putative co-regulated groups. These predicted clusters are validated and examined in detail using several external measures of cell regulation.
Statistical methods for identifying differential gene-gene interaction patterns
Yinglei Lai, Baolin Wu, Liang Chen and Hongyu Zhao
Center for Statistical Genomics and Proteomics
Department of Epidemiology and Public Health
Yale University School of Medicine
yl335@email.med.yale.edu
To understand cancer mechanisms, it is important to explore molecular changes in cellular processes from the normal state to the cancerous state. In this study, we address statistical methods for identifying differential gene-gene interaction patterns in different cell states. For efficient pattern recognition, we extend the traditional F-statistic and obtain an Expected Conditional F-statistic, which systematically integrates statistical information about differences of locations and correlations. We also propose a statistical method for data transformation to eliminate the outlier problem. Our approach is applied to a microarray gene expression data set from a prostate cancer study.
For a gene of interest, our method can select other genes that have differential gene-gene interaction patterns with this gene in different cell states. Among the 10 most frequently selected genes are hepsin, GSTP1 and AMACR. These 3 genes were recently proposed to be associated with prostate cancer, but it is difficult to identify GSTP1 and AMACR by searching for differentially expressed genes. Using the tumor suppressor genes PTEN, RB1 and TP53, we identify 7 genes that also include hepsin, GSTP1 and AMACR. We show that genes associated with cancer may have differential gene-gene interaction patterns in different cell states. Our statistical approach is capable of discovering such patterns.
ESTIMATING VARIANCE-COVARIANCE STRUCTURE OF THE NONPARAMETRIC REGRESSION DATA WITH TIME SERIES ERRORS
Christian Dahl and Michael Levine
Purdue University
dahlc@mgmt.purdue.edu, mlevins@stat.purdue.edu
Univariate variance estimation in the context of nonparametric regression is by now a fairly extensively researched topic. Up until now, most of the work has been concerned with general properties of the resulting variance estimators, such as asymptotic minimaxity. Potential applications have not been extensively considered (as an exception, we want to mention the article "Efficient estimation of conditional variance functions in stochastic regression" by J. Fan and Q. Yao, first published in Biometrika in 1998).
We introduce a model where the data is heteroscedastic with time series based variance-covariance structure
Here $e_i = \Phi e_{i-1} + \varepsilon_i$ is a stationary AR(1) time series. The variance function $f(x)$ is defined on $[0,1]$ and satisfies very unobtrusive smoothness requirements. This model can be used to describe an exchange rate (with the extraneous covariate being, for example, the current interest rate) and is easily generalized to include the trend function $g(x)$. Our main goal is to construct consistent estimators of $f(x)$ and $\Phi$ that are easy to compute and possess good asymptotic properties. We introduce both estimators and then discuss their asymptotic properties and convergence rates.
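As a rough illustration of the AR(1) error recursion described above, the following Python sketch simulates a stationary AR(1) series and recovers its coefficient with the sample lag-1 autocorrelation. This is a textbook moment estimate, not the estimator proposed in the talk. The function names, the standard normal innovations, the burn-in length, and the value 0.6 are all assumptions chosen for illustration.

```python
import random


def simulate_ar1(phi, n, seed=0, burn_in=500):
    """Simulate n values of e_i = phi * e_{i-1} + eps_i with N(0, 1) innovations.

    The normal innovations and the burn-in length are illustrative assumptions.
    """
    if abs(phi) >= 1:
        raise ValueError("stationarity requires |phi| < 1")
    rng = random.Random(seed)
    e = 0.0
    series = []
    for i in range(burn_in + n):
        e = phi * e + rng.gauss(0.0, 1.0)
        if i >= burn_in:  # discard early values so the start at 0 is forgotten
            series.append(e)
    return series


def lag1_autocorrelation(series):
    """Sample lag-1 autocorrelation, a simple moment estimate of phi."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[i] - mean) * (series[i - 1] - mean) for i in range(1, n))
    den = sum((x - mean) ** 2 for x in series)
    return num / den


series = simulate_ar1(phi=0.6, n=5000, seed=1)
phi_hat = lag1_autocorrelation(series)
print(f"estimated AR(1) coefficient: {phi_hat:.3f}")
```

With a few thousand observations the estimate should land close to the value used in the simulation; in the heteroscedastic model of the talk the errors are not observed directly, which is what makes the estimation problem harder than this sketch suggests.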
A COMPARISON OF GOODNESS OF FIT TESTS FOR THE LOGISTIC GEE MODEL
Lingling Li, B.S.
Center for Biostatistics in AIDS Research
Department of Biostatistics
Harvard School of Public Health
lingling@sdac.harvard.edu
Scott Evans, Ph.D.
Center for Biostatistics in AIDS Research
Department of Biostatistics
Harvard School of Public Health
evans@sdac.harvard.edu
Generalized Estimating Equations have become a popular regression method for analyzing clustered binary data. Statistics to assess the goodness of fit of the fitted models have recently been developed including: statistics based on residuals; a statistic using groups based on ranked estimated probabilities; statistics based on covariate partitioning; and a classification statistic. However, evaluations and comparisons of these methods are limited. We discuss these methods and develop two additional statistics to evaluate goodness of fit. We evaluate the performance of each of the statistics with respect to Type I error rates and power in a simulation study. Guidance is provided regarding appropriate use of the statistics under various scenarios.
A Statistical Model for Multiple High-throughput Protein-Protein Interaction Assay Assessments
Junfeng Liu, Hongyu Zhao
Yale Center for Statistical Genomics and Proteomics
Division of Biostatistics
Department of Epidemiology and Public Health
Yale School of Medicine
junfeng.liu@yale.edu, hongyu.zhao@yale.edu
For a few well-known high-throughput yeast two-hybrid assays conducted in independent labs around the world, this article develops efficient and reliable algorithms to estimate the crucial true positive rate, false positive (negative) rate, coverage rate, and reliability. For arbitrary assay designs, EM algorithms are developed to obtain mode estimates with or without a true positive (negative) gold standard data set. A possibly improved model of the association structure is proposed and tested on mock data sets. Finally, we make comparisons with other assessment approaches in the current biological literature.
Bayesian inference on stochastic volatility under hidden semi-Markov models
Zhaohui Liu
Department of Statistics
University of Connecticut
Zhaohui.Liu@UConn.edu
In this paper we discuss stochastic volatility (SV) models with hidden semi-Markov regime switches. The models analyzed include a univariate SV model for financial return series and a bivariate model for both returns and transaction volumes. In addition to the AR structure of the volatility process, it is assumed that the volatility follows a semi-Markov regime-switching process. With this new modeling approach, the duration time in each state can follow a different distribution. Therefore the mean and variance of the volatility can vary across states, and the time the volatility process spends in each state will differ. These features considerably enhance the modeling capacity and better describe the underlying volatility process, which might be influenced by different economic forces.
Statistical inference will be carried out via MCMC. In the Bayesian inference framework, predictions can be easily computed. In addition, we will discuss model selection by the marginal likelihood method, the pseudo-Bayes factor, the prediction-based L-measure, and DIC.
TEST OF INDEPENDENCE BASED ON KENDALL'S PROCESS: TABULATING THE PERCENTILES OF CRAMER-VON MISES STATISTICS BY THE STURM-LIOUVILLE APPROACH
Kilani Ghoudi
United Arab Emirates University
ghoudi@uaeu.ac.ae
L'moudden Ahmed
Université de Sherbrooke
Québec, Canada
lmoudden@dmi.usherb.ca
Jean Vaillancourt
Université du Québec en Outaouais
Canada
vaillancourt@uqo.ca
Let $Z_1, \ldots, Z_n$ be $n \ge 2$ independent copies of a vector $Z = (Z^{(1)}, Z^{(2)})$ with distribution function $H(z)$. To test whether $H(z)$ has independent components, we use the Cramér-von Mises statistic based on the limit of Kendall's process, whose covariance function is explicitly known. Genest, Quessy and Rémillard (2002) calculated the percentiles of this statistic by simulation. We propose a very effective numerical approach to calculating these critical values, using the covariance function and the Sturm-Liouville differential equation.
GLOBAL HUMAN DEVELOPMENT: EXPLAINING ITS REGIONAL VARIATIONS*
Robert B. Smith
Social Structural Research Inc.
Cambridge, MA
rsmithphd @ aol.com
The United Nations Development Program (UNDP) ranks countries annually on its human development index (HDI), which combines a country's measures of longevity, literacy, and per capita income. This paper applies hierarchical modeling to quantify the factors that predict a country's HDI rank, explain the variability between regions, $\sigma^2_R$, and explain the variability between countries within a region, $\sigma^2_c$. It assesses the effects of nine civilizations: African, Buddhist, Hindu, Japanese, Latin, Moslem, Orthodox, Sinic, and Western. Civilization strongly predicts a country's rank on the HDI, but it does not provide the strongest causal explanation of the variability in the HDI quantified by $\sigma^2_R$ and $\sigma^2_c$. Among the covariates studied here, present-day slavery (debt bondage, forced labor, chattel slavery, and prostitution) and the lack of political freedom explain much of the variability that is between regions, and corruption explains much of the variability among countries within a region. Additionally, countries with high rates of conflict and social unrest and debt have significantly worse positions on the HDI. Civilizations are best viewed as pointers to underlying social mechanisms like women's education that more directly determine development; its advance may enhance development.
*Author's Note: With contributions by Kevin Bales, who provided several of his measures for analysis, and by Irina Koltoniuc, who helped prepare the analytic database. Helen Fein underscored the importance of slavery and Philip Gibbs of the SAS Institute clarified some of the nuances of PROC MIXED. Andy Baker, Stanley Guterman, and Sreemoti Mukerjee-Roy critiqued earlier drafts. The views expressed here are the author's and do not necessarily reflect the opinions or policies of the United Nations Development Program or any other organization.
Zero-inflated Poisson Regression Models
Lynn Kuo, Henry R. Kranzler, and Chang Hong Song
Department of Statistics
University of Connecticut
changhon@merlot.stat.uconn.edu
This paper applies the zero-inflated Poisson regression model with random effects to evaluating the effectiveness of the drug naltrexone for treatment of problem drinkers. Subjects were randomly assigned to four groups: daily placebo, targeted placebo (i.e., on a reduced schedule), daily naltrexone, and targeted naltrexone. Data were collected on alcohol consumption using structured nightly diaries. The outcome variable, the number of drinks per day, has excess zeros when fitted with a Poisson regression model. Therefore, we developed a longitudinal zero-inflated Poisson regression model to evaluate the treatment effect. The results indicate that the daily naltrexone treatment is the most effective, followed by targeted naltrexone and targeted placebo. The results also indicate that women drink much less than men, and that over time subjects tend to reduce their alcohol consumption. Daily data collection is receiving increased attention in clinical trials; this statistical approach provides a new method for the analysis of such data.
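To make the idea of "excess zeros" concrete, here is a minimal Python sketch of the zero-inflated Poisson (ZIP) probability mass function and log-likelihood. It deliberately omits the covariates, random effects, and longitudinal structure of the model in the talk. The parameter names `pi` (probability of a structural zero) and `lam` (Poisson mean) and the drink counts are hypothetical and chosen only for illustration.

```python
import math


def zip_pmf(k, pi, lam):
    """P(Y = k) for a zero-inflated Poisson.

    With probability pi the outcome is a structural zero; otherwise it is
    drawn from a Poisson distribution with mean lam.
    """
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson


def zip_log_likelihood(counts, pi, lam):
    """Log-likelihood of independent counts under a common ZIP(pi, lam)."""
    return sum(math.log(zip_pmf(k, pi, lam)) for k in counts)


# Made-up daily drink counts with many zeros (not data from the study).
drinks = [0, 0, 0, 2, 0, 5, 1, 0, 3, 0]

plain_poisson = zip_log_likelihood(drinks, pi=0.0, lam=1.1)  # pi = 0 is ordinary Poisson
zero_inflated = zip_log_likelihood(drinks, pi=0.5, lam=2.5)
print(plain_poisson, zero_inflated)
```

On these made-up counts the zero-inflated parameters give a higher log-likelihood than an ordinary Poisson with the sample mean, which is the kind of evidence that motivates a ZIP model.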
Hierarchical models with migration, mutation, and drift: implications for genetic inference
Seongho Song
Department of Statistics,
University of Connecticut
seongho@stat.uconn.edu
Dipak K. Dey
Department of Statistics,
University of Connecticut
dey@stat.uconn.edu
Kent E. Holsinger
Department of Ecology and Evolutionary Biology,
University of Connecticut
kent@darwin.eeb.uconn.edu
Hierarchical structure arises naturally in genetic models. Individuals belonging to the same population are more similar to one another than are those belonging to different populations. Using properties of moment stationarity, we develop exact expressions for the mean and covariance of allele frequencies at a single locus for the 2-level hierarchical structure subject to drift, mutation, and migration. For arbitrary mutation and migration matrices, we generalize previous results to a multilevel hierarchical model. Consequently, we have closed-form expressions for the mean and covariance of allele frequencies in Wright's finite-island model with constant hierarchical migration and several mutations. It turns out that the correlation among populations and among subpopulations, and the correlation between populations and subpopulations, vanish as the population and subpopulation sizes become large. We also discuss some implications of our results based on Wright's F-statistics as measures of population structure.
Keywords: F-Statistics; Finite-Island Model; Genetic Drift; Hierarchical Population Structure; Migration; Mutation
ASYMPTOTICALLY EFFICIENT ESTIMATION OF A SURVIVAL FUNCTION IN THE MISSING CENSORING INDICATOR MODEL
Sundar Subramanian
Department of Mathematics and Statistics
University of Maine
Orono
subraman@germain.umemat.maine.edu
We describe a new estimator of a survival function in the random censorship model when the censoring indicator is missing at random for some study subjects. The proposed approach appeals to a known representation for the survival function, expressible as a smooth functional of a certain conditional probability and the cumulative hazard function of the observed minimum. Well-known estimators are substituted into this representation, leading to a simple estimator of the survival function. The new estimator, whose asymptotic variance reduces to that of the Kaplan-Meier estimator when all the censoring indicators are observed, is asymptotically efficient.
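For reference, the classical Kaplan-Meier estimator mentioned above (the benchmark when every censoring indicator is observed) can be sketched in a few lines of Python. This is not the estimator proposed in the talk, which handles missing indicators; the function name and the toy data are assumptions for illustration.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier product-limit estimate with fully observed indicators.

    times  : observed minimum of failure time and censoring time per subject
    events : 1 if the failure was observed, 0 if the subject was censored
    Returns a list of (t, S(t)) pairs at each distinct observed failure time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect everyone whose observed time equals t (ties included).
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
    return curve


# Toy data: five subjects, two of them censored.
curve = kaplan_meier(times=[2, 3, 3, 5, 8], events=[1, 1, 0, 1, 0])
for t, s in curve:
    print(t, round(s, 3))
```

The sketch uses the usual convention that a subject censored at time t is still counted as at risk for failures occurring at t.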
Statistical Challenges in the Analysis of Mass Extinctions
Steve C. Wang
Department of Mathematics and Statistics
Swarthmore College
scwang@swarthmore.edu
Much of our knowledge of the history of life comes from the fossil record. However, the fossil record is notoriously incomplete; in fact, usually more data are missing than are observed. This incompleteness presents interesting challenges for paleontologists and statisticians. Here we describe approaches for modeling the incompleteness of the fossil record in the context of mass extinctions. These extinctions - such as the end-Cretaceous event in which the dinosaurs perished - have profoundly shaped the course of life on earth. To infer the causes of mass extinctions, it is important to estimate the times of extinction of the species involved. For instance, how can we determine if a set of species went extinct simultaneously or gradually? If they went extinct simultaneously, how can we estimate their common time of extinction? If they went extinct gradually, how long did the extinctions last? We will discuss methods for answering such questions that take into account the incompleteness of the fossil record.
A Two-Stage Nearest-Neighbor Classifier with Application to Microbial Source Tracking
Jayson D. Wilbur
Department of Mathematical Sciences
Worcester Polytechnic Institute
jwilbur@wpi.edu
In general, nearest-neighbor methods classify an object based on the group membership of the training observations within a certain neighborhood of the object in question. These methods share both the advantages and the disadvantages of other methods for distribution-free inference. In this talk a two-stage nearest-neighbor classifier is proposed which attempts to exploit the advantages of the (single-stage) nearest-neighbor classifier while simultaneously reducing the extent to which the classifier is overfit to the training data. This present work is motivated by the problem of microbial source tracking, which attempts to trace the source of bacterial pathogens in water resources using genetic fingerprints. Applications of the proposed methodology to real and simulated data will be presented as time permits.
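The single-stage rule that the talk takes as its starting point can be sketched as follows: classify a new object by majority vote among its k closest training observations. The two-stage classifier itself is not specified in the abstract, so it is not attempted here. The function name, the Euclidean distance, the choice k = 3, and the toy data are all assumptions for illustration.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    train : list of (features, label) pairs, features being numeric tuples
    query : numeric tuple with the same length as the training features
    Euclidean distance is an illustrative choice, not taken from the talk.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy training set with two well-separated groups (hypothetical labels).
train = [
    ((0.0, 0.0), "A"), ((0.0, 1.0), "A"), ((1.0, 0.0), "A"),
    ((5.0, 5.0), "B"), ((5.0, 6.0), "B"), ((6.0, 5.0), "B"),
]
print(knn_classify(train, (0.5, 0.5)))
print(knn_classify(train, (5.5, 5.5)))
```

Small values of k make the rule follow the training data very closely, which is the overfitting concern the abstract raises.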
Imputing Missing Data by Monotone Blocks
Yaming Yu
Harvard University
yu@stat.harvard.edu
The presence of missing data hinders analyses that may be performed by many different users. Multiple imputation (MI), proposed by Rubin (1976), is an effective methodology to handle the problem. The current state-of-the-art procedures for imputing missing data typically fit fully Bayesian models assuming some joint probability distribution for the underlying complete data. Although principled, joint modeling may not capture important relations among the variables. On the other hand, when the missing data pattern is monotone, we may impute missing data variable by variable in a sequential fashion. This method is principled and flexible; however, it only applies to missing data that conform to a monotone pattern.
We propose a new method, imputation by monotone blocks (IMB), to impute missing data for public-use databases. Here a set of conditional models is specified and the missing data are iteratively imputed and re-imputed based on these conditional models. At each step of the imputation a monotone block of missing data is updated. We investigate the frequency properties of this method (bias, interval length, and coverage probability for complete data statistics, etc.) by simulation and derive guidelines on good update strategies for use in practice.
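The notion of a monotone missing-data pattern can be made concrete with a short check. In the Python sketch below, a pattern is treated as monotone when the variables can be ordered so that, for every record, once one variable is missing all later variables are missing too. This illustrates only the definition, not the IMB procedure itself; the function names and the use of `None` for missing values are assumptions.

```python
def is_monotone(rows):
    """True if, in the given column order, no observed value follows a missing one."""
    for row in rows:
        seen_missing = False
        for value in row:
            if value is None:
                seen_missing = True
            elif seen_missing:
                return False
    return True


def monotone_order(rows):
    """Return a column order that makes the pattern monotone, or None.

    If such an order exists, the sets of records missing each variable are
    nested, so sorting columns by their number of missing values finds it.
    """
    n_cols = len(rows[0])
    missing_counts = [sum(row[j] is None for row in rows) for j in range(n_cols)]
    order = sorted(range(n_cols), key=lambda j: missing_counts[j])
    reordered = [[row[j] for j in order] for row in rows]
    return order if is_monotone(reordered) else None


monotone_rows = [[1, 2, 3], [1, 2, None], [1, None, None]]
scrambled_rows = [[None, 2, 3], [None, None, 3], [1, 2, 3]]
non_monotone_rows = [[1, None, 3], [1, 2, None]]

print(is_monotone(monotone_rows))
print(monotone_order(scrambled_rows))
print(monotone_order(non_monotone_rows))
```

A data set that fails this check is the situation the abstract addresses: its missing values cannot be imputed in a single sequential pass, which motivates updating one monotone block at a time.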
Statistical Methods for Discovering Differentially Expressed Genes in Replicated Microarray Experiments
Lynn Kuo, Fang Yu, and Yifang Zhao
University of Connecticut
lynn@stat.uconn.edu
fangyu@stat.uconn.edu
yifang@stat.uconn.edu
Several statistical methods are available for selecting differentially expressed genes from microarray data with replication. These methods include the Benjamini and Hochberg method for multiple comparisons, significance analysis of microarrays (SAM) by Tusher, Tibshirani, and Chu, the Bayesian t method by Baldi and Long, and the empirical Bayes semiparametric method by Newton. This paper reviews and compares these four methods. ROC curves are constructed to compare them based on three simulated data sets. We will discuss the results. The advantages of using the empirical Bayes semiparametric method are further illustrated by a real data set using Affymetrix microarrays.
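Of the four methods listed above, the Benjamini and Hochberg procedure is the simplest to state, and a minimal Python sketch is given here. It implements the standard step-up rule: sort the p-values, find the largest rank r with p_(r) <= r*q/m, and reject the r hypotheses with the smallest p-values. The function name, the level q = 0.05, and the p-values are assumptions for illustration; they are not results from the talk.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Indices of hypotheses rejected by the BH step-up procedure at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value falls under its threshold.
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Made-up p-values for ten genes.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p_values, q=0.05)
print(rejected)
```

Because the rule is step-up, a p-value that misses its own threshold can still be rejected when a larger p-value further down the sorted list meets its threshold.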
Equi-Energy Sampler With Applications to Mixture Model Simulation and Density of States Calculation
Qing Zhou
(Joint work with Samuel Kou and Wing H. Wong)
Harvard University
zhou@stat.harvard.edu
We introduce the Equi-Energy sampler (EE sampler), which is constructed based on a temperature ladder and a series of truncated energy functions. The sample space of the target distribution is partitioned into N ranges according to their energy levels, and a population of N distributions associated with each temperature and truncated energy function is simultaneously sampled. The samples generated from each distribution are stored separately in their corresponding energy ranges. One crucial step in the EE sampler is the step of Equi-Energy jump, which allows the sample in a given chain to jump to a configuration in the neighboring chains with similar energy level. The EE jump greatly improves the mixing of the system and decreases the sample autocorrelations. We will illustrate the power of the EE sampler through the simulation of multimodal distributions and show by examples that our method can be efficiently utilized to calculate the density of states and construct estimates for a range of problems.
Shifting paradigms: on the robustness of economic models to heavy-tailedness assumptions
Rustam Ibragimov
Department of Economics
Yale University
rustam.ibragimov@yale.edu
The structure of many models in economics and finance depends on majorization properties of convolutions of distributions. In this paper, we analyze robustness of these properties and the models based on them to heavy-tailedness assumptions. We show, in particular, that majorization properties of linear combinations of log-concavely distributed signals are reversed for very long-tailed distributions. As applications of the results, we study robustness of monotone consistency of the sample mean, value at risk analysis and the model of demand-driven innovation and spatial competition as well as that of optimal bundling strategies for a multiproduct monopolist in the case of an arbitrary degree of complementarity or substitutability among the goods. The implications of the models remain valid for not too heavy-tailed distributions. However, their main properties are reversed in the very thick-tailed setting.
Market Efficiency, Bayes' Rule and the Recency Bias
Steve Jordan
Ph.D. Candidate in Financial Economics
Yale School of Management
steven.jordan@yale.edu
There is a large psychological and behavioral literature that claims that Bayes' rule is not a valid model of the human decision process. The strongest evidence proposed in financial markets against Bayes' rule is long-term overreaction (De Bondt and Thaler (1985)) and excess volatility (Shiller (1981)). Under a random walk model of market efficiency, simulations support these claims. I propose a random walk with noise model as an intuitively appealing alternative for market efficiency. I demonstrate that, within a Bayesian framework, both of these market anomalies are consistent with and predicted by a random walk with noise, resolving some long-standing debates. The intuition is simple. Under a random walk, prices must adjust fully and immediately to new information. This constraint binds price behavior and causes violations of many observed market phenomena. However, under a random walk with noise, there is an extra degree of freedom, the signal-to-noise ratio, which breaks this overly restrictive constraint of immediate and full reaction. My model allows me to be the first to empirically estimate the signal-to-noise ratio in the market. I find a significant level of noise, with the variance of noise about 5 times that of information.
Some Remarks about the Methodology and the Mythology of Financial Markets
Andrew Lyasoff
Department of Mathematics and Statistics
Boston University
alyasoff@math.bu.edu
Many aspects of modern finance stem from the belief that, generally, one can think of daily rates of return from stocks as independent samples from one and the same (stock-specific) probability law. I will briefly review some of the key statistical tests that have been used in the past to justify this claim and will show with a concrete example that the results from these tests are just as compatible with time series that exhibit self-regulatory behavior and are therefore very different from random walks. Another common belief among some practitioners is that, unless the returns behave as random walks, there will be fortunes to be made on the stock market out of nothing. This is quite surprising, because the conditions under which markets allow or do not allow arbitrage have been studied extensively in the last 20 years and it has long been established that very general models for stock prices are arbitrage free. I will briefly review the principles from which such conditions can be derived and will propose an alternative to the random walk model for stock returns.
The future of the statistics profession depends on success in recruiting students to our subject. There will always be a need for statistics, of course, but there is a growing risk that the work of statistics will be taken over by people who think of themselves not as statisticians but as molecular biologists, market researchers, computer scientists, and others who specialize in any of the applied fields that use statistics.
What all these applications of statistics have in common, the glue that holds our universe together, is a reliance on mathematical models of uncertainty. Even as statistics has expanded the breadth and value of its achievements in applied areas, the core of our subject remains mathematical, and our future depends on attracting mathematically talented students. At most undergraduate colleges, the burden of doing that attracting falls to the mathematical statistics course, a course that has changed very little in half a century, despite extraordinary changes in the practice of statistics.
In my talk I will offer some criticisms of the traditional math stat course, and suggest some remedies and alternatives.
A common practice among quantitative financial analysts for dealing with nonstationary time series is to apply an exponentially declining weighting scheme to the data so that more distant observations are given less weight than more recent observations. We show that this practice of temporal averaging is incorrect for all but linear estimators, implying that variances, covariances, betas, Sharpe ratios, Value-at-Risk, and many other common financial statistics are incorrectly estimated with exponentially weighted time series. We propose an alternative approach to temporal averaging that yields unbiased estimators under the null hypothesis of independently and identically distributed observations, and which has attractive asymptotic properties under several stationary and nonstationary alternative hypotheses. Using this approach, we derive temporally averaged estimators for all the usual financial statistics and apply them to recent historical stock market data to demonstrate their empirical properties.
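As a small illustration of why a nonlinear statistic can go wrong under exponential weighting, the sketch below works through the variance. For independent, identically distributed observations with variance s2 and weights that sum to one, the weighted variance around the weighted mean has expectation s2 * (1 - sum of squared weights), so it is biased downward. The rescaling shown here is a standard textbook correction and is not claimed to be the estimator proposed in the talk; the function names and the decay value 0.94 are illustrative assumptions.

```python
def exp_weights(n, decay):
    """Normalized exponentially declining weights; index 0 is the most recent point."""
    raw = [decay ** k for k in range(n)]
    total = sum(raw)
    return [r / total for r in raw]


def ew_mean(xs, ws):
    """Weighted mean; a linear estimator, unbiased when the weights sum to one."""
    return sum(w * x for w, x in zip(ws, xs))


def ew_variance(xs, ws):
    """Weighted variance around the weighted mean; a nonlinear estimator."""
    m = ew_mean(xs, ws)
    return sum(w * (x - m) ** 2 for w, x in zip(ws, xs))


def iid_bias_factor(ws):
    """Under i.i.d. data, E[ew_variance] equals the true variance times this factor."""
    return 1.0 - sum(w * w for w in ws)


def ew_variance_corrected(xs, ws):
    """Weighted variance rescaled to be unbiased under the i.i.d. null."""
    return ew_variance(xs, ws) / iid_bias_factor(ws)


# Illustrative window of 250 observations with decay 0.94 (assumed values).
factor_example = iid_bias_factor(exp_weights(250, 0.94))
```

With equal weights (decay of 1) the correction reduces to the familiar n / (n - 1) adjustment of the sample variance.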
Utilizing historical patients in a clinical trial that became open-label
In a recent FDA trial, the drug under investigation was made commercially available before the end of the trial. Patients in the trial therefore had the option of going off trial protocol and obtaining the commercially available active therapy. When patients randomized to placebo switch to commercially available therapy, they cease to be controls in the usual sense: All measurements after a placebo control switches to active therapy are treated as missing. We propose a method to impute placebo controls' missing outcomes, as if they had stayed on placebo. There are two key phases to this process.
The first phase selects the historical patients who look as if they could have plausibly been enrolled in a similar randomized trial at some point in their observation history. There are two particular complications to this selection process. The first is missing covariate data, which is dealt with by estimating propensity scores using a general location model. The second complication is the need to define baseline for each of the historical patients.
The second phase, the imputation phase, involves three steps: first, fitting a Bayesian hierarchical regression model to data from untreated historical patients; second, incorporating information learned from the historical patients into a similar model for placebo controls; and third, using this model and observed on-protocol data to impute missing values for placebo controls who switched to active therapy. Once missing values have been multiply imputed, the completed data sets can be analyzed as planned, and their results combined in a straightforward way.
This session will consist of three talks. Jim MacDougall of Genzyme will first present an overview of the trial. The second talk, by Elizabeth Stuart, describes the process of selecting the historical patients. In the third talk, Samantha Cook will describe the imputation model used to impute the missing placebo outcomes.
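The abstract does not say how the results from the completed data sets are combined. A common choice, assumed here purely for illustration, is Rubin's combining rules for multiple imputation: average the point estimates, and add the average within-imputation variance to an inflated between-imputation variance. The function name and inputs below are hypothetical.

```python
from statistics import mean, variance


def combine_imputations(estimates, within_variances):
    """Combine per-data-set results using Rubin's rules (an assumed choice).

    estimates        : point estimate from each completed data set
    within_variances : squared standard error from each completed data set
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need at least two imputations with matching inputs")
    pooled = mean(estimates)               # average of the m point estimates
    within = mean(within_variances)        # average within-imputation variance
    between = variance(estimates)          # sample variance, divisor m - 1
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total


# Illustrative numbers only; they do not come from the trial.
pooled_example, total_example = combine_imputations(
    [1.0, 2.0, 3.0], [0.5, 0.5, 0.5]
)
```

The factor (1 + 1/m) accounts for using a finite number m of imputations.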