Statistics Department Hosted Seminar Series by Professor Yves Atchade: Saharon Rosset, Quality Preserving Databases: Statistically Sound and Efficient Use of Public Databases for an Infinite Sequence of Tests
Professor Yves Atchade is hosting a bi-weekly seminar series called "Statistical Computing," which will discuss how statistical methods are implemented and explore computational techniques with potential applications in statistics.
Large databases whose usage is open to the scientific community to facilitate research are becoming commonplace, especially in Biology and Genetics. The emerging scenario in which a community of researchers sequentially conducts multiple statistical tests on one shared database gives rise to major multiple hypothesis testing issues. It is often hard to control false discovery in the presence of unpredictable and sequential use, and existing tools are very limited. We suggest a scheme, which we term Quality Preserving Database (QPD), for controlling false discovery without any loss of power by adding new samples for each use of the database and charging the user for the expense. The crux of the scheme is a carefully crafted pricing system that fairly prices different user requests based on their demands while controlling false discovery. The statistical problem encountered is one of defining appropriate measures of false discovery that can be controlled sequentially, and designing methodologies that can control them in the context of QPD. We describe a simple QPD implementation based on controlling the family-wise error rate using a method called alpha-spending, and a more involved implementation based on controlling a measure called mFDR, using an approach we term generalized alpha investing. We derive the favorable statistical properties of generalized alpha investing variants in general, and in the context of QPD in particular. The variant we implement can guarantee infinite use of a public database while preserving power, with very low costs, or even no costs under some realistic assumptions. We demonstrate this idea in simulations and describe its potential application to several real-life setups.
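To illustrate the alpha-spending idea mentioned in the abstract, here is a minimal sketch, not the speakers' implementation. It assumes a geometric spending schedule (the function names and the `decay` parameter are hypothetical, chosen for illustration) in which the i-th test receives its own significance level and the levels sum to at most the overall alpha, so the family-wise error rate stays bounded for any number of sequential tests.

```python
def alpha_spending_levels(alpha, n_tests, decay=0.5):
    """Hypothetical geometric schedule: test i (0-based) gets
    alpha * (1 - decay) * decay**i. The infinite sum of these
    levels equals alpha, so any finite prefix sums to less."""
    return [alpha * (1 - decay) * decay**i for i in range(n_tests)]


def run_sequential_tests(p_values, alpha=0.05, decay=0.5):
    """Reject hypothesis i when its p-value is at or below the
    level allotted to it by the spending schedule."""
    levels = alpha_spending_levels(alpha, len(p_values), decay)
    return [p <= level for p, level in zip(p_values, levels)]


if __name__ == "__main__":
    # Three users query the database in turn, each with p = 0.01.
    print(run_sequential_tests([0.01, 0.01, 0.01]))
```

In this sketch the per-test level shrinks with every query, so later tests have less power on a fixed sample. That is the loss the abstract says QPD offsets by adding new samples for each use of the database.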
Joint work with Ehud Aharoni and Hani Neuvirth of IBM Research.