Department Seminar Series: Lihong Li, Multi-World Testing: Unbiased Offline Evaluation in Contextual Bandits
Abstract: Optimizing an interactive learning system against a predefined metric is hard, especially when the metric is computed from user actions (such as clicks and purchases). The key challenge is the counterfactual nature of the problem: in Bing, for example, any change to the search engine may result in different search result pages for the same query, but we normally cannot infer reliably from historical search logs how users would react to the new search results. To compare two systems on a target metric, one typically runs an A/B test on live users, much like a randomized clinical trial. While A/B tests have been very successful, they are unfortunately expensive and time-consuming.
Recently, offline evaluation (a.k.a. counterfactual analysis) of interactive learning systems, without the need for online A/B testing, has gained growing interest in both industry and the research community, with successes in several important applications. This approach effectively allows one to run (potentially infinitely) many A/B tests *offline* from historical logs, making it possible to estimate and optimize online metrics easily and inexpensively. In this talk, I will formulate the problem in the framework of contextual bandits, explain the basic techniques for unbiased offline evaluation as well as several improvements, and describe success stories in two important applications: personalized news recommendation and Web search. It is anticipated that this approach will find substantial use in many other learning problems, yielding greater offline experimental agility and improved online performance.
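To make the idea concrete, below is a minimal sketch of the standard inverse-propensity-scoring (IPS) estimator commonly used for unbiased offline evaluation in contextual bandits. It is an illustration of the general technique, not code from the talk: the log format (context, action, reward, logging probability) and the function names are assumptions for this example. The estimate is unbiased provided the logging policy assigns nonzero probability to every action the target policy might choose.

```python
def ips_estimate(log, target_policy):
    """Inverse-propensity-scoring estimate of a target policy's value.

    `log` is a list of (context, action, reward, prob) tuples, where
    `prob` is the probability with which the logging policy chose
    `action` in `context`.  Each logged reward is kept only when the
    target policy agrees with the logged action, and is reweighted by
    1/prob to correct for the logging policy's action distribution.
    """
    total = 0.0
    for context, action, reward, prob in log:
        if target_policy(context) == action:
            total += reward / prob
    return total / len(log)


# Hypothetical example: a uniform-random logging policy over two
# actions (prob = 0.5 each), and a target policy that always picks
# action 0.  Rewards here are made up purely for illustration.
example_log = [
    ("query_a", 0, 1.0, 0.5),
    ("query_a", 1, 0.0, 0.5),
    ("query_b", 0, 1.0, 0.5),
    ("query_b", 1, 0.0, 0.5),
]
value = ips_estimate(example_log, lambda context: 0)  # → 1.0
```

In practice, the "improvements" mentioned in the abstract (e.g., variance-reduction techniques) address the main weakness of plain IPS: its variance grows as the logging probabilities of the target policy's actions shrink.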