The application of machine learning to large datasets has become a vital component of many important and sophisticated software systems being built today. Such trained systems are frequently based on supervised learning tasks that require features: signals extracted from the data that distill complicated raw data objects into a small number of salient values. For example, a good feature for a search engine’s relevance ranker might be the number of times a query term appears in a given Web page. The success of a modern trained system depends substantially on the quality of its features.
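As a concrete illustration of the relevance-ranking feature mentioned above, a feature function of this kind might look like the following minimal sketch. The function name and the whitespace-and-punctuation tokenization are illustrative assumptions, not code from the talk.

```python
# Hypothetical example of a single feature function: count how many
# times a query term appears among a Web page's tokens.
import re

def term_frequency_feature(query_term: str, page_text: str) -> int:
    """Return the number of occurrences of query_term in page_text.

    Tokenization here (lowercased word characters) is an assumption
    chosen for simplicity; a production ranker would use a real
    tokenizer and likely normalize terms further.
    """
    tokens = re.findall(r"\w+", page_text.lower())
    return tokens.count(query_term.lower())
```

In a real pipeline, many such functions would be applied to each raw data object to produce the feature vector consumed by the learning algorithm.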
Unfortunately, feature engineering, the process of writing code that takes raw data objects as input and outputs feature vectors suitable for a machine learning algorithm, is tedious and time-consuming. Because "big data" inputs are so diverse, feature engineering is often a trial-and-error process requiring many small, iterative code changes; because the inputs are so large, each code change can entail a time-consuming data processing task, perhaps over every page in a Web crawl. In this talk I will describe a data-centric software system that accelerates feature engineering through intelligent input selection, optimizing the "inner loop" of the feature engineering process. The system yields feature evaluation speedups of up to 8x in some cases and reduces engineer wait times from 8 hours to 5 hours in others. Finally, our system obtains high-quality results even in the face of programming errors.
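The talk does not specify the selection algorithm, but the inner loop it optimizes can be sketched as follows: evaluate a candidate feature function over a chosen subset of inputs rather than the full corpus. In this sketch the subset is picked uniformly at random, a stand-in assumption for the system's intelligent input selection; the per-input error handling echoes the goal of producing useful results despite programming errors in feature code.

```python
# Minimal sketch of the feature-engineering inner loop: run a candidate
# feature function over a subset of inputs instead of the whole corpus.
import random

def evaluate_feature(feature_fn, corpus, sample_size=1000, seed=0):
    """Apply feature_fn to a sampled subset of corpus.

    Random sampling is an illustrative placeholder for the system's
    input-selection strategy. Exceptions are caught per input, so a
    buggy feature function still yields results on the inputs it
    handles correctly.
    """
    rng = random.Random(seed)
    sample = rng.sample(list(corpus), min(sample_size, len(corpus)))
    results, errors = [], 0
    for item in sample:
        try:
            results.append(feature_fn(item))
        except Exception:
            errors += 1  # tolerate programming errors in feature code
    return results, errors
```

An engineer would iterate on `feature_fn`, re-running this loop after each edit; shrinking the evaluated input set is what turns an hours-long wait into a much faster trial-and-error cycle.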