Oral Prelim: Shrijita Bhattacharya, Statistics on data streams with applications to the analysis high volume network flows
The security and management of traffic in modern fast computer networks involves the rapid analysis of large volumes of data. Such types of data, referred to as data streams cannot be analyzed with conventional statistical tools that often require storing and post-processing the entire data set. Data streams can only be accessed sequentially under relatively stringent computing time and space constraints. This poses novel challenges to the types of statistics that can be used in this setting.
In this work, we start by addressing the canonical problem of the detection and estimation of the number of high activity entities, also called heavy hitters in a data stream. The problem is motivated by the detection of anomalies such as the onset of denial of service attacks in computer networks. To handle the paucity of memory relative to the size of the data stream, we use data structures derived from pseudo-random hash functions. We develop statistics that can be efficiently computed on fast data streams and allow one to track the number of heavy users in real time. We provide estimates on the accuracy of these estimators and identify their optimal regime with respect to tuning parameters. In the future, we plan to extend these techniques to the on-line analysis of multiple data streams. Our long term goal is to study the global structure anddynamics of network flows in real time, which could help network engineers optimize and manage traffic.