We consider the problem of variable selection in the very challenging regime where the signals are rare and weak and the columns of the design matrix are heavily correlated. We demonstrate that in the presence of rare/weak signals, many classical methods and ambitious contemporary algorithms face pitfalls. The situation worsens in the presence of heavy correlations among design variables.
We propose a new variable selection approach, which we call Covariance Assisted Screening and Estimation (CASE). CASE is a multi-stage multivariate Screen and Clean algorithm with two layers of innovation. In the first layer, we alleviate the heavy correlations among the design variables by linear filtering and use the post-filtering model to construct a sparse graph. In the second layer, we use the sparse graph to guide both the screening and the cleaning.
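As a toy illustration of the first layer only, the following sketch shows how a linear filter can remove heavy correlation among design columns. It is not the construction used by CASE: the step-function design, the first-difference filter, the cosine-similarity threshold, and all function names (`step_design`, `first_difference`, `correlation_graph`) are assumptions chosen to keep the example small.

```python
import numpy as np

def step_design(n):
    """Hypothetical change-point design: column j is 0 before row j and 1 from row j on."""
    return np.tril(np.ones((n, n)))

def first_difference(n):
    """Filter matrix D with (D @ v)[0] = v[0] and (D @ v)[i] = v[i] - v[i-1]."""
    return np.eye(n) - np.eye(n, k=-1)

def correlation_graph(X, threshold=0.1):
    """Adjacency matrix linking columns whose absolute cosine similarity exceeds threshold."""
    norms = np.linalg.norm(X, axis=0)
    C = (X.T @ X) / np.outer(norms, norms)
    A = np.abs(C) > threshold
    np.fill_diagonal(A, False)  # ignore self-links
    return A

n = 50
X = step_design(n)
D = first_difference(n)
X_filtered = D @ X  # design matrix of the post-filtering model (assumed form)

# Each undirected edge appears twice in the symmetric adjacency matrix.
edges_before = int(correlation_graph(X).sum()) // 2
edges_after = int(correlation_graph(X_filtered).sum()) // 2
print(edges_before, edges_after)
```

In this toy case the filtered design `D @ X` is exactly the identity, so every pair of columns is linked before filtering and none is linked afterwards. In less idealized settings one would expect the post-filtering graph to be sparse rather than empty.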
We explain how CASE overcomes the well-known computational hurdle of multivariate screening. We also explain how CASE overcomes the so-called challenge of "signal cancellation", so its success is not tied to strong signals or to any type of incoherence/irrepresentable condition.
We set up a theoretical framework in which we show that CASE attains the optimal rate of convergence in terms of Hamming errors. We have successfully applied CASE to a long-memory time series and a change-point model, where the optimality is further investigated using the notion of a phase diagram.