Statistics Department Seminar Series: Yixin Wang, Assistant Professor, Department of Statistics, University of Michigan.
"Causal Inference for Unstructured Data"
Abstract: Causal inference traditionally relies on tabular data, where treatments, outcomes, and covariates are manually collected and labeled. However, many real-world problems involve unstructured data—images, text, and videos—where treatments or outcomes are high-dimensional and unstructured, or all causal variables are hidden within the unstructured observations. This talk explores causal inference in such settings.
We begin with cases where all causal variables (including treatments, outcomes, and covariates) are hidden in unstructured observations. These causal problems require a crucial first step: extracting high-level latent causal factors from raw unstructured inputs. We develop algorithms to identify these factors. While traditional methods often assume statistical independence, causal factors are often correlated or causally connected. Our key observation is that, despite correlations, the causal connections (or the lack thereof) among factors leave geometric signatures in the latent factors' support, that is, the ranges of values each can take. These signatures allow us to provably identify latent causal factors from passive observations, interventions, or multi-domain datasets (up to different transformations).
Next, we tackle cases where unstructured data itself serves as either the treatment or the outcome. In these cases, standard causal queries like the average treatment effect (ATE) are not suitable, since subtracting one text, image, or video outcome from another is meaningless. High-dimensional unstructured treatments also challenge the overlap assumption required for causal identification. To address these challenges, we propose new causal queries: for unstructured outcomes, we pinpoint outcome features most affected by the treatment; for unstructured treatments, we identify influential treatment features driving outcome differences. Finally, we extend these ideas to decision-making algorithms, such as optimizing natural language actions for desired outcomes.
https://yixinwang.github.io/
Building: | West Hall |
---|---|
Event Type: | Workshop / Seminar |
Tags: | seminar |
Source: | Happening @ Michigan from Department of Statistics, Department of Statistics Seminar Series |