Referring Layer Decomposition
Teaser: original image, user prompt, and the layered decomposition result.
Abstract
Precise, object-aware control over visual content is essential for advanced image editing and compositional generation. Yet most existing approaches operate on entire images holistically, limiting the ability to isolate and manipulate individual scene elements. In contrast, layered representations, where scenes are explicitly separated into objects, environmental context, and visual effects, provide a more intuitive and structured framework for interpreting and editing visual content. To bridge this gap and enable both compositional understanding and controllable editing, we introduce the Referring Layer Decomposition (RLD) task, which predicts complete RGBA layers from a single RGB image, conditioned on flexible user prompts such as spatial inputs (e.g., points, boxes, masks), natural language descriptions, or combinations thereof. At its core is RefLade, a large-scale dataset comprising 1.11M image–layer–prompt triplets produced by our scalable data engine, along with 100K manually curated, high-fidelity layers. Coupled with a perceptually grounded, human-preference-aligned automatic evaluation protocol, RefLade establishes RLD as a well-defined and benchmarkable research task. Building on this foundation, we present RefLayer, a simple baseline for prompt-conditioned layer decomposition that achieves high visual fidelity and semantic alignment. Extensive experiments show that our approach enables effective training, reliable evaluation, and high-quality image decomposition, while exhibiting strong zero-shot generalization.
RefLade: Data Engine, Dataset and Evaluation Protocol
Data Engine
We introduce a scalable and modular automated data engine capable of generating diverse, realistic, and high-fidelity RGBA layers from natural images at scale.
Overview of the data engine
Dataset
Image source
| Dataset | Task | # Images | Avg. Resolution | # Cls | # Instances | Occlusion Rate | Image Source |
|---|---|---|---|---|---|---|---|
| SAIL-VOS | Amodal | 111,654 | 800×1280 | 162 | 1,896,296 | 56.3% | Synthetic |
| OVD | Amodal | 34,100 | 500×375 | 196 | - | - | Real |
| WALT | Amodal | 15M | - | 2 | 36M | - | Real |
| AHP | Amodal | 56,599 | - | 1 | 56,599 | - | Real |
| DYCE | Amodal | 5,500 | 1000×1000 | 79 | 85,975 | 27.7% | Real |
| OMLD | Amodal | 13,000 | 384×512 | 40 | - | - | Synthetic |
| CSD | Amodal | 11,434 | 512×512 | 40 | 129,336 | 26.3% | Synthetic |
| MuLAn | LD | 44,860 | - | 759 | 101,269 | 7.7% | Real |
| RefLade | RLD | 430,488 | 1831×1437 | 12K | 871,829 | 60.8% | Real |
Comparison of RefLade with related existing datasets
Instance distribution of RefLade Dataset
Evaluation Protocol
The Human Preference Aligned (HPA) Score
Following human judgment, we evaluate the decomposition quality from three aspects:
Aspect 1: Preservation. Preserving original visible content.
\[\mathcal{S}_{\text{vis}} = \mathbb{E}_{(p, g) \sim \mathcal{D}} [ \text{LPIPS}(g_{\text{rgb}} \odot g_v,\, p_{\text{rgb}} \odot g_v) ] \]
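The preservation term can be sketched as follows. To keep the sketch self-contained we substitute a simple masked mean-squared distance for the actual LPIPS network (in practice one would use a pretrained perceptual model, e.g. from the `lpips` package); `lpips_distance` is therefore only a placeholder:

```python
import numpy as np

def lpips_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Placeholder for a perceptual LPIPS distance; a real implementation
    would run both images through a pretrained network."""
    return float(np.mean((a - b) ** 2))

def preservation_score(pred_rgb, gt_rgb, gt_visible_mask):
    """S_vis for one (prediction p, ground truth g) pair: compare the two
    RGB layers only inside the ground-truth visible region g_v."""
    m = gt_visible_mask[..., None]          # (H, W, 1), broadcast over RGB
    return lpips_distance(gt_rgb * m, pred_rgb * m)

# Toy check: identical layers yield zero distance.
rgb = np.random.rand(8, 8, 3)
mask = np.ones((8, 8))
print(preservation_score(rgb, rgb, mask))  # 0.0
```

Lower values mean better preservation of the visible content; the dataset-level score averages this over all pairs.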
Aspect 2: Completion. Generating reasonable completions for the occluded regions.
\[\mathcal{S}_{\text{gen}} = \mathbb{E}_{(p, g) \sim \mathcal{D}} \left[\cos\left( f(g_{\text{rgb}}) - f(g_{\text{rgb}} \odot g_v), \,f(p_{\text{rgb}}) - f(g_{\text{rgb}} \odot g_v) \right)\right]\]
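This term measures whether the predicted completion moves in the same feature-space direction as the ground-truth completion, relative to the visible-only content. A minimal sketch follows, where `f` is a stand-in feature extractor (a plain flatten, purely for illustration; the actual metric would use a pretrained perceptual encoder):

```python
import numpy as np

def f(img: np.ndarray) -> np.ndarray:
    """Stand-in feature extractor; only for illustration."""
    return img.reshape(-1)

def completion_score(pred_rgb, gt_rgb, gt_visible_mask, eps=1e-8):
    """S_gen for one pair: cosine similarity between the feature offsets
    from the visible-only ground truth to the full ground truth (u) and
    to the full prediction (v), matching the equation above."""
    m = gt_visible_mask[..., None]
    anchor = f(gt_rgb * m)                  # features of visible region only
    u = f(gt_rgb) - anchor                  # ground-truth completion direction
    v = f(pred_rgb) - anchor                # predicted completion direction
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

rgb = np.random.rand(4, 4, 3)
mask = (np.arange(16).reshape(4, 4) % 2).astype(float)  # partial visibility
print(round(completion_score(rgb, rgb, mask), 6))  # a perfect prediction scores ~1
```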
Aspect 3: Faithfulness. The distributional similarity between predictions and ground-truth layers.
\[\hat{p} = p_{\text{rgb}} \odot p_a + i_{\text{bkgd}} \odot (1 - p_a), \quad \hat{g} = g_{\text{rgb}} \odot g_a + i_{\text{bkgd}} \odot (1 - g_a)\] \[\mathcal{S}_{\text{fid}} = \text{FID}\left( \left\{ \hat{p} \mid p \in \mathcal{D} \right\}, \left\{ \hat{g} \mid g \in \mathcal{D} \right\} \right)\]
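The compositing step in the faithfulness metric is standard alpha blending of each layer over a shared background image before the two sets of composites are compared with FID. The blending can be sketched as below (the FID computation itself, e.g. via `torchmetrics` or `pytorch-fid`, is omitted):

```python
import numpy as np

def composite(layer_rgb, layer_alpha, background):
    """Alpha-composite an RGBA layer over a background, matching
    p_hat = p_rgb * p_a + i_bkgd * (1 - p_a)."""
    a = layer_alpha[..., None]              # (H, W, 1), broadcast over RGB
    return layer_rgb * a + background * (1.0 - a)

layer = np.random.rand(4, 4, 3)
bg = 0.5 * np.ones((4, 4, 3))
opaque = np.ones((4, 4))
# With alpha = 1 everywhere the composite equals the layer itself;
# with alpha = 0 it reduces to the background.
print(np.allclose(composite(layer, opaque, bg), layer))  # True
```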
Aggregation. We apply min-max normalization to each metric and then average them to produce the final HPA score.
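The aggregation can be sketched as follows. Note one assumption on our part: since $\mathcal{S}_{\text{vis}}$ (LPIPS-based) and $\mathcal{S}_{\text{fid}}$ (FID-based) are lower-is-better while HPA is reported as higher-is-better, the sketch inverts those two after normalization; the text above does not state the orientation convention explicitly:

```python
import numpy as np

def hpa_score(svis, sgen, sfid):
    """Min-max normalize each metric across the compared models, orient all
    three so higher is better (assumption: S_vis and S_fid are inverted),
    and average them into the final HPA score per model."""
    def norm(x, higher_is_better):
        x = np.asarray(x, dtype=float)
        n = (x - x.min()) / (x.max() - x.min() + 1e-12)
        return n if higher_is_better else 1.0 - n
    return (norm(svis, False) + norm(sgen, True) + norm(sfid, False)) / 3.0
```

For example, with two models scoring the extremes on every metric, the model with the best (lowest) S_vis and S_fid and the best (highest) S_gen receives the higher HPA.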
Alignment with Human Preference
Human ELO vs. the HPA Score
| Metric | HPA | S_vis | S_gen | S_fid | S_vis + S_gen | S_fid + S_gen | S_vis + S_fid |
|---|---|---|---|---|---|---|---|
| Pearson correlation | 0.96 | 0.90 | 0.96 | 0.94 | 0.95 | 0.95 | 0.94 |
| Spearman correlation | 1 | 0.60 | 0.98 | 0.67 | 0.92 | 0.97 | 1 |
Pearson and Spearman correlations with human ELO across different metrics
RefLayer: A Baseline Model
RefLayer model architecture.
| Dataset | # Layers | FG HPA ↑ | FG FID ↓ | FG LPIPS ↓ | FG DIR ↑ | BG HPA ↑ | BG FID ↓ | BG LPIPS ↓ | BG DIR ↑ |
|---|---|---|---|---|---|---|---|---|---|
| MuLAn | 50K | 0.3852 | 22.68 | 0.1403 | 0.2031 | 0.3459 | 21.84 | 0.1588 | 0.6385 |
| RefLade | 50K | 0.4629 | 10.98 | 0.1411 | 0.2543 | 0.5932 | 16.87 | 0.0520 | 0.7206 |
| RefLade | 100K | 0.4621 | 11.27 | 0.1428 | 0.2589 | 0.5935 | 16.73 | 0.0530 | 0.7213 |
| RefLade | 200K | 0.4631 | 10.99 | 0.1434 | 0.2547 | 0.5461 | 19.84 | 0.0552 | 0.6950 |
| RefLade | 400K | 0.4678 | 10.66 | 0.1404 | 0.2575 | 0.5792 | 18.36 | 0.0493 | 0.7129 |
| RefLade | 1M | 0.4685 | 11.10 | 0.1377 | 0.2561 | 0.5587 | 17.35 | 0.0730 | 0.7190 |
| RefLadeQ | 100K | 0.4698 | 10.60 | 0.1378 | 0.2531 | 0.6657 | 12.99 | 0.0487 | 0.7721 |
| RefLade+Q | 1.1M | 0.4813 | 10.50 | 0.1330 | 0.2652 | 0.6682 | 13.14 | 0.0437 | 0.7673 |
Benchmarking RefLayer across different training sets and scales. Results are reported on the RefLade test set with multimodal text + box prompts.
Qualitative Results
Model Comparison
We compare our RefLayer, trained on the RefLade dataset, with the same model trained on the MuLAn dataset and with Google Gemini 3 (Nano Banana Pro). Our model generally produces higher-quality predictions. The latest general-purpose generative model (Nano Banana Pro) shows limitations in preservation and completion, and cannot produce true RGBA images with an alpha channel.
Comparison between RefLayer trained on the MuLAn dataset and the same model trained on our RefLade dataset
Comparison between RefLayer and Google Gemini 3 (Nano Banana Pro)
BibTeX
@inproceedings{rld,
title = {Referring Layer Decomposition},
author = {Chen, Fangyi and Shen, Yaojie and Xu, Lu and Yuan, Ye and Zhang, Shu and Niu, Yulei and Wen, Longyin},
booktitle = {ICLR 2026 (Virtual)},
year = {2026},
url = {https://iclr.cc/virtual/2026/poster/10011003}
}