Referring Layer Decomposition

Intelligent Editing Team, Intelligent Creation, ByteDance Inc.

Teaser figure: given an original image and a user prompt, RefLayer produces the layered result.

Abstract

Precise, object-aware control over visual content is essential for advanced image editing and compositional generation. Yet most existing approaches operate on entire images holistically, limiting the ability to isolate and manipulate individual scene elements. In contrast, layered representations, where scenes are explicitly separated into objects, environmental context, and visual effects, provide a more intuitive and structured framework for interpreting and editing visual content. To bridge this gap and enable both compositional understanding and controllable editing, we introduce the Referring Layer Decomposition (RLD) task, which predicts complete RGBA layers from a single RGB image, conditioned on flexible user prompts such as spatial inputs (e.g., points, boxes, masks), natural-language descriptions, or combinations thereof. At its core is RefLade, a large-scale dataset comprising 1.11M image–layer–prompt triplets produced by our scalable data engine, along with 100K manually curated, high-fidelity layers. Coupled with a perceptually grounded, human-preference-aligned automatic evaluation protocol, RefLade establishes RLD as a well-defined and benchmarkable research task. Building on this foundation, we present RefLayer, a simple baseline for prompt-conditioned layer decomposition that achieves high visual fidelity and semantic alignment. Extensive experiments show that our approach enables effective training, reliable evaluation, and high-quality image decomposition, while exhibiting strong zero-shot generalization.

RefLade: Data Engine, Dataset and Evaluation Protocol

Data Engine

We introduce a scalable and modular automated data engine capable of generating diverse, realistic, and high-fidelity RGBA layers from natural images at scale.

Overview of the data engine

Dataset

Image source

Dataset     Task     # Images   Avg. Resolution   # Cls   # Instances   Occlusion Rate   Image Source
SAIL-VOS    Amodal    111,654   800×1280            162     1,896,296   56.3%            Synthetic
OVD         Amodal     34,100   500×375             196             -   -                Real
WALT        Amodal        15M   -                     2           36M   -                Real
AHP         Amodal     56,599   -                     1        56,599   -                Real
DYCE        Amodal      5,500   1000×1000            79        85,975   27.7%            Real
OMLD        Amodal     13,000   384×512              40             -   -                Synthetic
CSD         Amodal     11,434   512×512              40       129,336   26.3%            Synthetic
MuLAn       LD         44,860   -                   759       101,269   7.7%             Real
RefLade     RLD       430,488   1831×1437           12K       871,829   60.8%            Real

Comparison of RefLade with related existing datasets

Instance distribution of RefLade Dataset

Evaluation Protocol

The Human Preference Aligned (HPA) Score

Following human judgment, we evaluate the decomposition quality from three aspects:

Aspect 1: Preservation. Preserving original visible content.

\[\mathcal{S}_{\text{vis}} = \mathbb{E}_{(p, g) \sim \mathcal{D}} [ \text{LPIPS}(g_{\text{rgb}} \odot g_v,\, p_{\text{rgb}} \odot g_v) ] \]
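The preservation term can be sketched in a few lines of numpy. Here mean-squared error is used purely as a stand-in for LPIPS (which requires a pretrained network); the masking logic is the point of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def preservation_score(pred_rgb, gt_rgb, gt_visible_mask):
    """Toy stand-in for S_vis: compare prediction and ground truth only
    inside the ground-truth visible region g_v. Mean-squared error
    replaces LPIPS here purely for illustration; the actual metric
    uses a pretrained LPIPS network."""
    mask = gt_visible_mask[..., None]          # broadcast mask over RGB channels
    sq_diff = (pred_rgb * mask - gt_rgb * mask) ** 2
    n_visible = mask.sum() * 3                 # visible pixels × channels
    return float(sq_diff.sum() / max(n_visible, 1))

rgb = rng.random((8, 8, 3))
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0
print(preservation_score(rgb, rgb, mask))      # → 0.0
```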

Aspect 2: Completion. Generating reasonable completions for the occluded regions.

\[\mathcal{S}_{\text{gen}} = \mathbb{E}_{(p, g) \sim \mathcal{D}} \left[\cos\left( f(g_{\text{rgb}}) - f(g_{\text{rgb}} \odot g_v), \,f(p_{\text{rgb}}) - f(g_{\text{rgb}} \odot g_v) \right)\right]\]
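The completion term compares feature-space displacements. A minimal numpy sketch, with plain vectors standing in for the features of a semantic image encoder f(·):

```python
import numpy as np

def completion_score(feat_gt_full, feat_gt_visible, feat_pred_full):
    """Toy version of S_gen: cosine similarity between the feature
    displacement added by the ground-truth completion,
    f(g_rgb) - f(g_rgb ⊙ g_v), and the one added by the prediction,
    f(p_rgb) - f(g_rgb ⊙ g_v)."""
    d_gt = feat_gt_full - feat_gt_visible
    d_pred = feat_pred_full - feat_gt_visible
    denom = np.linalg.norm(d_gt) * np.linalg.norm(d_pred)
    return float(np.dot(d_gt, d_pred) / denom) if denom > 0 else 0.0

vis = np.zeros(3)
gt = np.array([1.0, 0.0, 0.0])
print(completion_score(gt, vis, 2 * gt))   # same direction → 1.0
```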

Aspect 3: Faithfulness. The distributional similarity between predictions and ground-truth layers.

\[\hat{p} = p_{\text{rgb}} \odot p_a + i_{\text{bkgd}} \odot (1 - p_a), \quad \hat{g} = g_{\text{rgb}} \odot g_a + i_{\text{bkgd}} \odot (1 - g_a)\] \[\mathcal{S}_{\text{fid}} = \text{FID}\left( \left\{ \hat{p} \mid p \in \mathcal{D} \right\}, \left\{ \hat{g} \mid g \in \mathcal{D} \right\} \right)\]
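The compositing step used before computing FID is plain alpha blending onto a shared background i_bkgd; a minimal numpy sketch:

```python
import numpy as np

def composite_on_background(rgb, alpha, bkgd):
    """Alpha-blend an RGBA layer onto a fixed background image, as in
    the S_fid compositing step: hat = rgb ⊙ a + bkgd ⊙ (1 - a).
    rgb and bkgd have shape (H, W, 3); alpha has shape (H, W) in [0, 1]."""
    a = alpha[..., None]                # broadcast alpha over channels
    return rgb * a + bkgd * (1.0 - a)

fg = np.ones((2, 2, 3))
bg = np.zeros((2, 2, 3))
print(composite_on_background(fg, np.full((2, 2), 0.5), bg)[0, 0])  # → [0.5 0.5 0.5]
```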

Aggregation. We apply min-max normalization to each metric and then average them to produce the final HPA score.
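The aggregation can be sketched as follows. Inverting the lower-is-better metrics (LPIPS-based S_vis and FID-based S_fid) after normalization is our assumption, consistent with their definitions; S_gen, a cosine similarity, is higher-is-better.

```python
import numpy as np

def hpa_score(s_vis, s_gen, s_fid):
    """Toy HPA aggregation over a set of models: min-max normalize each
    metric across models, then average. Orientation handling (flipping
    the lower-is-better metrics) is an illustrative assumption."""
    def minmax(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)

    vis = 1.0 - minmax(s_vis)   # lower LPIPS is better
    gen = minmax(s_gen)         # higher cosine similarity is better
    fid = 1.0 - minmax(s_fid)   # lower FID is better
    return (vis + gen + fid) / 3.0

print(hpa_score([0.1, 0.2], [0.9, 0.5], [10.0, 20.0]))  # → [1. 0.]
```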

Alignment with Human Preference

Human ELO vs. the HPA Score

Metric                  HPA    Svis   Sgen   Sfid   Svis+Sgen   Sfid+Sgen   Svis+Sfid
Pearson correlation     0.96   0.90   0.96   0.94   0.95        0.95        0.94
Spearman correlation    1.00   0.60   0.98   0.67   0.92        0.97        1.00

Pearson and Spearman correlations with human ELO across different metrics


RefLayer: A Baseline Model

To establish a baseline, we formulate RLD as a conditional image generation problem and employ two decoders—a standard RGB decoder and a custom alpha decoder—to reconstruct the RGB content and the alpha transparency mask from the latent representation.
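The dual-decoder readout above can be sketched as a toy model, with random linear maps standing in for the actual RGB and alpha decoders; all shapes and names here are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyLayerDecoder:
    """Minimal sketch of the dual-decoder readout: one head maps the
    shared latent to RGB content, a separate head maps the same latent
    to an alpha mask. Random linear maps stand in for the real
    decoders; shapes and names are illustrative."""
    def __init__(self, latent_dim=16, hw=4):
        self.hw = hw
        self.w_rgb = rng.standard_normal((latent_dim, hw * hw * 3))
        self.w_alpha = rng.standard_normal((latent_dim, hw * hw))

    def decode(self, z):
        rgb = (z @ self.w_rgb).reshape(self.hw, self.hw, 3)
        # A sigmoid keeps the predicted alpha mask in [0, 1].
        alpha = 1.0 / (1.0 + np.exp(-(z @ self.w_alpha)))
        return rgb, alpha.reshape(self.hw, self.hw)

z = rng.standard_normal(16)
rgb, alpha = ToyLayerDecoder().decode(z)
print(rgb.shape, alpha.shape)   # → (4, 4, 3) (4, 4)
```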

RefLayer model architecture.

                         Foreground                           Background
Dataset      # Layers    HPA ↑    FID ↓   LPIPS ↓   DIR ↑     HPA ↑    FID ↓   LPIPS ↓   DIR ↑
MuLAn        50K         0.3852   22.68   0.1403    0.2031    0.3459   21.84   0.1588    0.6385
RefLade      50K         0.4629   10.98   0.1411    0.2543    0.5932   16.87   0.0520    0.7206
RefLade      100K        0.4621   11.27   0.1428    0.2589    0.5935   16.73   0.0530    0.7213
RefLade      200K        0.4631   10.99   0.1434    0.2547    0.5461   19.84   0.0552    0.6950
RefLade      400K        0.4678   10.66   0.1404    0.2575    0.5792   18.36   0.0493    0.7129
RefLade      1M          0.4685   11.10   0.1377    0.2561    0.5587   17.35   0.0730    0.7190
RefLadeQ     100K        0.4698   10.60   0.1378    0.2531    0.6657   12.99   0.0487    0.7721
RefLade+Q    1.1M        0.4813   10.50   0.1330    0.2652    0.6682   13.14   0.0437    0.7673

Benchmarking RefLade at different training set sizes. Results are reported on the RefLade test set with multimodal text+box prompts


Qualitative Results


Model Comparison

We compare our RefLayer, trained on the RefLade dataset, with the same model trained on the MuLAn dataset and with Google Gemini 3 (Nano Banana Pro). Our model generally produces higher-quality predictions. Even the latest general-purpose generative model (Nano Banana Pro) shows limitations in preservation and completion, and cannot produce true RGBA images with an alpha channel.

RLD vs MuLAn

Comparison between RefLayer trained on the MuLAn dataset and the same model trained on our RefLade dataset

RLD vs Google Gemini 3 (Nano Banana Pro)

Comparison between RefLayer and Google Gemini 3 (Nano Banana Pro)

BibTeX

@inproceedings{rld,
    title     = {Referring Layer Decomposition},
    author    = {Chen, Fangyi and Shen, Yaojie and Xu, Lu and Yuan, Ye and Zhang, Shu and Niu, Yulei and Wen, Longyin},
    booktitle = {ICLR 2026 (Virtual)},
    year      = {2026},
    url       = {https://iclr.cc/virtual/2026/poster/10011003}
}