Referring Layer Decomposition

Intelligent Editing Team, Intelligent Creation, ByteDance Inc.

Teaser figure: given an original image and a user prompt, RefLayer produces the layered result.

Abstract

Precise, object-aware control over visual content is essential for advanced image editing and compositional generation. Yet most existing approaches operate on entire images holistically, limiting the ability to isolate and manipulate individual scene elements. In contrast, layered representations, where scenes are explicitly separated into objects, environmental context, and visual effects, provide a more intuitive and structured framework for interpreting and editing visual content. To bridge this gap and enable both compositional understanding and controllable editing, we introduce the Referring Layer Decomposition (RLD) task, which predicts complete RGBA layers from a single RGB image, conditioned on flexible user prompts such as spatial inputs (e.g., points, boxes, masks), natural-language descriptions, or combinations thereof. At its core is RefLade, a large-scale dataset comprising 1.11M image–layer–prompt triplets produced by our scalable data engine, along with 100K manually curated, high-fidelity layers. Coupled with a perceptually grounded, human-preference-aligned automatic evaluation protocol, RefLade establishes RLD as a well-defined and benchmarkable research task. Building on this foundation, we present RefLayer, a simple baseline for prompt-conditioned layer decomposition that achieves high visual fidelity and semantic alignment. Extensive experiments show that our approach enables effective training, reliable evaluation, and high-quality image decomposition, while exhibiting strong zero-shot generalization.

RefLade: Data Engine, Dataset and Evaluation Protocol

Data Engine

We introduce a scalable and modular automated data engine capable of generating diverse, realistic, and high-fidelity RGBA layers from natural images at scale.

Overview of the data engine

Dataset

Image source

Dataset     Task     # Images   Avg. Resolution   # Cls   # Instances   Occlusion Rate   Image Source
SAIL-VOS    Amodal    111,654   800×1280            162     1,896,296   56.3%            Synthetic
OVD         Amodal     34,100   500×375             196             -   -                Real
WALT        Amodal        15M   -                     2           36M   -                Real
AHP         Amodal     56,599   -                     1        56,599   -                Real
DYCE        Amodal      5,500   1000×1000            79        85,975   27.7%            Real
OMLD        Amodal     13,000   384×512              40             -   -                Synthetic
CSD         Amodal     11,434   512×512              40       129,336   26.3%            Synthetic
MuLAn       LD         44,860   -                   759       101,269   7.7%             Real
RefLade     RLD       430,488   1831×1437           12K       871,829   60.8%            Real

Comparison of RefLade with related existing datasets

Instance distribution of RefLade Dataset

Evaluation Protocol

The Human Preference Aligned (HPA) Score

Following human judgment, we evaluate the decomposition quality from three aspects:

Aspect 1: Preservation. Preserving original visible content.

\[\mathcal{S}_{\text{vis}} = \mathbb{E}_{(p, g) \sim \mathcal{D}} [ \text{LPIPS}(g_{\text{rgb}} \odot g_v,\, p_{\text{rgb}} \odot g_v) ] \]
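The preservation term can be sketched in a few lines of numpy. Here mean-squared error is used purely as a stand-in for LPIPS (which requires a pretrained network); the masking logic is the point of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def preservation_score(pred_rgb, gt_rgb, gt_visible_mask):
    """Toy stand-in for S_vis: compare prediction and ground truth only
    inside the ground-truth visible region g_v. Mean-squared error
    replaces LPIPS here purely for illustration; the actual metric
    uses a pretrained LPIPS network."""
    mask = gt_visible_mask[..., None]          # broadcast mask over RGB channels
    sq_diff = (pred_rgb * mask - gt_rgb * mask) ** 2
    n_visible = mask.sum() * 3                 # visible pixels × channels
    return float(sq_diff.sum() / max(n_visible, 1))

rgb = rng.random((8, 8, 3))
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0
print(preservation_score(rgb, rgb, mask))      # → 0.0
```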

Aspect 2: Completion. Generating reasonable completions for the occluded regions.

\[\mathcal{S}_{\text{gen}} = \mathbb{E}_{(p, g) \sim \mathcal{D}} \left[\cos\left( f(g_{\text{rgb}}) - f(g_{\text{rgb}} \odot g_v), \,f(p_{\text{rgb}}) - f(g_{\text{rgb}} \odot g_v) \right)\right]\]
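The completion term compares feature-space displacements. A minimal numpy sketch, with plain vectors standing in for the features of a semantic image encoder f(·):

```python
import numpy as np

def completion_score(feat_gt_full, feat_gt_visible, feat_pred_full):
    """Toy version of S_gen: cosine similarity between the feature
    displacement added by the ground-truth completion,
    f(g_rgb) - f(g_rgb ⊙ g_v), and the one added by the prediction,
    f(p_rgb) - f(g_rgb ⊙ g_v)."""
    d_gt = feat_gt_full - feat_gt_visible
    d_pred = feat_pred_full - feat_gt_visible
    denom = np.linalg.norm(d_gt) * np.linalg.norm(d_pred)
    return float(np.dot(d_gt, d_pred) / denom) if denom > 0 else 0.0

vis = np.zeros(3)
gt = np.array([1.0, 0.0, 0.0])
print(completion_score(gt, vis, 2 * gt))   # same direction → 1.0
```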

Aspect 3: Faithfulness. The distributional similarity between predictions and ground-truth layers.

\[\hat{p} = p_{\text{rgb}} \odot p_a + i_{\text{bkgd}} \odot (1 - p_a), \quad \hat{g} = g_{\text{rgb}} \odot g_a + i_{\text{bkgd}} \odot (1 - g_a)\] \[\mathcal{S}_{\text{fid}} = \text{FID}\left( \left\{ \hat{p} \mid p \in \mathcal{D} \right\}, \left\{ \hat{g} \mid g \in \mathcal{D} \right\} \right)\]
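The compositing step used before computing FID is plain alpha blending onto a shared background i_bkgd; a minimal numpy sketch:

```python
import numpy as np

def composite_on_background(rgb, alpha, bkgd):
    """Alpha-blend an RGBA layer onto a fixed background image, as in
    the S_fid compositing step: hat = rgb ⊙ a + bkgd ⊙ (1 - a).
    rgb and bkgd have shape (H, W, 3); alpha has shape (H, W) in [0, 1]."""
    a = alpha[..., None]                # broadcast alpha over channels
    return rgb * a + bkgd * (1.0 - a)

fg = np.ones((2, 2, 3))
bg = np.zeros((2, 2, 3))
print(composite_on_background(fg, np.full((2, 2), 0.5), bg)[0, 0])  # → [0.5 0.5 0.5]
```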

Aggregation. We apply min-max normalization to each metric and then average them to produce the final HPA score.
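The aggregation can be sketched as follows. Inverting the lower-is-better metrics (LPIPS-based S_vis and FID-based S_fid) after normalization is our assumption, consistent with their definitions; S_gen, a cosine similarity, is higher-is-better.

```python
import numpy as np

def hpa_score(s_vis, s_gen, s_fid):
    """Toy HPA aggregation over a set of models: min-max normalize each
    metric across models, then average. Orientation handling (flipping
    the lower-is-better metrics) is an illustrative assumption."""
    def minmax(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)

    vis = 1.0 - minmax(s_vis)   # lower LPIPS is better
    gen = minmax(s_gen)         # higher cosine similarity is better
    fid = 1.0 - minmax(s_fid)   # lower FID is better
    return (vis + gen + fid) / 3.0

print(hpa_score([0.1, 0.2], [0.9, 0.5], [10.0, 20.0]))  # → [1. 0.]
```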

Alignment with Human Preference

Human ELO vs. the HPA Score

Metric                  HPA    Svis   Sgen   Sfid   Svis+Sgen   Sfid+Sgen   Svis+Sfid
Pearson correlation     0.96   0.90   0.96   0.94   0.95        0.95        0.94
Spearman correlation    1.00   0.60   0.98   0.67   0.92        0.97        1.00

Pearson and Spearman correlations with human ELO across different metrics


RefLayer: A Baseline Model

To establish a baseline, we formulate RLD as a conditional image generation problem and employ two decoders—a standard RGB decoder and a custom alpha decoder—to reconstruct the RGB content and the alpha transparency mask from the latent representation.
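The dual-decoder readout above can be sketched as a toy model, with random linear maps standing in for the actual RGB and alpha decoders; all shapes and names here are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyLayerDecoder:
    """Minimal sketch of the dual-decoder readout: one head maps the
    shared latent to RGB content, a separate head maps the same latent
    to an alpha mask. Random linear maps stand in for the real
    decoders; shapes and names are illustrative."""
    def __init__(self, latent_dim=16, hw=4):
        self.hw = hw
        self.w_rgb = rng.standard_normal((latent_dim, hw * hw * 3))
        self.w_alpha = rng.standard_normal((latent_dim, hw * hw))

    def decode(self, z):
        rgb = (z @ self.w_rgb).reshape(self.hw, self.hw, 3)
        # A sigmoid keeps the predicted alpha mask in [0, 1].
        alpha = 1.0 / (1.0 + np.exp(-(z @ self.w_alpha)))
        return rgb, alpha.reshape(self.hw, self.hw)

z = rng.standard_normal(16)
rgb, alpha = ToyLayerDecoder().decode(z)
print(rgb.shape, alpha.shape)   # → (4, 4, 3) (4, 4)
```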

RefLayer model architecture.

                         Foreground                           Background
Dataset      # Layers    HPA ↑    FID ↓   LPIPS ↓   DIR ↑     HPA ↑    FID ↓   LPIPS ↓   DIR ↑
MuLAn        50K         0.3852   22.68   0.1403    0.2031    0.3459   21.84   0.1588    0.6385
RefLade      50K         0.4629   10.98   0.1411    0.2543    0.5932   16.87   0.0520    0.7206
RefLade      100K        0.4621   11.27   0.1428    0.2589    0.5935   16.73   0.0530    0.7213
RefLade      200K        0.4631   10.99   0.1434    0.2547    0.5461   19.84   0.0552    0.6950
RefLade      400K        0.4678   10.66   0.1404    0.2575    0.5792   18.36   0.0493    0.7129
RefLade      1M          0.4685   11.10   0.1377    0.2561    0.5587   17.35   0.0730    0.7190
RefLadeQ     100K        0.4698   10.60   0.1378    0.2531    0.6657   12.99   0.0487    0.7721
RefLade+Q    1.1M        0.4813   10.50   0.1330    0.2652    0.6682   13.14   0.0437    0.7673

Benchmarking RefLade at different training set sizes. Results are reported on the RefLade test set with multimodal text+box prompts


Qualitative Results


Model Comparison

We compare our RefLayer, trained on the RefLade dataset, with the same model trained on the MuLAn dataset and with Google Gemini 3 (Nano Banana Pro). Our model generally produces higher-quality predictions. Even the latest general-purpose generative model (Nano Banana Pro) shows limitations in preservation and completion, and cannot produce true RGBA images with an alpha channel.

RLD vs MuLAn

Comparison between RefLayer trained on the MuLAn dataset and the same model trained on our RefLade dataset

RLD vs Google Gemini 3 (Nano Banana Pro)

Comparison between RefLayer and Google Gemini 3 (Nano Banana Pro)

BibTeX

@inproceedings{rld,
    title     = {Referring Layer Decomposition},
    author    = {Chen, Fangyi and Shen, Yaojie and Xu, Lu and Yuan, Ye and Zhang, Shu and Niu, Yulei and Wen, Longyin},
    booktitle = {ICLR 2026 (Virtual)},
    year      = {2026},
    url       = {https://iclr.cc/virtual/2026/poster/10011003}
}