UniCustom: Unified Visual Conditioning
for Multi-Reference Image Generation

Yiyan Xu1*, Qiulin Wang2†‡, Wenjie Wang1†, Yunyao Mao2,
Xintao Wang2, Pengfei Wan2, Kun Gai2, Fuli Feng1

1University of Science and Technology of China    2Kling Team, Kuaishou Technology

* Work done during internship at Kling Team, Kuaishou Technology.

† Corresponding authors.    ‡ Project Lead.


UniCustom visual overview

Multi-reference image generation aims to synthesize images from textual instructions while faithfully preserving subject identities from multiple reference images. Existing VLM-enhanced diffusion models commonly rely on decoupled visual conditioning: semantic ViT features are processed by the VLM for instruction understanding, whereas appearance-rich VAE features are injected later into the diffusion backbone. Despite its intuitive design, this separation makes it difficult for the model to associate each semantically grounded subject with visual details from the correct reference image, leading to attribute leakage and cross-reference confusion in complex multi-reference settings. To address this issue, we propose UniCustom, a unified visual conditioning framework that fuses ViT and VAE features before VLM encoding. This early fusion exposes the VLM to both semantic cues and appearance-rich details, enabling its hidden states to jointly encode the referred subject and corresponding visual appearance with only a lightweight linear fusion layer. We adopt a two-stage training strategy: reconstruction-oriented pretraining that preserves reference-specific appearance details in the fused hidden states, followed by supervised finetuning on single- and multi-reference generation tasks. We further introduce slot-wise binding regularization that encourages each image slot to preserve low-level details of its corresponding reference. Experiments on two multi-reference generation benchmarks demonstrate that UniCustom consistently improves subject consistency, instruction following, and compositional fidelity over strong baselines.


Architecture

UniCustom introduces a lightweight early-fusion module that injects VAE features into ViT features before VLM encoding, enabling hidden states that are both semantically addressable and appearance-aware.

Model Architecture

Overview of UniCustom. ViT and VAE features are fused before VLM encoding, producing semantically addressable and appearance-aware hidden states for DiT generation.
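To make the early-fusion idea concrete, here is a minimal PyTorch sketch of one way such a fusion layer could look. The module name, feature dimensions, and the concatenate-then-project design are illustrative assumptions; the paper only states that a lightweight linear fusion layer combines ViT and VAE features before VLM encoding.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Fuse per-reference ViT tokens with spatially aligned VAE features
    before they are passed to the (frozen) VLM.

    Dimensions and the concat-then-project design are illustrative
    assumptions, not the released implementation.
    """

    def __init__(self, vit_dim: int = 1152, vae_dim: int = 16, vlm_dim: int = 3584):
        super().__init__()
        # A single linear layer over the concatenated ViT and VAE channels.
        self.fuse = nn.Linear(vit_dim + vae_dim, vlm_dim)

    def forward(self, vit_tokens: torch.Tensor, vae_tokens: torch.Tensor) -> torch.Tensor:
        # vit_tokens: (batch, num_tokens, vit_dim)  semantic features
        # vae_tokens: (batch, num_tokens, vae_dim)  appearance features,
        #             assumed resampled to the same token grid as the ViT
        fused = torch.cat([vit_tokens, vae_tokens], dim=-1)
        return self.fuse(fused)  # (batch, num_tokens, vlm_dim), fed to the VLM


if __name__ == "__main__":
    fusion = EarlyFusion()
    vit = torch.randn(2, 256, 1152)   # e.g. a 16x16 ViT token grid per reference
    vae = torch.randn(2, 256, 16)     # VAE latents pooled to the same grid
    print(fusion(vit, vae).shape)     # torch.Size([2, 256, 3584])
```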

Training Strategy

The VLM is kept frozen throughout both stages; only the lightweight fusion layer and the DiT are optimized, preserving pretrained multimodal understanding while progressively learning fine-grained visual binding.

Training Strategy

Two-stage training strategy. The first stage progressively learns a unified visual representation that supports fine-grained reference encoding, semantic grounding, and reliable textual-to-visual binding through reconstruction-oriented multi-image pretraining. The second stage further adapts the diffusion backbone to reference-based image generation, enabling instruction-following synthesis with single or multiple reference images while preserving the learned grounding and binding abilities.
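As a rough illustration of which parameters are trainable in each stage (per the description above and the stage details below), the sketch freezes the VLM throughout and toggles the fusion layer between stages. The module names (`vlm`, `fusion`, `dit`) are placeholders, not identifiers from the released code.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Enable or disable gradients for every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(vlm: nn.Module, fusion: nn.Module, dit: nn.Module, stage: int) -> None:
    # The VLM stays frozen in both stages to preserve its
    # pretrained multimodal understanding.
    set_trainable(vlm, False)
    if stage == 1:
        # Stage 1: reconstruction-oriented pretraining optimizes
        # the fusion layer together with the DiT.
        set_trainable(fusion, True)
        set_trainable(dit, True)
    else:
        # Stage 2: supervised finetuning freezes the fusion layer
        # and updates only the DiT.
        set_trainable(fusion, False)
        set_trainable(dit, True)
```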

STAGE 01 · PRETRAINING

Unified Representation Learning

  • 20% Multi-image reconstruction
  • 20% Multi-image localization: segmentation
  • 20% Multi-image localization: bounding box
  • 20% Multi-image tiling
  • 20% Multi-image understanding (auxiliary)

Fusion layer + DiT optimized. LR = 5×10⁻⁵, 18K steps.

STAGE 02 · SUPERVISED FINETUNING

Multi-Reference Image Generation

  • 50% multi-reference image generation
  • 25% single-reference generation
  • 10% image editing (auxiliary)
  • 5% text-to-image generation (auxiliary)
  • 10% pretraining tasks (anti-forgetting)

Fusion layer frozen; only DiT optimized. LR = 1×10⁻⁵, 18K steps.
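The per-stage task mixtures and learning rates above can be written down as a simple sampling config. The sketch below is only a plausible way to wire this up; the task names and the per-batch sampling scheme are hypothetical, not the actual training pipeline.

```python
import random

# Task mixtures and learning rates as listed above; the task names are
# illustrative labels, not identifiers from the released code.
STAGE1 = {
    "lr": 5e-5,
    "steps": 18_000,
    "tasks": {
        "multi_image_reconstruction": 0.20,
        "multi_image_localization_segmentation": 0.20,
        "multi_image_localization_bbox": 0.20,
        "multi_image_tiling": 0.20,
        "multi_image_understanding": 0.20,
    },
}
STAGE2 = {
    "lr": 1e-5,
    "steps": 18_000,
    "tasks": {
        "multi_reference_generation": 0.50,
        "single_reference_generation": 0.25,
        "image_editing": 0.10,
        "text_to_image": 0.05,
        "pretraining_tasks": 0.10,
    },
}

def sample_task(stage_cfg: dict) -> str:
    """Draw one task for the next batch according to the stage's mixture."""
    names = list(stage_cfg["tasks"])
    weights = list(stage_cfg["tasks"].values())
    return random.choices(names, weights=weights, k=1)[0]

if __name__ == "__main__":
    print(sample_task(STAGE1), sample_task(STAGE2))
```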


State-of-the-Art Among Open-Source Models

UniCustom achieves the best overall performance among open-source models on both OmniContext and MICo-Bench, with especially strong gains in multi-reference, scene-level, and compositional settings.

▸ OmniContext Benchmark

| Models | Single Char. | Single Obj. | Multiple Char. | Multiple Obj. | Multiple Char.+Obj. | Scene Char. | Scene Obj. | Scene Char.+Obj. | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Closed-source Models | | | | | | | | | |
| GPT-Image-2 | 9.24 | 9.65 | 9.13 | 9.50 | 8.94 | 9.27 | 9.25 | 8.95 | 9.24 |
| Nano Banana 2 | 8.98 | 9.44 | 8.85 | 9.26 | 8.81 | 8.80 | 8.59 | 8.67 | 8.92 |
| Open-source Models | | | | | | | | | |
| FLUX-Kontext [dev] | 7.09 | 7.35 | 2.49 | 5.46 | 4.46 | 2.86 | 3.89 | 3.40 | 4.62 |
| UNO | 7.51 | 7.99 | 4.56 | 7.41 | 6.53 | 4.00 | 5.76 | 5.77 | 6.19 |
| USO | 7.71 | 7.68 | 2.91 | 7.27 | 5.61 | 3.88 | 6.49 | 6.04 | 5.95 |
| BAGEL | 7.12 | 7.70 | 5.86 | 7.07 | 6.79 | 4.98 | 6.05 | 6.03 | 6.45 |
| OmniGen2 | 8.04 | 7.87 | 7.22 | 7.76 | 7.51 | 7.52 | 7.30 | 7.57 | 7.60 |
| Qwen-Image-Edit-2509 | 8.11 | 9.01 | 7.85 | 8.53 | 7.60 | 5.88 | 7.38 | 7.12 | 7.69 |
| LongCat-Image-Edit | 8.24 | 8.63 | 6.69 | 8.07 | 6.99 | 5.80 | 6.73 | 6.98 | 7.26 |
| UniCustom (Ours) | 8.06 | 7.51 | 7.78 | 7.99 | 7.86 | 7.86 | 7.72 | 7.92 | 7.84 |

* "Char." and "Obj." denote "Character" and "Object", respectively.

▸ MICo-Bench

| Model | Object | Person | HOI | De&Re | Overall |
| --- | --- | --- | --- | --- | --- |
| Closed-source Models | | | | | |
| GPT-Image-2 | 66.93 | 60.56 | 59.02 | 62.69 | 61.78 |
| Nano Banana 2 | 68.46 | 65.88 | 64.88 | 70.28 | 67.42 |
| Open-source Models | | | | | |
| FLUX-Kontext [dev] | 21.40 | 14.33 | 12.67 | 7.24 | 12.51 |
| UNO | 42.20 | 15.15 | 27.99 | 41.93 | 32.43 |
| USO | 38.18 | 15.57 | 20.25 | 35.91 | 27.37 |
| BAGEL | 39.23 | 16.77 | 20.99 | 46.75 | 31.62 |
| Qwen-Image-Edit-2509 | 37.33 | 28.01 | 22.06 | 10.53 | 21.67 |
| LongCat-Image-Edit | 40.55 | 19.40 | 24.47 | 14.86 | 22.78 |
| UniCustom (Ours) | 54.30 | 18.12 | 40.51 | 50.29 | 41.71 |

* "HOI" and "De&Re" denote "Human-Object Interaction" and "Decomposition & Recomposition", respectively.


01

Identifying the Grounding–Binding Gap

We identify the grounding–binding gap in VLM-enhanced diffusion models for multi-reference image generation. In existing decoupled conditioning designs, the DiT must implicitly associate VLM-encoded subject semantics with separately injected appearance features, which becomes unreliable with multiple references.

02

Unified Visual Conditioning

We propose UniCustom, a unified visual conditioning framework that makes reference appearances semantically accessible. By fusing ViT and VAE features before VLM encoding, UniCustom produces hidden states that jointly encode the referred subject and its fine-grained visual details, thereby providing the DiT with more explicit semantic–appearance correspondences.

03

Two-Stage Training + Slot-wise Binding Regularization

We introduce a two-stage training strategy with slot-wise binding regularization to progressively learn reference-specific appearance preservation and adapt it to multi-reference generation. The reconstruction-oriented pretraining stage achieves a single-image reconstruction PSNR close to 30 dB, indicating that fused VLM hidden states can serve as an effective conduit for transmitting low-level details from VAE features to the DiT. Extensive experiments on two multi-reference image generation benchmarks further demonstrate that UniCustom outperforms existing methods.
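The exact form of the slot-wise binding regularizer is not spelled out here, but one plausible instantiation is a per-slot reconstruction penalty: the hidden states of each image slot are projected back toward the VAE latent of their own reference, so low-level details stay bound to the correct slot. The decoder head and the mean-squared-error form in the sketch below are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotBindingReg(nn.Module):
    """Illustrative slot-wise binding regularizer.

    For each reference slot, a small decoder maps the slot's fused VLM
    hidden states back to that reference's VAE latent; the MSE between
    the two ties low-level appearance to the correct slot. The decoder
    and loss form are assumptions, not the paper's exact formulation.
    """

    def __init__(self, vlm_dim: int = 3584, vae_dim: int = 16):
        super().__init__()
        self.decode = nn.Linear(vlm_dim, vae_dim)

    def forward(self, slot_hidden: torch.Tensor, ref_latents: torch.Tensor) -> torch.Tensor:
        # slot_hidden: (batch, num_refs, num_tokens, vlm_dim) hidden states per slot
        # ref_latents: (batch, num_refs, num_tokens, vae_dim) matching VAE latents
        pred = self.decode(slot_hidden)
        # MSE averaged over batch, slots, and tokens.
        return F.mse_loss(pred, ref_latents)

if __name__ == "__main__":
    reg = SlotBindingReg()
    h = torch.randn(2, 3, 256, 3584)   # three reference slots
    z = torch.randn(2, 3, 256, 16)
    print(reg(h, z).item())
```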


If you find UniCustom useful in your research, please consider citing our paper.

BibTeX
@article{xu2026unicustom,
  title={UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation},
  author={Xu, Yiyan and Wang, Qiulin and Wang, Wenjie and Mao, Yunyao and Wang, Xintao and Wan, Pengfei and Gai, Kun and Feng, Fuli},
  journal={arXiv preprint arXiv:2605.12088},
  year={2026}
}