Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation

Supplementary Website for ECCV 2024

The Mixture-of-Attention (MoA) architecture enables multi-subject personalized generation with subject-context disentanglement, without any predefined layout.

Interactive figure: drag between the start and end frames to traverse the initial random noise, which changes the context consistently across different subject pairs.


Abstract

We introduce a new architecture for personalization of text-to-image diffusion models, coined Mixture-of-Attention (MoA). Inspired by the Mixture-of-Experts mechanism utilized in large language models (LLMs), MoA distributes the generation workload between two attention pathways: a personalized branch and a non-personalized prior branch. MoA is designed to retain the original model's prior by fixing its attention layers in the prior branch, while minimally intervening in the generation process with the personalized branch, which learns to embed subjects in the layout and context generated by the prior branch. A routing mechanism manages the distribution of pixels in each layer across these branches to optimize the blend of personalized and generic content creation. Once trained, MoA facilitates the creation of high-quality, personalized images featuring multiple subjects with compositions and interactions as diverse as those generated by the original model. Crucially, MoA enhances the distinction between the model's pre-existing capability and the newly augmented personalized intervention, thereby offering more disentangled subject-context control than was previously attainable.

Mixture-of-Attention

Our key observation is that existing personalization methods often need to trade off between "prior preservation" for better prompt consistency and "personalization finetuning" for identity fidelity. We augment the attention layer with an architecture inspired by Mixture-of-Experts (MoE), in which a router distributes tasks among different experts. In our case, we keep a "prior expert" frozen during finetuning to preserve the prior, and finetune a personalized expert.

Fig2.
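
To make the design concrete, here is a minimal PyTorch sketch of one such layer (not our exact implementation; the class name, the router width, and the attention call signature are illustrative assumptions): a frozen prior expert, a trainable personalized copy of it, and a per-pixel router whose soft weights blend the two outputs.

```python
import copy

import torch.nn as nn


class MoALayer(nn.Module):
    """Minimal sketch of a Mixture-of-Attention layer: a frozen "prior"
    attention expert, a trainable "personalized" copy, and a lightweight
    per-pixel router that softly blends their outputs."""

    def __init__(self, prior_attn: nn.Module, dim: int):
        super().__init__()
        self.prior_attn = prior_attn                    # pretrained attention, kept frozen
        for p in self.prior_attn.parameters():
            p.requires_grad_(False)
        self.personal_attn = copy.deepcopy(prior_attn)  # initialized from the prior, then finetuned
        self.router = nn.Sequential(                    # 2-way soft assignment per pixel
            nn.Linear(dim, dim // 4),
            nn.GELU(),
            nn.Linear(dim // 4, 2),
        )

    def forward(self, hidden_states, encoder_hidden_states=None):
        # hidden_states: (batch, num_pixels, dim) flattened spatial features
        out_prior = self.prior_attn(hidden_states, encoder_hidden_states)
        out_personal = self.personal_attn(hidden_states, encoder_hidden_states)
        weights = self.router(hidden_states).softmax(dim=-1)  # (batch, num_pixels, 2)
        # blend the two experts pixel by pixel according to the router
        return weights[..., 0:1] * out_prior + weights[..., 1:2] * out_personal
```

Because the prior expert stays frozen, the pretrained model's layout and context generation are preserved by construction; only the personalized expert and the router receive gradients.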

This MoA layer is inserted at every attention layer of a pretrained U-Net and finetuned on a small dataset (FFHQ). In the paper, we also discuss how the router is trained, and propose a regularization term that encourages the personalization branch to minimally affect the overall image.

Fig3.
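
The exact form of the objective is given in the paper; the snippet below is only a rough sketch, assuming a foreground (subject) mask is available during finetuning, of how such a regularizer could look: routing background pixels to the personalized branch is penalized, so most of the image is left to the frozen prior.

```python
import torch


def router_background_penalty(router_weights: torch.Tensor,
                              fg_mask: torch.Tensor,
                              reg_weight: float = 1e-3) -> torch.Tensor:
    """Hypothetical regularizer: discourage routing background pixels to the
    personalized branch so that it minimally affects the overall image.

    router_weights: (batch, num_pixels, 2) soft assignments; index 1 = personalized branch
    fg_mask:        (batch, num_pixels) binary subject mask (1 = foreground)
    """
    personal_weight = router_weights[..., 1]
    background = 1.0 - fg_mask.float()
    # mean personalized-branch weight over background pixels only
    penalty = (personal_weight * background).sum() / background.sum().clamp(min=1.0)
    return reg_weight * penalty
```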

When we visualize the router predictions of MoA, we can see that the routers assign the background pixels to the prior branch and most of the foreground pixels to the personalization branch. This behavior explains why MoA achieves disentangled subject-context control. The detailed behavior of the router differs across layers and timesteps, which allows the personalization branch to focus on different regions within the subject (e.g., the face, the body, and so on) at different timesteps.

Fig3.
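
One simple way to produce such a visualization (a sketch with assumed names and shapes, not our exact script) is to take each layer's per-pixel weight on the personalized branch, reshape it to that layer's spatial grid, and average the upsampled maps into a single heatmap; bright regions correspond to pixels handled by the personalized branch.

```python
import torch
import torch.nn.functional as F


def aggregate_router_maps(per_layer_weights, out_size=(64, 64)):
    """Average personalized-branch router weights collected from MoA layers
    (optionally also across denoising timesteps) into one heatmap.

    per_layer_weights: list of tensors of shape (num_pixels, 2); layers run at
        different resolutions, so each map is reshaped to its own square grid
        and upsampled to a common size before averaging.
    """
    maps = []
    for w in per_layer_weights:
        side = int(w.shape[0] ** 0.5)              # assume square feature maps
        m = w[:, 1].reshape(1, 1, side, side)      # weight on the personalized branch
        m = F.interpolate(m, size=out_size, mode="bilinear", align_corners=False)
        maps.append(m)
    return torch.cat(maps, dim=0).mean(dim=0)[0]   # (H, W) heatmap in [0, 1]
```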

Novel Capabilities

Identity Swap (Face and Body)

MoA is able to handle different body shapes. Notice that in the Yokozuna results (3rd row), he blocks the background completely, whereas for the DALL·E 3-generated man (last row), we can see through the gap between his arm and body.

Teaser.


Consistent Character Storytelling

MoA enables storytelling with consistent characters! 🤖 With the rise of AI-generated (AIGC) characters, creating a story with a consistent character across frames remains challenging. With MoA, this is simple, and we can also easily combine characters to enrich the story!

Teaser.

More Results